mldsai

shtrausslearning.github.io

pyspark

August 21, 2023
in pyspark
7 min read

Training ML Models with PySpark

In this post, we continue our introduction to pyspark, this time training machine learning models

This post continues from the previous one, PySpark Titanic Preprocessing, where we did some basic data preprocessing; here we move on to the modeling stage of the project
We will use pyspark.ml.classification to train binary classification models
There are quite a few differences from pandas; for example, a VectorAssembler is needed to combine all feature columns into a single vector column

August 20, 2023
in pyspark
6 min read

Data Preprocessing with PySpark

In this post, we will introduce ourselves to pyspark, a framework that allows us to work with big data

We'll look at how to start a spark_session
We'll set up data types for the dataset using StructType
This post focuses on data preparation in the preprocessing stage