pyspark

Training ML Models with PySpark


In this post, we will introduce ourselves to pyspark

  • We continue from the previous post, PySpark Titanic Preprocessing, where we did some basic data preprocessing; here we move on to the modeling stage of the project
  • We will be using pyspark.ml.classification to train binary classification models
  • There are quite a number of differences from pandas, for example the need to build a VectorAssembler column, which combines all feature columns into a single vector column

Data Preprocessing with PySpark


In this post, we will introduce ourselves to pyspark, a framework that allows us to work with big data

  • We'll look at how to start a spark_session
  • Setting up data types for the dataset using StructType
  • This post focuses on data preparation in the preprocessing stage