
PySpark Daily Summary II

Continuing where we left off in the last post, I'll be exploring pyspark on a daily basis, just to get more used to it. Here I will be posting summaries that cover roughly 10 days' worth of posts that I make on Kaggle, which equates to about three posts a month.

PySpark Daily Summary I

Something I decided would be fun to do on a daily basis: write pyspark code every day and post about it. The main motivation is that I don't use pyspark as often as I would like. If you want to join in, just fork the notebook (on Kaggle) and practice various pyspark exercises every day! Visit my telegram channel if you have any questions, or just post them here!

Here I will be posting summaries that cover roughly 10 days' worth of posts that I make on Kaggle, which equates to about three posts a month.

Utilising Prophet with PySpark

In this notebook, we look at how to use the popular machine learning library prophet within the pyspark architecture. pyspark itself unfortunately does not include such an additive regression model; however, we can use user defined functions (UDFs), which allow us to apply functionality from other libraries that is not available in pyspark.
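A minimal sketch of one common way to do this: a grouped-map pandas UDF (`applyInPandas`) that fits a Prophet model per group. The column names (`store_id`, `ds`, `y`), the `sales_df` DataFrame and the 30-day horizon are illustrative assumptions, not the exact setup of the original notebook.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType
from prophet import Prophet

spark = SparkSession.builder.appName("prophet-udf").getOrCreate()

# Schema of the forecast DataFrame returned for each group
result_schema = StructType([
    StructField("store_id", StringType()),
    StructField("ds", TimestampType()),
    StructField("yhat", DoubleType()),
])

def forecast_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit Prophet on one group's history (ds, y) and forecast 30 days ahead."""
    model = Prophet()
    model.fit(pdf[["ds", "y"]])
    future = model.make_future_dataframe(periods=30)
    forecast = model.predict(future)[["ds", "yhat"]]
    forecast["store_id"] = pdf["store_id"].iloc[0]
    return forecast[["store_id", "ds", "yhat"]]

# sales_df is assumed to have columns: store_id, ds (date), y (value)
# forecasts = sales_df.groupBy("store_id").applyInPandas(forecast_group, schema=result_schema)
```

The grouped-map approach lets each executor fit an independent Prophet model on its group's data, so many small forecasts can run in parallel across the cluster.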

Hyperparameter Tuning with Pipelines


This post is the last of three posts on the Titanic classification problem in pyspark.

  • In the last post, we started with a cleaned dataset, which we prepared for machine learning by utilising StringIndexer & VectorAssembler, and then moved on to the model training stage itself.
  • These steps form a series of stages in the construction of a model, which we can group into a single pipeline. Like sklearn, pyspark has a Pipeline class that helps us keep things organised; a sketch combining it with hyperparameter tuning follows this list.
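Below is a minimal sketch, assuming a Titanic-style DataFrame `train_df`; the feature columns and the parameter grid are illustrative assumptions rather than the exact ones used in the original post.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Group the preprocessing and modelling stages into a single Pipeline
indexer = StringIndexer(inputCol="Sex", outputCol="SexIndex")
assembler = VectorAssembler(inputCols=["SexIndex", "Age", "Fare", "Pclass"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="Survived")
pipeline = Pipeline(stages=[indexer, assembler, lr])

# Grid of hyperparameters to search over
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.1, 1.0])
              .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
              .build())

evaluator = BinaryClassificationEvaluator(labelCol="Survived")

# 3-fold cross-validation over the whole pipeline, so every fold
# re-runs indexing and assembling as well as model fitting
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)

# cv_model = cv.fit(train_df)
# predictions = cv_model.bestModel.transform(test_df)
```

Wrapping the whole pipeline in the CrossValidator keeps the feature-engineering stages inside the tuning loop, which avoids leaking information between folds.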

Training ML Models with PySpark


In this post, we continue getting familiar with pyspark.

  • We are continuing on from the previous post, PySpark Titanic Preprocessing, where we did some basic data preprocessing; here we move on to the modeling stage of our project
  • We will be using spark.ml.classification to train binary classification models
  • There are quite a few differences from pandas, for example the use of VectorAssembler, which combines all feature columns into a single vector column; see the sketch after this list
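A minimal sketch of the training flow, using a tiny made-up stand-in for the preprocessed Titanic data; the column names and logistic regression model are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("titanic-lr").getOrCreate()

# Tiny illustrative stand-in for the preprocessed Titanic data
train_df = spark.createDataFrame(
    [("male", 22.0, 7.25, 3, 0),
     ("female", 38.0, 71.28, 1, 1),
     ("female", 26.0, 7.92, 3, 1),
     ("male", 35.0, 8.05, 3, 0)],
    ["Sex", "Age", "Fare", "Pclass", "Survived"],
)

# Encode the string column as a numeric index
train_df = StringIndexer(inputCol="Sex", outputCol="SexIndex").fit(train_df).transform(train_df)

# Unlike pandas/sklearn, spark.ml expects all features combined in a single vector column
assembler = VectorAssembler(inputCols=["SexIndex", "Age", "Fare", "Pclass"],
                            outputCol="features")
train_df = assembler.transform(train_df)

# Fit a binary classifier from spark.ml.classification
model = LogisticRegression(featuresCol="features", labelCol="Survived").fit(train_df)
model.transform(train_df).select("Survived", "prediction", "probability").show()
```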

Data Preprocessing with PySpark


In this post, we will introduce ourselves to pyspark, a framework that allows us to work with big data

  • We'll look at how to start a SparkSession
  • We'll set up data types for the dataset using StructType
  • This post focuses on data preparation in the preprocessing stage; a minimal setup sketch follows this list
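A minimal sketch of that setup, assuming the Kaggle Titanic CSV; the file path and the column subset are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Start (or reuse) a spark session
spark = SparkSession.builder.appName("titanic-preprocessing").getOrCreate()

# Declare the data type of each column instead of relying on schema inference
schema = StructType([
    StructField("PassengerId", IntegerType(), True),
    StructField("Survived", IntegerType(), True),
    StructField("Pclass", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Sex", StringType(), True),
    StructField("Age", DoubleType(), True),
    StructField("Fare", DoubleType(), True),
])

# Read the data with the explicit schema (path is a placeholder)
# df = spark.read.csv("train.csv", header=True, schema=schema)
# df.printSchema()
```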