PySpark Basics

A brief collection of simple PySpark operations; a short example sketch for each topic follows the list below.

  • Select, Drop, Rename Columns


    • Column selection
    • Dropping columns
    • Renaming of columns
  • Pipelines


    • Missing data treatment classification pipeline
    • Feature scaling using StandardScaler classification pipeline
    • TF-IDF corpus classification pipeline
    • PCA dimensionality reduction classification pipeline
  • Data Filtration


    • Filtration by column value
    • String related filtration using like / contains
    • Missing data filtration
    • List based filtration using isin
    • General data cleaning operations
  • Aggregations


    • Aggregations without grouping
    • Aggregations with grouping
    • Filtering after grouping
  • Pivoting


    • Basic pivot operation
    • Pivot with multiple aggregations
    • Conditional pivoting
    • Pivoting with specified column values
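
The sketches below illustrate each topic in turn; all DataFrames and column names are hypothetical toy examples. First, selecting, dropping, and renaming columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("basics").getOrCreate()
df = spark.createDataFrame([(1, "Ana", 34), (2, "Ben", 28)], ["id", "name", "age"])

df.select("name", "age").show()                    # column selection
df.drop("age").show()                              # dropping a column
df.withColumnRenamed("name", "full_name").show()   # renaming a column
```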
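
A minimal sketch of one of the classification pipelines above, chaining missing data treatment (Imputer), feature scaling (StandardScaler), and a classifier; the toy data are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, 0), (None, 3.0, 1), (4.0, None, 1), (2.0, 2.5, 0)],
    ["f1", "f2", "label"],
)

imputer = Imputer(inputCols=["f1", "f2"], outputCols=["f1_i", "f2_i"])    # missing data treatment
assembler = VectorAssembler(inputCols=["f1_i", "f2_i"], outputCol="raw")  # pack features into one vector
scaler = StandardScaler(inputCol="raw", outputCol="features")             # feature scaling
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[imputer, assembler, scaler, lr]).fit(df)
model.transform(df).select("features", "prediction").show()
```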
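
A sketch of the filtration patterns, again with hypothetical data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Alice", "NY"), (2, "Bob", "LA"), (3, None, None)],
    ["id", "name", "city"],
)

df.filter(df.id > 1).show()                 # filtration by column value
df.filter(df.name.like("A%")).show()        # string filtration using like
df.filter(df.name.contains("li")).show()    # string filtration using contains
df.filter(F.col("city").isNull()).show()    # missing data filtration
df.filter(df.city.isin("NY", "LA")).show()  # list based filtration using isin
df.dropna(subset=["name", "city"]).show()   # general cleaning: drop incomplete rows
```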
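
A sketch of the aggregation patterns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 10), ("a", 20), ("b", 5)], ["grp", "val"])

# aggregation without grouping
df.agg(F.sum("val").alias("total"), F.avg("val").alias("mean")).show()

# aggregation with grouping
df.groupBy("grp").agg(F.count("val").alias("n"), F.max("val").alias("max_val")).show()

# filtering after grouping (the SQL HAVING pattern)
df.groupBy("grp").agg(F.sum("val").alias("total")).filter(F.col("total") > 10).show()
```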
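
And a sketch of the pivot operations; the year/quarter/sales columns are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2023", "Q1", 100), ("2023", "Q2", 150), ("2024", "Q1", 120)],
    ["year", "quarter", "sales"],
)

# basic pivot operation
df.groupBy("year").pivot("quarter").sum("sales").show()

# pivot with multiple aggregations
df.groupBy("year").pivot("quarter").agg(F.sum("sales"), F.avg("sales")).show()

# conditional pivoting via when()
df.groupBy("year").pivot("quarter").agg(F.sum(F.when(df.sales > 110, df.sales))).show()

# pivoting with specified column values (skips the distinct-values scan)
df.groupBy("year").pivot("quarter", ["Q1", "Q2"]).sum("sales").show()
```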

Basic Classification Project

A classification project for beginners that shows how to utilise PySpark in a machine learning project; example sketches follow the list below.

  • Data Preprocessing with PySpark


    • How to start a SparkSession
    • Setting up data types for the dataset using StructType
    • Focuses on data preparation in the preprocessing stage
  • Training ML Models with PySpark


    • Using pyspark.ml.classification to train binary classification models
    • Introduction to StringIndexer, VectorAssembler
    • Splitting dataset into subsets using .randomSplit
    • Saving & loading models
  • Hyperparameter Tuning with Pipelines


    • Using pyspark.ml.Pipeline to introduce a compact training approach
    • Saving & loading pipelines
    • Model evaluation using MulticlassClassificationEvaluator
    • Hyperparameter optimisation with pyspark.ml.tuning
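
A minimal sketch of the preprocessing step: starting a SparkSession and declaring an explicit schema with StructType. The file path and columns are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DoubleType)

# start a SparkSession
spark = SparkSession.builder.appName("classification").getOrCreate()

# set up the data types explicitly instead of relying on inference
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("category", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("label", IntegerType(), True),
])

df = spark.read.csv("data/train.csv", schema=schema, header=True)  # hypothetical path
df.printSchema()
```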
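
A sketch of the training step: StringIndexer and VectorAssembler to prepare features, .randomSplit for the train/test split, and model persistence. The toy data and save path are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("yes", 1.0, 2.0), ("no", 3.0, 0.5), ("yes", 2.5, 1.5),
     ("no", 0.4, 0.1), ("yes", 1.8, 2.2), ("no", 3.2, 0.3)],
    ["outcome", "f1", "f2"],
)

# StringIndexer: encode the string target as a numeric label
df = StringIndexer(inputCol="outcome", outputCol="label").fit(df).transform(df)

# VectorAssembler: pack the feature columns into a single vector
df = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

# split the dataset into subsets using .randomSplit
train, test = df.randomSplit([0.8, 0.2], seed=42)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

# saving & loading the fitted model
model.write().overwrite().save("models/lr")
reloaded = LogisticRegressionModel.load("models/lr")
```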
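
Finally, a sketch of hyperparameter tuning with a Pipeline, CrossValidator from pyspark.ml.tuning, and MulticlassClassificationEvaluator; the data and grid values are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, 1.0, 2.0), (1, 3.0, 0.5), (0, 2.5, 1.5), (1, 0.4, 0.1),
     (0, 1.8, 2.2), (1, 3.2, 0.3), (0, 0.9, 1.9), (1, 2.8, 0.4)],
    ["label", "f1", "f2"],
)
train, test = df.randomSplit([0.75, 0.25], seed=7)

# a compact training approach: assembler + classifier in one Pipeline
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# grid of candidate hyperparameters for the classifier
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train)
print("test accuracy:", evaluator.evaluate(cv_model.transform(test)))

# saving & loading the best fitted pipeline (hypothetical path)
cv_model.bestModel.write().overwrite().save("models/best_pipeline")
reloaded = PipelineModel.load("models/best_pipeline")
```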