PySpark Basics
A brief collection of simple PySpark operations, grouped by topic
- Column selection
- Dropping columns
- Renaming columns
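A minimal sketch of these three operations; the DataFrame and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("basics").getOrCreate()
df = spark.createDataFrame(
    [(1, "Alice", 34), (2, "Bob", 45)],
    ["id", "name", "age"],
)

# Column selection
df.select("name", "age").show()

# Dropping columns
df.drop("id").show()

# Renaming a column
df.withColumnRenamed("name", "full_name").show()
```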
- Missing data treatment classification pipeline
- Feature scaling using StandardScaler classification pipeline
- TF-IDF corpus classification pipeline
- PCA dimensionality reduction classification pipeline
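These notebooks all share the same pipeline pattern. As a rough sketch, the example below chains missing-value imputation and StandardScaler scaling in front of a classifier; the column names (`f1`, `f2`, `label`) and the training DataFrame `train_df` are placeholders:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import Imputer, StandardScaler, VectorAssembler

# Impute missing numeric values, assemble them into a vector,
# scale the vector, then fit a classifier
imputer = Imputer(inputCols=["f1", "f2"], outputCols=["f1_i", "f2_i"])
assembler = VectorAssembler(inputCols=["f1_i", "f2_i"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[imputer, assembler, scaler, lr])
model = pipeline.fit(train_df)  # train_df: DataFrame with f1, f2, label columns
```

The TF-IDF and PCA pipelines follow the same shape, swapping HashingTF/IDF or PCA in as the feature stage.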
- Filtration by column value
- String-related filtration using like / contains
- Missing data filtration
- List based filtration using isin
- General data cleaning operations
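The core filtration idioms, sketched on a hypothetical `df` with `name` and `age` columns:

```python
from pyspark.sql import functions as F

# Filtration by column value
df.filter(F.col("age") > 30)

# String filtration with like / contains
df.filter(F.col("name").like("A%"))
df.filter(F.col("name").contains("li"))

# Missing-data filtration: keep null rows, or drop them
df.filter(F.col("age").isNull())
df.na.drop(subset=["age"])

# List-based filtration with isin
df.filter(F.col("name").isin(["Alice", "Bob"]))
```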
- Aggregations without grouping
- Aggregations with grouping
- Filtering after grouping
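A sketch of all three, assuming a `df` with `dept` and `age` columns:

```python
from pyspark.sql import functions as F

# Aggregation without grouping: one summary row for the whole DataFrame
df.agg(F.avg("age").alias("avg_age"), F.max("age").alias("max_age")).show()

# Aggregation with grouping
df.groupBy("dept").agg(F.count("*").alias("n"), F.avg("age").alias("avg_age")).show()

# Filtering after grouping (the SQL HAVING pattern)
(df.groupBy("dept")
   .agg(F.avg("age").alias("avg_age"))
   .filter(F.col("avg_age") > 40)
   .show())
```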
- Basic pivot operation
- Pivot with multiple aggregations
- Conditional pivoting
- Pivoting with specified column values
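Sketched on an assumed sales DataFrame with `dept`, `year`, `region`, and `sales` columns:

```python
from pyspark.sql import functions as F

# Basic pivot: one column per distinct year
df.groupBy("dept").pivot("year").agg(F.sum("sales"))

# Pivot with multiple aggregations
df.groupBy("dept").pivot("year").agg(F.sum("sales"), F.avg("sales"))

# Conditional pivot: only aggregate rows matching a condition
df.groupBy("dept").pivot("year").agg(
    F.sum(F.when(F.col("region") == "EU", F.col("sales")))
)

# Pivot with specified column values (avoids a scan for distinct values)
df.groupBy("dept").pivot("year", [2023, 2024]).agg(F.sum("sales"))
```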
Basic Classification Project
A beginner-level classification project that shows how to utilise PySpark in a machine learning workflow
Data Preprocessing with PySpark
- How to start a SparkSession
- Setting up data types for the dataset using StructType
- Focus on data preparation in the preprocessing stage
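A minimal sketch of both steps; the file path and schema fields are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                               StructField, StructType)

# Start a SparkSession
spark = SparkSession.builder.appName("classification").getOrCreate()

# Declare column types up front instead of letting Spark infer them
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("feature_a", DoubleType(), True),
    StructField("feature_b", DoubleType(), True),
    StructField("label", StringType(), True),
])

df = spark.read.csv("data.csv", schema=schema, header=True)
```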
Training ML Models with PySpark
- Using pyspark.ml.classification to train binary classification models
- Introduction to StringIndexer, VectorAssembler
- Splitting the dataset into subsets using .randomSplit
- Saving & loading models
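Put together, assuming the `df` prepared above:

```python
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Encode the string label as a numeric index
df = StringIndexer(inputCol="label", outputCol="label_idx").fit(df).transform(df)

# Collect the feature columns into a single vector column
df = VectorAssembler(inputCols=["feature_a", "feature_b"],
                     outputCol="features").transform(df)

# Split into training and test subsets
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Train a binary classifier
model = LogisticRegression(featuresCol="features", labelCol="label_idx").fit(train)

# Save and reload the fitted model
model.write().overwrite().save("lr_model")
loaded = LogisticRegressionModel.load("lr_model")
```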
Hyperparameter Tuning with Pipelines
- Using pyspark.ml.Pipeline to introduce a compact training approach
- Saving & loading pipelines
- Model evaluation using MulticlassClassificationEvaluator
- Using pyspark.ml.tuning for hyperparameter optimisation
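A compact sketch combining all four points; the stages, grid values, and save paths are illustrative, and `train` / `test` are assumed to be raw splits with `feature_a`, `feature_b`, and a string `label`:

```python
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Bundle preprocessing and the classifier into one Pipeline
indexer = StringIndexer(inputCol="label", outputCol="label_idx")
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label_idx")
pipeline = Pipeline(stages=[indexer, assembler, lr])

evaluator = MulticlassClassificationEvaluator(labelCol="label_idx",
                                              metricName="accuracy")

# Cross-validate over a small regularisation grid
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train)  # train / test: raw DataFrame splits, see above

# Evaluate the tuned model on held-out data
print(evaluator.evaluate(cv_model.transform(test)))

# Save and reload the best pipeline
cv_model.bestModel.write().overwrite().save("best_pipeline")
reloaded = PipelineModel.load("best_pipeline")
```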