August 2023

August 22, 2023
in pyspark
5 min read

Hyperparameter Tuning with Pipelines

This is the last of three posts on the Titanic classification problem in PySpark.

In the last post, we started with a cleaned dataset and prepared it for machine learning using StringIndexer and VectorAssembler, then trained the model itself.
These steps form a series of stages in the construction of a model, which we can group into a single pipeline. PySpark, like sklearn, has pipeline classes that help us keep things organised.

August 21, 2023
in pyspark
7 min read

Training ML Models with PySpark

In this post, we will introduce ourselves to pyspark

We are continuing on from the previous post, PySpark Titanic Preprocessing, where we did some basic data preprocessing. Here we move on to the modeling stage of our project.
We will be using spark.ml.classification to train binary classification models.
There are quite a number of differences from pandas, for example the use of a VectorAssembler, which combines all feature columns into a single vector column.

August 20, 2023
in pyspark
6 min read

Data Preprocessing with PySpark

In this post, we will introduce ourselves to pyspark, a framework that allows us to work with big data

We'll look at how to start a spark_session
Setting up data types for the dataset using StructType
This post focuses on data preparation in the preprocessing stage

August 19, 2023
in nlp
11 min read

Named Entity Recognition with Huggingface Trainer

In a previous post we looked at how we can utilise Huggingface together with PyTorch to create a NER tagging classifier. We did this by loading a preset encoder model and defining our own tail end model for our NER classification task. This required us to utilise Torch, i.e. write more low-level code, which isn't the most beginner friendly, especially if you don't know Torch. In this post, we'll look at utilising only Huggingface, which simplifies the training and inference steps quite a lot. We'll be using the Trainer and pipeline APIs of the Huggingface library, together with a dataset used in mllibs. The dataset includes tags for different words that can be identified as keywords for finding data source tokens, plot parameter tokens and function input parameter tokens.

August 10, 2023
in nlp
16 min read

Named Entity Recognition with Torch Loop

In this notebook, we'll take a look at how we can utilise HuggingFace to easily load and use BERT for token classification. Whilst we are loading both the base model and tokeniser from HuggingFace, we'll be using a custom Torch training loop and tail model customisation. The approach isn't the most straightforward, but it is one way we can do it. We'll be utilising the Massive dataset by Amazon to fine-tune the transformer encoder BERT.