
Index

Utilising Prophet with PySpark

In this notebook, we look at how to use the popular machine learning library Prophet with the pyspark architecture. pyspark itself unfortunately does not contain such an additive regression model; however, we can utilise user defined functions (UDFs), which allow us to access functionality from other libraries that is not available in pyspark.
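As a rough sketch of the idea (not the post's exact code), Prophet can be fitted per group through pyspark's grouped pandas UDF mechanism, applyInPandas; the store column, synthetic data and forecast horizon below are illustrative assumptions.

```python
import pandas as pd
from prophet import Prophet
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# illustrative input: one time series per store (columns: store, ds, y)
pdf_in = pd.DataFrame({
    "store": ["a"] * 60,
    "ds": pd.date_range("2023-01-01", periods=60),
    "y": [float(i) for i in range(60)],
})
df = spark.createDataFrame(pdf_in)

def fit_forecast(pdf: pd.DataFrame) -> pd.DataFrame:
    # Prophet expects the columns ds (date) and y (value)
    model = Prophet()
    model.fit(pdf[["ds", "y"]])
    future = model.make_future_dataframe(periods=30)
    forecast = model.predict(future)
    forecast["store"] = pdf["store"].iloc[0]
    return forecast[["store", "ds", "yhat"]]

# the grouped pandas UDF runs one Prophet fit per store, in parallel
forecasts = df.groupBy("store").applyInPandas(
    fit_forecast, schema="store string, ds timestamp, yhat double"
)
forecasts.show(5)
```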

Comparison of Subsets

An important concept in machine learning is model generalisation & performance deterioration. When we train a model, we perform an optimisation step; using metrics and/or loss values, we can understand how well our model captures the relations between the data points and features in the input data we feed it. Going through this process, we can tune a model so that it performs well on the data we used to train it. Comparing metrics on different subsets of the data then tells us how well that performance carries over to data the model has not seen.
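A minimal sketch of that comparison in pyspark, on a synthetic regression dataset (all names and values below are illustrative): the gap between the train and test errors is what signals how well the model generalises.

```python
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# synthetic data: y is a simple linear function of x
df = spark.createDataFrame(
    [(float(i), 2.0 * i + 1.0) for i in range(100)], ["x", "y"]
)
df = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

# hold out a subset the model never sees during training
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="y").fit(train)

evaluator = RegressionEvaluator(labelCol="y", metricName="rmse")
print("train RMSE:", evaluator.evaluate(model.transform(train)))
print("test  RMSE:", evaluator.evaluate(model.transform(test)))
```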

Hyperparameter Tuning with Pipelines

This post is the last of three posts on the titanic classification problem in pyspark. In the last post, we started with a cleaned dataset, which we prepared for machine learning by utilising StringIndexer & VectorAssembler, before moving on to the model training stage itself. These steps form a series of stages in the construction of a model, which we can group into a single pipeline. pyspark, like sklearn, has pipeline classes that help us keep things organised.
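A minimal sketch of such a pipeline, assuming titanic-style columns (Sex, Age, Fare, Survived); the stages and the small parameter grid are illustrative, not the post's exact setup.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# preprocessing stages: encode the categorical column, then
# collect all features into a single vector column
indexer = StringIndexer(inputCol="Sex", outputCol="SexIndexed")
assembler = VectorAssembler(
    inputCols=["SexIndexed", "Age", "Fare"], outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="Survived")

# preprocessing & model stages grouped into a single pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])

# hyperparameter tuning: cross-validate the whole pipeline over a grid
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="Survived"),
    numFolds=3,
)
# cv_model = cv.fit(train_df)   # train_df: the cleaned titanic dataframe
```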

Named Entity Recognition with Huggingface Trainer

In a previous post we looked at how we can utilise Huggingface together with PyTorch in order to create a NER tagging classifier. We did this by loading a preset encoder model & defined our own tail end model for our NER classification task. This required us to utilise Torch`, ie create more lower end code, which isn't the most beginner friendly, especially if you don't know Torch. In this post, we'll look at utilising only Huggingface, which simplifies the training & inference steps quite a lot. We'll be using the trainer & pipeline methods of the Huggingface library and will use a dataset used in mllibs, which includes tags for different words that can be identified as keywords to finding data source tokens, plot parameter tokens and function input parameter tokens.
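A minimal, self-contained sketch of the Trainer route; the single training example, label ids and model name below are illustrative placeholders for the mllibs data.

```python
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# toy stand-in for the mllibs data: word-level tokens with word-level tags
examples = {"tokens": [["plot", "x", "against", "y"]],
            "labels": [[0, 1, 0, 1]]}

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    labels = []
    for i, word_labels in enumerate(batch["labels"]):
        # map word-level tags onto subword tokens; -100 masks
        # special tokens out of the loss
        word_ids = enc.word_ids(batch_index=i)
        labels.append([-100 if w is None else word_labels[w] for w in word_ids])
    enc["labels"] = labels
    return enc

ds = Dataset.from_dict(examples).map(
    tokenize_and_align, batched=True, remove_columns=["tokens"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-model", num_train_epochs=3,
                           report_to="none"),
    train_dataset=ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

For inference, pipeline("token-classification", model=model, tokenizer=tokenizer) then wraps tokenisation and tag decoding in a single call.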

Named Entity Recognition with Torch Loop

In this notebook, we'll take a look at how we can utilise HuggingFace to easily load and use BERT for token classification. Whilst we load both the base model & tokeniser from HuggingFace, we'll be using a custom Torch training loop and a customised tail model. This approach isn't the most straightforward, but it is one way we can do it. We'll be utilising the MASSIVE dataset by Amazon and fine-tuning the transformer encoder BERT.
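A minimal sketch of such a loop, using the stock token classification head rather than a custom tail model for brevity; the single dummy batch and zeroed labels stand in for a real DataLoader over MASSIVE.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# dummy batch standing in for a DataLoader over the MASSIVE dataset;
# real labels would also mask special tokens with -100
enc = tokenizer(["wake me up at nine am"], return_tensors="pt")
batch = {**enc, "labels": torch.zeros_like(enc["input_ids"])}

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for b in [batch]:  # replace with a real DataLoader
        b = {k: v.to(device) for k, v in b.items()}
        out = model(**b)        # HF models return a loss when labels are given
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```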

Exploring Taxi Trip Data

In this post, we will use spark for exploratory analysis of taxi trip data from the city of Chicago. When the data grows too large, pandas works more slowly than spark, so we will use this tool for our exploratory analysis!
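A minimal sketch of the kind of spark queries involved; the file path and the column names (payment_type, trip_total) are assumptions about the Chicago taxi data.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# assumed local extract of the Chicago taxi trips data
trips = spark.read.csv("taxi_trips.csv", header=True, inferSchema=True)

# inspect the inferred schema, then aggregate trip counts
# and average fares per payment type
trips.printSchema()
trips.groupBy("payment_type").agg(
    F.count("*").alias("n_trips"),
    F.round(F.avg("trip_total"), 2).alias("avg_total"),
).orderBy(F.desc("n_trips")).show()
```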