
Index

Utilising Prophet with PySpark

In this notebook, we look at how to use the popular machine learning library Prophet with the pyspark architecture. pyspark itself unfortunately does not contain such an additive regression model; however, we can utilise user defined functions (UDFs), which allow us to access functionality from other libraries that is not available in pyspark.
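As a rough sketch of the idea (not the post's exact code), Prophet can be fitted per group through pyspark's grouped pandas UDF mechanism, applyInPandas; the store column, synthetic data and forecast horizon below are illustrative assumptions.

```python
import pandas as pd
from prophet import Prophet
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# illustrative input: one time series per store (columns: store, ds, y)
pdf_in = pd.DataFrame({
    "store": ["a"] * 60,
    "ds": pd.date_range("2023-01-01", periods=60),
    "y": [float(i) for i in range(60)],
})
df = spark.createDataFrame(pdf_in)

def fit_forecast(pdf: pd.DataFrame) -> pd.DataFrame:
    # Prophet expects the columns ds (date) and y (value)
    model = Prophet()
    model.fit(pdf[["ds", "y"]])
    future = model.make_future_dataframe(periods=30)
    forecast = model.predict(future)
    forecast["store"] = pdf["store"].iloc[0]
    return forecast[["store", "ds", "yhat"]]

# the grouped pandas UDF runs one Prophet fit per store, in parallel
forecasts = df.groupBy("store").applyInPandas(
    fit_forecast, schema="store string, ds timestamp, yhat double"
)
forecasts.show(5)
```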

Comparison of Subsets

An important concept in machine learning is model generalisation & performance deterioration. When we train a model, we perform an optimisation step; using metrics and/or loss values, we can understand how well our model captures the relations between the data points and features in the input data we feed it. Going through this process, we can tune a model so that it performs well on the data we used to train it. Comparing metrics on different subsets of the data then tells us how well that performance carries over to data the model has not seen.
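A minimal sketch of that comparison in pyspark, on a synthetic regression dataset (all names and values below are illustrative): the gap between the train and test errors is what signals how well the model generalises.

```python
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# synthetic data: y is a simple linear function of x
df = spark.createDataFrame(
    [(float(i), 2.0 * i + 1.0) for i in range(100)], ["x", "y"]
)
df = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

# hold out a subset the model never sees during training
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="y").fit(train)

evaluator = RegressionEvaluator(labelCol="y", metricName="rmse")
print("train RMSE:", evaluator.evaluate(model.transform(train)))
print("test  RMSE:", evaluator.evaluate(model.transform(test)))
```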

Hyperparameter Tuning with Pipelines

This post is the last of three posts on the titanic classification problem in pyspark. In the last post, we started with a cleaned dataset, which we prepared for machine learning by utilising StringIndexer & VectorAssembler, before moving on to the model training stage itself. These steps form a series of stages in the construction of a model, which we can group into a single pipeline. pyspark, like sklearn, has pipeline classes that help us keep things organised.
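A minimal sketch of such a pipeline, assuming titanic-style columns (Sex, Age, Fare, Survived); the stages and the small parameter grid are illustrative, not the post's exact setup.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# preprocessing stages: encode the categorical column, then
# collect all features into a single vector column
indexer = StringIndexer(inputCol="Sex", outputCol="SexIndexed")
assembler = VectorAssembler(
    inputCols=["SexIndexed", "Age", "Fare"], outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="Survived")

# preprocessing & model stages grouped into a single pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])

# hyperparameter tuning: cross-validate the whole pipeline over a grid
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="Survived"),
    numFolds=3,
)
# cv_model = cv.fit(train_df)   # train_df: the cleaned titanic dataframe
```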

Named Entity Recognition with Huggingface Trainer

In a previous post we looked at how we can utilise Huggingface together with PyTorch in order to create a NER tagging classifier. We did this by loading a preset encoder model & defined our own tail end model for our NER classification task. This required us to utilise Torch`, ie create more lower end code, which isn't the most beginner friendly, especially if you don't know Torch. In this post, we'll look at utilising only Huggingface, which simplifies the training & inference steps quite a lot. We'll be using the trainer & pipeline methods of the Huggingface library and will use a dataset used in mllibs, which includes tags for different words that can be identified as keywords to finding data source tokens, plot parameter tokens and function input parameter tokens.
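A minimal, self-contained sketch of the Trainer route; the single training example, label ids and model name below are illustrative placeholders for the mllibs data.

```python
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# toy stand-in for the mllibs data: word-level tokens with word-level tags
examples = {"tokens": [["plot", "x", "against", "y"]],
            "labels": [[0, 1, 0, 1]]}

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    labels = []
    for i, word_labels in enumerate(batch["labels"]):
        # map word-level tags onto subword tokens; -100 masks
        # special tokens out of the loss
        word_ids = enc.word_ids(batch_index=i)
        labels.append([-100 if w is None else word_labels[w] for w in word_ids])
    enc["labels"] = labels
    return enc

ds = Dataset.from_dict(examples).map(
    tokenize_and_align, batched=True, remove_columns=["tokens"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-model", num_train_epochs=3,
                           report_to="none"),
    train_dataset=ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

For inference, pipeline("token-classification", model=model, tokenizer=tokenizer) then wraps tokenisation and tag decoding in a single call.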

Named Entity Recognition with Torch Loop

In this notebook, we'll take a look at how we can utilise HuggingFace to easily load and use BERT for token classification. Whilst we load both the base model & tokeniser from HuggingFace, we'll be using a custom Torch training loop and a customised tail model. This approach isn't the most straightforward, but it is one way we can do it. We'll be utilising the MASSIVE dataset by Amazon and fine-tuning the transformer encoder BERT.
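A minimal sketch of such a loop, using the stock token classification head rather than a custom tail model for brevity; the single dummy batch and zeroed labels stand in for a real DataLoader over MASSIVE.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# dummy batch standing in for a DataLoader over the MASSIVE dataset;
# real labels would also mask special tokens with -100
enc = tokenizer(["wake me up at nine am"], return_tensors="pt")
batch = {**enc, "labels": torch.zeros_like(enc["input_ids"])}

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for b in [batch]:  # replace with a real DataLoader
        b = {k: v.to(device) for k, v in b.items()}
        out = model(**b)        # HF models return a loss when labels are given
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```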

Exploring Taxi Trip Data

In this post, we will use spark for exploratory analysis of taxi trip data from the city of Chicago. When the data grows too large, pandas works more slowly than spark, so we will use this tool for our exploratory analysis!
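A minimal sketch of the kind of spark queries involved; the file path and the column names (payment_type, trip_total) are assumptions about the Chicago taxi data.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# assumed local extract of the Chicago taxi trips data
trips = spark.read.csv("taxi_trips.csv", header=True, inferSchema=True)

# inspect the inferred schema, then aggregate trip counts
# and average fares per payment type
trips.printSchema()
trips.groupBy("payment_type").agg(
    F.count("*").alias("n_trips"),
    F.round(F.avg("trip_total"), 2).alias("avg_total"),
).orderBy(F.desc("n_trips")).show()
```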