Two-Stage RecSys
!pip install implicit -qqq
!pip install catboost -qqq
import datetime
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import scipy.sparse as sparse
from catboost import CatBoostClassifier
import implicit
from implicit.bpr import BayesianPersonalizedRanking as BPR
import warnings; warnings.filterwarnings('ignore')
def recall(df: pd.DataFrame, pred_col='preds', true_col='item_id', k=30) -> float:
    recall_values = []
    for _, row in df.iterrows():
        num_relevant = len(set(row[true_col]) & set(row[pred_col][:k]))
        num_true = len(row[true_col])
        recall_values.append(num_relevant / num_true)
    return np.mean(recall_values)

def precision(df: pd.DataFrame, pred_col='preds', true_col='item_id', k=30) -> float:
    precision_values = []
    for _, row in df.iterrows():
        num_relevant = len(set(row[true_col]) & set(row[pred_col][:k]))
        num_true = min(k, len(row[true_col]))
        precision_values.append(num_relevant / num_true)
    return np.mean(precision_values)

def mrr(df: pd.DataFrame, pred_col='preds', true_col='item_id', k=30) -> float:
    mrr_values = []
    for _, row in df.iterrows():
        intersection = set(row[true_col]) & set(row[pred_col][:k])
        user_mrr = 0
        if len(intersection) > 0:
            for item in intersection:
                user_mrr = max(user_mrr, 1 / (row[pred_col].index(item) + 1))
        mrr_values.append(user_mrr)
    return np.mean(mrr_values)
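As a quick sanity check of these helpers (a toy example, not part of the pipeline), this is the input format they expect: one row per user, with the true item_id values and the ranked predictions both stored as lists.
# Toy example: two users, evaluated at k=2
toy = pd.DataFrame({
    'item_id': [[1, 2, 3], [4, 5]],     # true items per user
    'preds':   [[2, 9, 1], [9, 8, 7]],  # ranked predictions per user
})
print(recall(toy, k=2))     # (1/3 + 0) / 2 ≈ 0.167
print(precision(toy, k=2))  # (1/2 + 0) / 2 = 0.25
print(mrr(toy, k=2))        # (1/1 + 0) / 2 = 0.5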
import os; os.listdir('/kaggle/input/kion-dataset/')
['items.csv', 'users.csv', 'interactions.csv']
In this notebook we will do the following:
- Split the interaction data by time into three "global" parts (`train`, `val`, `test`). We treat `train` and `val` as the data we have for building recommendations, while `test` is the period we make recommendations for; its interactions are used only as a reference to understand how well the models perform together.
- Train a 1st stage BPR model on `train` to generate ranked candidates, which are used as features for the classifier on `val` and `test`.
- Train a 2nd stage classifier on positive and negative samples (from the overlap and the mismatch between the candidates and `val`), using the item `rank` in candidate-priority order from the 1st stage prediction as a feature.
- The 2nd stage model is trained on a train subset of users, validated on a validation subset of users for early stopping, and then checked on an unseen subset of users, all within the `val` global split.
- Once we have both models, we create recommendations for the `test` global split (the last 7 days).
- Using the first stage model, we create candidates as we did on `val`, but this time join them with `test`, define the new rank order, and evaluate the metrics.
- Lastly, we use the 2nd stage model (trained on `val`) to get the probability of the positive class (`ctb_pred`) for the previous step's candidates, sort by the resulting `rank_ctb`, and re-evaluate the metrics.
KION Movie/Series Dataset:
- `interactions` contains the user/item interaction data
- `users` contains information about the user (`user_id`)
- `items` contains information about the item (`item_id`)
The dataset has the typical recommendation setup: user/item interaction data plus some information about the users and the items (which in this dataset are movies and series).
Let's start off by reading the datasets; we'll explore them below.
interactions = pd.read_csv("/kaggle/input/kion-dataset/interactions.csv")
items = pd.read_csv("/kaggle/input/kion-dataset/items.csv")
users = pd.read_csv("/kaggle/input/kion-dataset/users.csv")
# convert the column [last_watch_dt] into datetime
interactions['last_watch_dt'] = pd.to_datetime(interactions['last_watch_dt']).map(lambda x: x.date())
print(f"Уникальных юзеров в interactions: {interactions['user_id'].nunique()}")
print(f"Уникальных айтемов в interactions: {interactions['item_id'].nunique()}")
Уникальных юзеров в interactions: 962179 Уникальных айтемов в interactions: 15706
User/Item Interactions
Standard user/item interaction features:
- `user_id`: user identifier
- `item_id`: film/series identifier
- `last_watch_dt`: date the movie/series was last watched
- `total_dur`: watched duration in seconds (implicit feedback)
- `watched_pct`: watched percentage (implicit feedback)
We have a dataset of implicit user/item interactions. We'll be using `watched_pct` as our interaction column.
interactions.head()
|   | user_id | item_id | last_watch_dt | total_dur | watched_pct |
|---|---------|---------|---------------|-----------|-------------|
| 0 | 176549 | 9506 | 2021-05-11 | 4250 | 72.0 |
| 1 | 699317 | 1659 | 2021-05-29 | 8317 | 100.0 |
| 2 | 656683 | 7107 | 2021-05-09 | 10 | 0.0 |
| 3 | 864613 | 7638 | 2021-07-05 | 14483 | 100.0 |
| 4 | 964868 | 9506 | 2021-04-30 | 6725 | 100.0 |
Item Information
Information about the movies/series (`item_id`):
- `content_type`: type of item
- `title`: title of the item
- `title_orig`: original title
- `release_year`: release year
- `genres`: genres
- `countries`: countries
- `for_kids`: for-kids flag
- `age_rating`: age rating
- `studios`: film studio
- `directors`: directors
- `actors`: actors
- `keywords`: keywords
- `description`: description
users.head(2)
|   | user_id | age | income | sex | kids_flg |
|---|---------|-----|--------|-----|----------|
| 0 | 973171 | age_25_34 | income_60_90 | М | 1 |
| 1 | 962099 | age_18_24 | income_20_40 | М | 0 |
User Features
Features that describe the `user_id`:
- `age`: age group
- `income`: income group
- `sex`: gender
- `kids_flg`: kids flag
items.head(2)
|   | item_id | content_type | title | title_orig | release_year | genres | countries | for_kids | age_rating | studios | directors | actors | description | keywords |
|---|---------|--------------|-------|------------|--------------|--------|-----------|----------|------------|---------|-----------|--------|-------------|----------|
| 0 | 10711 | film | Поговори с ней | Hable con ella | 2002.0 | драмы, зарубежные, детективы, мелодрамы | Испания | NaN | 16.0 | NaN | Педро Альмодовар | Адольфо Фернандес, Ана Фернандес, Дарио Гранди... | Мелодрама легендарного Педро Альмодовара «Пого... | Поговори, ней, 2002, Испания, друзья, любовь, ... |
| 1 | 2508 | film | Голые перцы | Search Party | 2014.0 | зарубежные, приключения, комедии | США | NaN | 16.0 | NaN | Скот Армстронг | Адам Палли, Брайан Хаски, Дж.Б. Смув, Джейсон ... | Уморительная современная комедия на популярную... | Голые, перцы, 2014, США, друзья, свадьбы, прео... |
Preprocessing stages:
We do some preprocessing so that the model has enough signal to learn relations in the data when training:
1. Filter out interactions with less than 300 seconds of viewing time
2. Filter out users (`user_id`) with fewer than 10 film views
3. Filter out movies (`item_id`) with fewer than 10 views
1) Filter accidental views
The interactions dataset contains the column `total_dur`, which tells us how many seconds the user watched the `item_id`. Let's filter out items that were started but barely watched, setting the threshold at 300 seconds.
interactions = interactions[interactions['total_dur'] >= 300]
2) User Filtration
user_interactions_count = interactions.groupby('user_id')[['item_id']].count().reset_index()
filtered_users = user_interactions_count[user_interactions_count['item_id'] >= 10][['user_id']]
interactions = filtered_users.merge(interactions, how='left')
3) Film Filtration
item_interactions_count = interactions.groupby('item_id')[['user_id']].count().reset_index()
filtered_items = item_interactions_count[item_interactions_count['user_id'] >= 10][['item_id']]
interactions = filtered_items.merge(interactions, how='left')
Timeseries splitting
- We split the interactions dataset into parts based on the datetime column `last_watch_dt`, the only time-related column in the dataset
- Different parts of the interactions dataset are used for different purposes in the two-stage process
- We treat the global training and validation sets as the data we already have; beyond that date we want to make predictions, and since we actually do have those later interactions, we'll monitor the recommendation metrics against them
max_date = interactions['last_watch_dt'].max()
min_date = interactions['last_watch_dt'].min()
print(f"min дата в interactions: {min_date}")
print(f"max дата в interactions: {max_date}")
min дата в interactions: 2021-03-13 max дата в interactions: 2021-08-22
Split information:
- `test`: the last 7 days of interactions
- `train_val`: the train & validation data (everything before the `test` start date)
  - `train`: interactions up to 60 days before the `test` start date
  - `val`: the last 60 days of interactions before the `test` start date
# global test dataset starting time (7 days)
test_threshold = max_date - pd.Timedelta(days=7)
# validation dataset starting time (2 months)
val_threshold = test_threshold - pd.Timedelta(days=60)
test = interactions[(interactions['last_watch_dt'] >= test_threshold)]
train_val = interactions[(interactions['last_watch_dt'] < test_threshold)]
val = train_val[(train_val['last_watch_dt'] >= val_threshold)]
train = train_val[(train_val['last_watch_dt'] < val_threshold)]
print(f"train: {train.shape}")
print(f"val: {val.shape}")
print(f"test: {test.shape}")
train: (881660, 5)
val: (1246263, 5)
test: (172593, 5)
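As an optional sanity check (an added cell, not in the original flow), we can print each split's date range to confirm the boundaries don't overlap:
# Date ranges of each split; train should end where val starts, and val where test starts
for name, part in [('train', train), ('val', val), ('test', test)]:
    print(name, part['last_watch_dt'].min(), '-', part['last_watch_dt'].max())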
Model 1 purpose:
- The purpose of the first stage model is to generate item candidates for the second stage model
- We use the BPR model from the implicit library, which optimizes a pairwise loss suited to ranking (the standard objective is shown below, before the training code)
- The model outputs are ranked in recommendation order
The end products of this section are:
- positive samples: `item_id` values that the model (trained on `train`) recommends and that the user actually interacted with in the `val` global subset; a limit of 30 candidates (k=30) is used
- negative samples: after joining the predictions with `val`, the combinations with no `watched_pct` value, i.e. no interaction found in `interactions`
- `rank`: the position of each item in the model's prediction order; this will be one of the features in the 2nd stage model
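For reference, this is the standard BPR objective (Rendle et al.), maximized over sampled triples (u, i, j), where i is an item the user interacted with and j is a sampled negative item, and the score is the dot product of the learned user and item factors:

$$\text{BPR-OPT} = \sum_{(u,i,j)} \ln \sigma\!\left(\hat{x}_{ui} - \hat{x}_{uj}\right) - \lambda_{\Theta}\,\lVert \Theta \rVert^{2}, \qquad \hat{x}_{ui} = p_u^{\top} q_i$$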
# train model on [train]
users_id = list(np.sort(train.user_id.unique()))
items_train = list(train.item_id.unique())
ratings_train = list(train.watched_pct)
rows_train = train.user_id.astype('category').cat.codes
cols_train = train.item_id.astype('category').cat.codes
# create rating matrix (watched percentage [watched_pct])
train_sparse = sparse.csr_matrix((ratings_train, (rows_train, cols_train)),
shape=(len(users_id), len(items_train)))
algo = BPR(factors=50,
regularization=0.01,
iterations=50,
use_gpu=False)
algo.fit((train_sparse).astype('double'))
# Output [1] from first model;
# user and item factorisation matrices
user_vecs = algo.user_factors
item_vecs = algo.item_factors
# BPR implicit prediction
def predict(user_vecs, item_vecs, k=10):
    """
    Helper function for matrix factorisation prediction
    """
    id2user = dict(zip(rows_train, train.user_id))
    id2item = dict(zip(cols_train, train.item_id))
    scores = user_vecs.dot(item_vecs.T)
    # argpartition with kth=-k guarantees the last k positions hold the top-k scores
    ind_part = np.argpartition(scores, -k)[:, -k:].copy()
    scores_not_sorted = np.take_along_axis(scores, ind_part, axis=1)
    ind_sorted = np.argsort(scores_not_sorted, axis=1)
    indices = np.take_along_axis(ind_part, ind_sorted, axis=1)
    indices = np.flip(indices, 1)
    preds = pd.DataFrame({
        'user_id': range(user_vecs.shape[0]),
        'preds': indices.tolist(),
    })
    preds['user_id'] = preds['user_id'].map(id2user)
    preds['preds'] = preds['preds'].map(lambda inds: [id2item[i] for i in inds])
    return preds
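As an aside, implicit also exposes a built-in recommend method that performs the same top-N scoring directly from the fitted model; the manual dot-product version above is kept because it makes the ranking explicit. A rough sketch only, since the exact signature and return type vary across implicit versions:
# Sketch: implicit >= 0.5 returns (item_indices, scores) arrays;
# older versions instead return a list of (item_index, score) tuples
user_idx = 0  # internal row index used to build train_sparse, not the raw user_id
item_idx, item_scores = algo.recommend(user_idx, train_sparse[user_idx], N=30,
                                        filter_already_liked_items=False)
# item_idx holds internal item indices; map back to item_id with a dict like id2item in predict()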
k=30
# films watched in [val] dataset
val_user_history = val.groupby('user_id')[['item_id']].agg(lambda x: list(x))
# films recommended from algo (on train)
pred_bpr = predict(user_vecs, item_vecs, k)
pred_bpr = val_user_history.merge(pred_bpr, how='left', on='user_id')
pred_bpr = pred_bpr.dropna(subset=['preds'])
pred_bpr.head()
|   | user_id | item_id | preds |
|---|---------|---------|-------|
| 0 | 2 | [242, 3628, 5819, 7106, 7921, 8482, 9164, 1077... | [3166, 9164, 12965, 12299, 11919, 8482, 4072, ... |
| 2 | 21 | [308, 3784, 4495, 5077, 6384, 7102, 7571, 8251... | [849, 1053, 11237, 826, 4382, 11661, 24, 14703... |
| 3 | 30 | [1107, 2346, 2743, 3031, 7250, 9728, 9842, 112... | [10464, 10440, 16447, 7946, 2100, 2303, 9728, ... |
| 4 | 46 | [10440] | [142, 4880, 9996, 6809, 11640, 10440, 2498, 86... |
| 6 | 60 | [1179, 1343, 1590, 3550, 6044, 6606, 8612, 972... | [4880, 13865, 4151, 1083, 7107, 1449, 7571, 11... |
Check the metrics to see how well the model recommends relevant `item_id` values to users.
# metrics for trained bpr prediction & [val] dataset overlap
print('recall',round(recall(pred_bpr),3))
print('precision',round(precision(pred_bpr),3))
print('mrr',round(mrr(pred_bpr),3))
recall 0.116
precision 0.117
mrr 0.132
Prepare the CatBoost model dataset:
- Prepare the dataset for the 2nd stage model by exploding the predicted `item_id` lists and adding the rank order
- The rank is computed per user, starting from 1
candidates = pred_bpr[['user_id', 'preds']]
candidates = candidates.explode('preds').rename(columns={'preds': 'item_id'})
candidates['rank'] = candidates.groupby('user_id').cumcount() + 1
candidates.head()
|   | user_id | item_id | rank |
|---|---------|---------|------|
| 0 | 2 | 3166 | 1 |
| 0 | 2 | 9164 | 2 |
| 0 | 2 | 12965 | 3 |
| 0 | 2 | 12299 | 4 |
| 0 | 2 | 11919 | 5 |
Prepare the positive data samples:
- Use an inner join between `candidates` (the 1st stage model predictions) and `val` (the global validation dataset)
- Each row is an overlapping `user_id`/`item_id` combination, i.e. a recommended item that is also present in the `val` interactions
pos = candidates.merge(val,
on=['user_id', 'item_id'],
how='inner')
pos['target'] = 1
print('number of positive samples',pos.shape)
pos.head()
number of positive samples (64588, 7)
|   | user_id | item_id | rank | last_watch_dt | total_dur | watched_pct | target |
|---|---------|---------|------|---------------|-----------|-------------|--------|
| 0 | 2 | 9164 | 2 | 2021-06-23 | 6650 | 100.0 | 1 |
| 1 | 2 | 8482 | 6 | 2021-06-18 | 5886 | 100.0 | 1 |
| 2 | 30 | 9728 | 7 | 2021-06-21 | 8436 | 100.0 | 1 |
| 3 | 46 | 10440 | 6 | 2021-07-05 | 7449 | 20.0 | 1 |
| 4 | 60 | 9728 | 22 | 2021-06-23 | 8066 | 100.0 | 1 |
Prepare the negative data samples:
- There will be far more negative samples than positive ones
- The validation dataset is joined to the predictions with a left join, so we get candidate (user/item) combinations that don't exist in the `val` dataset
- We'll keep the negative-to-positive sample ratio at roughly 2:1 (the realized ratio is checked right after the sample below)
# predicted films that the user did not watch in the validation set
# join() defaults to a left join
neg = candidates.set_index(['user_id', 'item_id'])\
    .join(val.set_index(['user_id', 'item_id']))
neg = neg[neg['watched_pct'].isnull()].reset_index()
print(neg.shape)
neg = neg.sample(frac=0.07)
print(neg.shape)
neg['target'] = 0
neg.head()
(1814012, 6)
(126981, 6)
|         | user_id | item_id | rank | last_watch_dt | total_dur | watched_pct | target |
|---------|---------|---------|------|---------------|-----------|-------------|--------|
| 130488  | 80687   | 6588  | 10 | NaN | NaN | NaN | 0 |
| 1450066 | 875976  | 16356 | 21 | NaN | NaN | NaN | 0 |
| 1135561 | 685640  | 9157  | 26 | NaN | NaN | NaN | 0 |
| 736980  | 445205  | 10770 | 14 | NaN | NaN | NaN | 0 |
| 1766269 | 1069796 | 6006  | 13 | NaN | NaN | NaN | 0 |
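A quick check of the realized negative-to-positive ratio after sampling (an added cell; with the shapes printed above it comes out to roughly 126981 / 64588 ≈ 2):
# Negative-to-positive ratio after the frac=0.07 sample
print(round(len(neg) / len(pos), 2))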
Split the `val` users:
- Split the unique `val` users into train (`ctb_train_users`), validation (`ctb_eval_users`) and test (`ctb_test_users`) groups; these are used only to build the subsets for the CatBoost 2nd stage model
# divide the users into 3 subgroups
ctb_train_users, ctb_test_users = train_test_split(val['user_id'].unique(),
random_state=1,
test_size=0.2)
ctb_train_users, ctb_eval_users = train_test_split(ctb_train_users,
random_state=1,
test_size=0.1)
print('users in ctb train', ctb_train_users)
print('users in ctb eval', ctb_eval_users)
print('users in ctb test', ctb_test_users)
users in ctb train [790260 678092 74663 ... 239212 914265 167935]
users in ctb eval [633541 825546 531429 ... 102166 686044 55175]
users in ctb test [1069931 871670 834200 ... 178739 463296 1091209]
Define the subsets for the user groups we just created:
- `ctb_train`: used for training
- `ctb_eval`: used for evaluation (early stopping) during training
- `ctb_test`: used for evaluation on unseen users
select_col = ['user_id', 'item_id', 'rank', 'target']
# Training dataset
ctb_train = shuffle(
pd.concat([
pos[pos['user_id'].isin(ctb_train_users)],
neg[neg['user_id'].isin(ctb_train_users)]
])[select_col]
)
display(ctb_train.head())
# Test subset (used to check metrics on unseen users)
ctb_test = shuffle(
pd.concat([
pos[pos['user_id'].isin(ctb_test_users)],
neg[neg['user_id'].isin(ctb_test_users)]
])[select_col]
)
# Evaluation train subset
ctb_eval = shuffle(
pd.concat([
pos[pos['user_id'].isin(ctb_eval_users)],
neg[neg['user_id'].isin(ctb_eval_users)]
])[select_col]
)
|         | user_id | item_id | rank | target |
|---------|---------|---------|------|--------|
| 1755444 | 1063468 | 2498  | 1  | 0 |
| 13617   | 232693  | 10440 | 29 | 1 |
| 682599  | 412208  | 8101  | 15 | 0 |
| 1008013 | 609311  | 11275 | 12 | 0 |
| 1588295 | 960986  | 15695 | 24 | 0 |
Add the user and item features (metadata joined on `user_id` and `item_id`) to the train and evaluation subsets.
user_col = ['user_id', 'age', 'income', 'sex', 'kids_flg']
item_col = ['item_id', 'content_type', 'countries', 'for_kids', 'age_rating', 'studios']
# train
train_feat = (ctb_train
.merge(users[user_col], on=['user_id'], how='left')
.merge(items[item_col], on=['item_id'], how='left'))
# evaluation dataset with train for early stopping
eval_feat = (ctb_eval
.merge(users[user_col], on=['user_id'], how='left')
.merge(items[item_col], on=['item_id'], how='left'))
train_feat.head()
|   | user_id | item_id | rank | target | age | income | sex | kids_flg | content_type | countries | for_kids | age_rating | studios |
|---|---------|---------|------|--------|-----|--------|-----|----------|--------------|-----------|----------|------------|---------|
| 0 | 1063468 | 2498  | 1  | 0 | age_45_54 | income_20_40 | М | 0.0 | film   | Россия, Армения | NaN | 16.0 | NaN |
| 1 | 232693  | 10440 | 29 | 1 | age_25_34 | income_40_60 | Ж | 0.0 | series | Россия | NaN | 18.0 | NaN |
| 2 | 412208  | 8101  | 15 | 0 | age_25_34 | income_20_40 | Ж | 0.0 | series | Россия | NaN | 0.0  | NaN |
| 3 | 609311  | 11275 | 12 | 0 | age_35_44 | income_20_40 | М | 1.0 | film   | Россия | NaN | 16.0 | NaN |
| 4 | 960986  | 15695 | 24 | 0 | age_35_44 | income_60_90 | Ж | 1.0 | series | США    | NaN | 12.0 | NaN |
Separate the `target` from the dataset and define the feature matrices for the training and evaluation subsets.
'''
Define column information for model
- drop columns [user_id], [item_id]
- target column: [target]
- categorical columns: [age] [income] [sex] [content_type] [countries] [studios]
'''
# drop pointless columns and separate target
drop_col = ['user_id', 'item_id']
target_col = ['target']
# we will define the categorical columns in catboost
cat_col = ['age', 'income', 'sex', 'content_type', 'countries', 'studios']
X_train, y_train = train_feat.drop(drop_col + target_col, axis=1), train_feat[target_col]
X_val, y_val = eval_feat.drop(drop_col + target_col, axis=1), eval_feat[target_col]
X_train.shape, y_train.shape, X_val.shape, y_val.shape
X_train.head()
|   | rank | age | income | sex | kids_flg | content_type | countries | for_kids | age_rating | studios |
|---|------|-----|--------|-----|----------|--------------|-----------|----------|------------|---------|
| 0 | 1  | age_45_54 | income_20_40 | М | 0.0 | film   | Россия, Армения | NaN | 16.0 | NaN |
| 1 | 29 | age_25_34 | income_40_60 | Ж | 0.0 | series | Россия | NaN | 18.0 | NaN |
| 2 | 15 | age_25_34 | income_20_40 | Ж | 0.0 | series | Россия | NaN | 0.0  | NaN |
| 3 | 12 | age_35_44 | income_20_40 | М | 1.0 | film   | Россия | NaN | 16.0 | NaN |
| 4 | 24 | age_35_44 | income_60_90 | Ж | 1.0 | series | США    | NaN | 12.0 | NaN |
X_train.isna().sum()
rank                 0
age              25220
income           25080
sex              25294
kids_flg         24161
content_type         0
countries            0
for_kids        134693
age_rating           0
studios         137203
dtype: int64
Fill missing values with the most frequent value in each column.
# fillna for catboost with the most frequent value
X_train = X_train.fillna(X_train.mode().iloc[0])
# fillna for catboost with the most frequent value
X_val = X_val.fillna(X_train.mode().iloc[0])
Model settings:
- We define some standard hyperparameters, which haven't been tuned
- We'll compare model performance using the ROC AUC metric
'''
Define Hyperparameters for Classifier
'''
# model hyperparameters
est_params = {
'subsample': 0.9,
'max_depth': 5,
'n_estimators': 2000,
'learning_rate': 0.01,
'thread_count': 20,
'random_state': 42,
'verbose': 200,
}
ctb_model = CatBoostClassifier(**est_params)
import warnings; warnings.filterwarnings('ignore')
ctb_model.fit(X_train,
y_train,
eval_set=(X_val, y_val),
early_stopping_rounds=100,
cat_features=cat_col)
0:    learn: 0.6902088  test: 0.6903005  best: 0.6903005 (0)     total: 150ms   remaining: 4m 59s
200:  learn: 0.5318929  test: 0.5413494  best: 0.5413494 (200)   total: 13.4s   remaining: 1m 59s
400:  learn: 0.5237664  test: 0.5347690  best: 0.5347690 (400)   total: 28.2s   remaining: 1m 52s
600:  learn: 0.5206398  test: 0.5321545  best: 0.5321545 (600)   total: 42.8s   remaining: 1m 39s
800:  learn: 0.5188275  test: 0.5307083  best: 0.5307083 (800)   total: 57.8s   remaining: 1m 26s
1000: learn: 0.5178101  test: 0.5299766  best: 0.5299766 (1000)  total: 1m 11s  remaining: 1m 11s
1200: learn: 0.5167082  test: 0.5291069  best: 0.5291069 (1200)  total: 1m 26s  remaining: 57.6s
1400: learn: 0.5157984  test: 0.5284334  best: 0.5284334 (1400)  total: 1m 40s  remaining: 43.1s
1600: learn: 0.5151791  test: 0.5280492  best: 0.5280479 (1598)  total: 1m 54s  remaining: 28.6s
1800: learn: 0.5147084  test: 0.5277376  best: 0.5277376 (1800)  total: 2m 8s   remaining: 14.2s
1999: learn: 0.5141424  test: 0.5273592  best: 0.5273592 (1999)  total: 2m 22s  remaining: 0us
bestTest = 0.5273591755
bestIteration = 1999
<catboost.core.CatBoostClassifier at 0x78e1015b6980>
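Before evaluating, it can be useful to check which features the classifier actually leans on; the BPR `rank` is expected to be a strong signal. A small inspection sketch (an added cell, not part of the original run; setting `eval_metric='AUC'` in `est_params` would also let early stopping track the metric reported below):
# Inspect feature importances of the trained classifier
importances = ctb_model.get_feature_importance(prettified=True)
print(importances)  # DataFrame listing each feature with its importance score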
Metric evaluation:
- Score the model on the training subset and evaluate the ROC AUC metric
# prediction on train subset of users
y_pred = ctb_model.predict_proba(X_train)
f"ROC AUC score = {roc_auc_score(y_train, y_pred[:, 1]):.2f}"
'ROC AUC score = 0.78'
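For completeness, the same score can be computed on the eval subset used for early stopping (an added cell; its output is not shown in the original run):
# ROC AUC on the early-stopping eval subset
y_pred_eval = ctb_model.predict_proba(X_val)
print(f"eval ROC AUC = {roc_auc_score(y_val, y_pred_eval[:, 1]):.2f}")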
- Score the model on unseen test data and evaluate the ROC AUC metric
'''
Prepare Test Set
'''
test_feat = (ctb_test
.merge(users[user_col], on=['user_id'], how='left')
.merge(items[item_col], on=['item_id'], how='left'))
# fillna for catboost with the most frequent value
test_feat = test_feat.fillna(X_train.mode().iloc[0])
X_test, y_test = test_feat.drop(drop_col + target_col, axis=1), test_feat['target']
'''
Make prediction on test set
'''
y_pred = ctb_model.predict_proba(X_test)
f"ROC AUC score = {roc_auc_score(y_test, y_pred[:, 1]):.2f}"
'ROC AUC score = 0.77'
Time to make recommendations for the period after `val`
- We return to the global `test` dataset (i.e. the last 7 days of interaction data)
- These last 7 days of interactions are used to confirm how well the models work on unseen data
Start by grouping the `item_id` values for each user and storing them in a list, as before.
# group [item_id] for each [user_id] in test (main test)
test = test[test['user_id'].isin(val['user_id'].unique())]  # keep only test users who also appear in val
test_user_history = test.groupby('user_id')[['item_id']].agg(lambda x: list(x))
display(test_user_history.head())
| user_id | item_id |
|---------|---------|
| 3  | [47, 965, 2025, 2722, 9438, 10240] |
| 21 | [13787, 14488] |
| 30 | [4181, 8584, 8636] |
| 53 | [1445, 15629, 15810, 16426] |
| 98 | [89, 512] |
Now generate 100 candidates per user from the 1st stage model:
- We also evaluate the overlap metrics, so we know how well the first stage model alone performs on `test`
# first model prediction for k=100
pred_bpr = predict(user_vecs, item_vecs, k=100)
pred_bpr = test_user_history.merge(pred_bpr, how='left', on='user_id')
pred_bpr = pred_bpr.dropna(subset=['preds'])
display(pred_bpr.head()) # overlap b/w [train] (user_vecs/item_vecs) prediction and test set (item_id)
# determine overlapping metrics b/w test and train
print('recall',round(recall(pred_bpr, k=20),3))
print('precision',round(precision(pred_bpr, k=20),3))
print('mrr',round(mrr(pred_bpr, k=20),3))
|   | user_id | item_id | preds |
|---|---------|---------|-------|
| 1 | 21  | [13787, 14488] | [849, 1053, 11237, 826, 4382, 11661, 24, 14703... |
| 2 | 30  | [4181, 8584, 8636] | [10464, 10440, 16447, 7946, 2100, 2303, 9728, ... |
| 4 | 98  | [89, 512] | [15410, 9713, 14378, 14053, 12604, 11402, 1049... |
| 5 | 106 | [337, 1439, 2808, 2836, 5411, 6267, 10544, 128... | [16166, 3182, 9506, 15224, 4718, 16270, 10732,... |
| 8 | 241 | [6162, 8986, 10440, 12138] | [13915, 5894, 6588, 10083, 13935, 13913, 16166... |
recall 0.044
precision 0.044
mrr 0.021
Now let's rearrange the candidates into user/item rows and define their rank order, as before.
pred_bpr = pred_bpr[['user_id', 'preds']]
pred_bpr = pred_bpr.explode('preds').rename(columns={'preds': 'item_id'})
pred_bpr['rank'] = pred_bpr.groupby('user_id').cumcount() + 1 # give rank to each item_id order
pred_bpr.head()
|   | user_id | item_id | rank |
|---|---------|---------|------|
| 1 | 21 | 849   | 1 |
| 1 | 21 | 1053  | 2 |
| 1 | 21 | 11237 | 3 |
| 1 | 21 | 826   | 4 |
| 1 | 21 | 4382  | 5 |
Add the user and item features to the dataset.
This gives us the dataset on which we will make predictions with the 2nd stage classifier.
pred_bpr_ctb = pred_bpr.copy()
# features for the test candidates
score_feat = (pred_bpr_ctb
.merge(users[user_col], on=['user_id'], how='left')
.merge(items[item_col], on=['item_id'], how='left'))
# fillna for catboost with the most frequent value
score_feat = score_feat.fillna(X_train.mode().iloc[0])
score_feat.head()
|   | user_id | item_id | rank | age | income | sex | kids_flg | content_type | countries | for_kids | age_rating | studios |
|---|---------|---------|------|-----|--------|-----|----------|--------------|-----------|----------|------------|---------|
| 0 | 21 | 849   | 1 | age_45_54 | income_20_40 | Ж | 0.0 | film | США | 0.0 | 18.0 | BBC |
| 1 | 21 | 1053  | 2 | age_45_54 | income_20_40 | Ж | 0.0 | film | США | 0.0 | 18.0 | BBC |
| 2 | 21 | 11237 | 3 | age_45_54 | income_20_40 | Ж | 0.0 | film | Россия | 0.0 | 16.0 | BBC |
| 3 | 21 | 826   | 4 | age_45_54 | income_20_40 | Ж | 0.0 | film | Великобритания | 0.0 | 16.0 | BBC |
| 4 | 21 | 4382  | 5 | age_45_54 | income_20_40 | Ж | 0.0 | film | США | 0.0 | 16.0 | BBC |
Using the 2nd stage model, we create a new sorting order `rank_ctb`, which should improve on the 1st stage model metrics.
# predict and sort by the positive-class probability
ctb_prediction = ctb_model.predict_proba(score_feat.drop(drop_col, axis=1, errors='ignore'))
pred_bpr_ctb['ctb_pred'] = ctb_prediction[:, 1] # prob for positive class
pred_bpr_ctb = pred_bpr_ctb.sort_values(
by=['user_id', 'ctb_pred'],
ascending=[True, False])
pred_bpr_ctb['rank_ctb'] = pred_bpr_ctb.groupby('user_id').cumcount() + 1
pred_bpr_ctb.head()
|   | user_id | item_id | rank | ctb_pred | rank_ctb |
|---|---------|---------|------|----------|----------|
| 1 | 21 | 11237 | 3  | 0.509732 | 1 |
| 1 | 21 | 11661 | 6  | 0.453971 | 2 |
| 1 | 21 | 15464 | 13 | 0.420558 | 3 |
| 1 | 21 | 10824 | 19 | 0.322406 | 4 |
| 1 | 21 | 12659 | 12 | 0.302753 | 5 |
true_items = test.groupby('user_id').agg(lambda x: list(x))[['item_id']].reset_index()
pred_items = pred_bpr_ctb.groupby('user_id').agg(lambda x: list(x))[['item_id']].reset_index().rename(columns={'item_id': 'preds'})
true_pred_items = true_items.merge(pred_items, how='left')
true_pred_items = true_pred_items.dropna(subset=['preds'])
true_pred_items.head()
|   | user_id | item_id | preds |
|---|---------|---------|-------|
| 1 | 21  | [13787, 14488] | [11237, 11661, 15464, 10824, 12659, 4382, 8447... |
| 2 | 30  | [4181, 8584, 8636] | [10440, 1465, 15297, 9728, 676, 12346, 12995, ... |
| 4 | 98  | [89, 512] | [1204, 9728, 12346, 15410, 14378, 6447, 9653, ... |
| 5 | 106 | [337, 1439, 2808, 2836, 5411, 6267, 10544, 128... | [3182, 16166, 5894, 9506, 11919, 4718, 5411, 1... |
| 8 | 241 | [6162, 8986, 10440, 12138] | [5894, 13915, 13913, 16166, 3182, 13018, 10761... |
Confirm how well the CatBoost reranking performs on the user recommendations.
# evaluate metrics
print('recall',round(recall(true_pred_items, k=20),3))
print('precision',round(precision(true_pred_items, k=20),3))
print('mrr',round(mrr(true_pred_items, k=20),3))
recall 0.056
precision 0.056
mrr 0.034
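Putting the two test-set evaluations side by side (both at k=20), the CatBoost reranking improves every metric over the BPR candidates alone:

| metric | BPR candidates only | BPR + CatBoost rerank |
|--------|---------------------|-----------------------|
| recall@20 | 0.044 | 0.056 |
| precision@20 | 0.044 | 0.056 |
| MRR@20 | 0.021 | 0.034 |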