Two-Stage RecSys
!pip install implicit -qqq
!pip install catboost -qqq
import datetime
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import scipy.sparse as sparse
from catboost import CatBoostClassifier
import implicit
from implicit.bpr import BayesianPersonalizedRanking as BPR
import warnings; warnings.filterwarnings('ignore')
def recall(df: pd.DataFrame, pred_col='preds', true_col='item_id', k=30) -> float:
    recall_values = []
    for _, row in df.iterrows():
        num_relevant = len(set(row[true_col]) & set(row[pred_col][:k]))
        num_true = len(row[true_col])
        recall_values.append(num_relevant / num_true)
    return np.mean(recall_values)

def precision(df: pd.DataFrame, pred_col='preds', true_col='item_id', k=30) -> float:
    precision_values = []
    for _, row in df.iterrows():
        num_relevant = len(set(row[true_col]) & set(row[pred_col][:k]))
        num_true = min(k, len(row[true_col]))
        precision_values.append(num_relevant / num_true)
    return np.mean(precision_values)

def mrr(df: pd.DataFrame, pred_col='preds', true_col='item_id', k=30) -> float:
    mrr_values = []
    for _, row in df.iterrows():
        intersection = set(row[true_col]) & set(row[pred_col][:k])
        user_mrr = 0
        if len(intersection) > 0:
            for item in intersection:
                user_mrr = max(user_mrr, 1 / (row[pred_col].index(item) + 1))
        mrr_values.append(user_mrr)
    return np.mean(mrr_values)
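As a quick sanity check of these helpers (a toy example, not part of the pipeline), this is the input format they expect: one row per user, with the true item_id values and the ranked predictions both stored as lists.
# Toy example: two users, evaluated at k=2
toy = pd.DataFrame({
    'item_id': [[1, 2, 3], [4, 5]],     # true items per user
    'preds':   [[2, 9, 1], [9, 8, 7]],  # ranked predictions per user
})
print(recall(toy, k=2))     # (1/3 + 0) / 2 ≈ 0.167
print(precision(toy, k=2))  # (1/2 + 0) / 2 = 0.25
print(mrr(toy, k=2))        # (1/1 + 0) / 2 = 0.5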
import os; os.listdir('/kaggle/input/kion-dataset/')
['items.csv', 'users.csv', 'interactions.csv']
In this notebook we will do the following:
- Split the interaction data by time into three "global" parts (`train`, `val`, `test`). We treat `train` and `val` as the data we have for building recommendations, while `test` is the period we make recommendations for; its interactions are used only as a reference to understand how well the models perform together.
- Train a 1st stage BPR model on `train` to generate ranked candidates, which are used as features for the classifier on `val` and `test`.
- Train a 2nd stage classifier on positive and negative samples (from the overlap and the mismatch between the candidates and `val`), using the item `rank` in candidate-priority order from the 1st stage prediction as a feature.
- The 2nd stage model is trained on a train subset of users, validated on a validation subset of users for early stopping, and then checked on an unseen subset of users, all within the `val` global split.
- Once we have both models, we create recommendations for the `test` global split (the last 7 days).
- Using the first stage model, we create candidates as we did on `val`, but this time join them with `test`, define the new rank order, and evaluate the metrics.
- Lastly, we use the 2nd stage model (trained on `val`) to get the probability of the positive class (`ctb_pred`) for the previous step's candidates, sort by the resulting `rank_ctb`, and re-evaluate the metrics.
KION Movie/Series Dataset:
- `interactions` contains the user/item interaction data
- `users` contains information about the user (`user_id`)
- `items` contains information about the item (`item_id`)
The dataset has the typical recommendation setup: user/item interaction data plus some information about the users and the items (which in this dataset are movies and series).
Let's start off by reading the datasets; we'll explore them below.
interactions = pd.read_csv("/kaggle/input/kion-dataset/interactions.csv")
items = pd.read_csv("/kaggle/input/kion-dataset/items.csv")
users = pd.read_csv("/kaggle/input/kion-dataset/users.csv")
# convert the column [last_watch_dt] into datetime
interactions['last_watch_dt'] = pd.to_datetime(interactions['last_watch_dt']).map(lambda x: x.date())
print(f"Уникальных юзеров в interactions: {interactions['user_id'].nunique()}")
print(f"Уникальных айтемов в interactions: {interactions['item_id'].nunique()}")
Уникальных юзеров в interactions: 962179 Уникальных айтемов в interactions: 15706
User/Item Interactions
Standard user/item interaction features:
- `user_id`: user identifier
- `item_id`: film/series identifier
- `last_watch_dt`: date the movie/series was last watched
- `total_dur`: watched duration in seconds (implicit feedback)
- `watched_pct`: watched percentage (implicit feedback)
We have a dataset of implicit user/item interactions. We'll be using `watched_pct` as our interaction column.
interactions.head()
|   | user_id | item_id | last_watch_dt | total_dur | watched_pct |
|---|---------|---------|---------------|-----------|-------------|
| 0 | 176549 | 9506 | 2021-05-11 | 4250 | 72.0 |
| 1 | 699317 | 1659 | 2021-05-29 | 8317 | 100.0 |
| 2 | 656683 | 7107 | 2021-05-09 | 10 | 0.0 |
| 3 | 864613 | 7638 | 2021-07-05 | 14483 | 100.0 |
| 4 | 964868 | 9506 | 2021-04-30 | 6725 | 100.0 |
Item Information
Information about the movies/series (`item_id`):
- `content_type`: type of item
- `title`: title of the item
- `title_orig`: original title
- `release_year`: release year
- `genres`: genres
- `countries`: countries
- `for_kids`: for-kids flag
- `age_rating`: age rating
- `studios`: film studio
- `directors`: directors
- `actors`: actors
- `keywords`: keywords
- `description`: description
users.head(2)
|   | user_id | age | income | sex | kids_flg |
|---|---------|-----|--------|-----|----------|
| 0 | 973171 | age_25_34 | income_60_90 | М | 1 |
| 1 | 962099 | age_18_24 | income_20_40 | М | 0 |
User Features
Features that describe the `user_id`:
- `age`: age group
- `income`: income group
- `sex`: gender
- `kids_flg`: kids flag
items.head(2)
|   | item_id | content_type | title | title_orig | release_year | genres | countries | for_kids | age_rating | studios | directors | actors | description | keywords |
|---|---------|--------------|-------|------------|--------------|--------|-----------|----------|------------|---------|-----------|--------|-------------|----------|
| 0 | 10711 | film | Поговори с ней | Hable con ella | 2002.0 | драмы, зарубежные, детективы, мелодрамы | Испания | NaN | 16.0 | NaN | Педро Альмодовар | Адольфо Фернандес, Ана Фернандес, Дарио Гранди... | Мелодрама легендарного Педро Альмодовара «Пого... | Поговори, ней, 2002, Испания, друзья, любовь, ... |
| 1 | 2508 | film | Голые перцы | Search Party | 2014.0 | зарубежные, приключения, комедии | США | NaN | 16.0 | NaN | Скот Армстронг | Адам Палли, Брайан Хаски, Дж.Б. Смув, Джейсон ... | Уморительная современная комедия на популярную... | Голые, перцы, 2014, США, друзья, свадьбы, прео... |
Preprocessing stages:
We do some preprocessing so that the model has enough signal to learn relations in the data when training:
1. Filter out interactions with less than 300 seconds of viewing time
2. Filter out users (`user_id`) with fewer than 10 film views
3. Filter out movies (`item_id`) with fewer than 10 views
1) Filter accidental views
The interactions dataset contains the column `total_dur`, which tells us how many seconds the user watched the `item_id`. Let's filter out items that were started but barely watched, setting the threshold at 300 seconds.
interactions = interactions[interactions['total_dur'] >= 300]
2) User Filtration
user_interactions_count = interactions.groupby('user_id')[['item_id']].count().reset_index()
filtered_users = user_interactions_count[user_interactions_count['item_id'] >= 10][['user_id']]
interactions = filtered_users.merge(interactions, how='left')
3) Film Filtration
item_interactions_count = interactions.groupby('item_id')[['user_id']].count().reset_index()
filtered_items = item_interactions_count[item_interactions_count['user_id'] >= 10][['item_id']]
interactions = filtered_items.merge(interactions, how='left')
Timeseries splitting
- We split the interactions dataset into parts based on the datetime column `last_watch_dt`, the only time-related column in the dataset
- Different parts of the interactions dataset are used for different purposes in the two-stage process
- We treat the global training and validation sets as the data we already have; beyond that date we want to make predictions, and since we actually do have those later interactions, we'll monitor the recommendation metrics against them
max_date = interactions['last_watch_dt'].max()
min_date = interactions['last_watch_dt'].min()
print(f"min дата в interactions: {min_date}")
print(f"max дата в interactions: {max_date}")
min дата в interactions: 2021-03-13 max дата в interactions: 2021-08-22
Split information:
- `test`: the last 7 days of interactions
- `train_val`: the train & validation data (everything before the `test` start date)
  - `train`: interactions up to 60 days before the `test` start date
  - `val`: the last 60 days of interactions before the `test` start date
# global test dataset starting time (7 days)
test_threshold = max_date - pd.Timedelta(days=7)
# validation dataset starting time (2 months)
val_threshold = test_threshold - pd.Timedelta(days=60)
test = interactions[(interactions['last_watch_dt'] >= test_threshold)]
train_val = interactions[(interactions['last_watch_dt'] < test_threshold)]
val = train_val[(train_val['last_watch_dt'] >= val_threshold)]
train = train_val[(train_val['last_watch_dt'] < val_threshold)]
print(f"train: {train.shape}")
print(f"val: {val.shape}")
print(f"test: {test.shape}")
train: (881660, 5)
val: (1246263, 5)
test: (172593, 5)
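As an optional sanity check (an added cell, not in the original flow), we can print each split's date range to confirm the boundaries don't overlap:
# Date ranges of each split; train should end where val starts, and val where test starts
for name, part in [('train', train), ('val', val), ('test', test)]:
    print(name, part['last_watch_dt'].min(), '-', part['last_watch_dt'].max())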
Model 1 purpose:
- The purpose of the first stage model is to generate item candidates for the second stage model
- We use the BPR model from the implicit library, which optimizes a pairwise loss suited to ranking (the standard objective is shown below, before the training code)
- The model outputs are ranked in recommendation order
The end products of this section are:
- positive samples: `item_id` values that the model (trained on `train`) recommends and that the user actually interacted with in the `val` global subset; a limit of 30 candidates (k=30) is used
- negative samples: after joining the predictions with `val`, the combinations with no `watched_pct` value, i.e. no interaction found in `interactions`
- `rank`: the position of each item in the model's prediction order; this will be one of the features in the 2nd stage model
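For reference, this is the standard BPR objective (Rendle et al.), maximized over sampled triples (u, i, j), where i is an item the user interacted with and j is a sampled negative item, and the score is the dot product of the learned user and item factors:

$$\text{BPR-OPT} = \sum_{(u,i,j)} \ln \sigma\!\left(\hat{x}_{ui} - \hat{x}_{uj}\right) - \lambda_{\Theta}\,\lVert \Theta \rVert^{2}, \qquad \hat{x}_{ui} = p_u^{\top} q_i$$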
# train model on [train]
users_id = list(np.sort(train.user_id.unique()))
items_train = list(train.item_id.unique())
ratings_train = list(train.watched_pct)
rows_train = train.user_id.astype('category').cat.codes
cols_train = train.item_id.astype('category').cat.codes
# create rating matrix (watched percentage [watched_pct])
train_sparse = sparse.csr_matrix((ratings_train, (rows_train, cols_train)),
shape=(len(users_id), len(items_train)))
algo = BPR(factors=50,
regularization=0.01,
iterations=50,
use_gpu=False)
algo.fit((train_sparse).astype('double'))
# Output [1] from first model;
# user and item factorisation matrices
user_vecs = algo.user_factors
item_vecs = algo.item_factors
# BPR implicit prediction
def predict(user_vecs, item_vecs, k=10):
    """
    Helper function for matrix factorisation prediction
    """
    id2user = dict(zip(rows_train, train.user_id))
    id2item = dict(zip(cols_train, train.item_id))
    scores = user_vecs.dot(item_vecs.T)
    # argpartition with kth=-k guarantees the last k positions hold the top-k scores
    ind_part = np.argpartition(scores, -k)[:, -k:].copy()
    scores_not_sorted = np.take_along_axis(scores, ind_part, axis=1)
    ind_sorted = np.argsort(scores_not_sorted, axis=1)
    indices = np.take_along_axis(ind_part, ind_sorted, axis=1)
    indices = np.flip(indices, 1)
    preds = pd.DataFrame({
        'user_id': range(user_vecs.shape[0]),
        'preds': indices.tolist(),
    })
    preds['user_id'] = preds['user_id'].map(id2user)
    preds['preds'] = preds['preds'].map(lambda inds: [id2item[i] for i in inds])
    return preds
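As an aside, implicit also exposes a built-in recommend method that performs the same top-N scoring directly from the fitted model; the manual dot-product version above is kept because it makes the ranking explicit. A rough sketch only, since the exact signature and return type vary across implicit versions:
# Sketch: implicit >= 0.5 returns (item_indices, scores) arrays;
# older versions instead return a list of (item_index, score) tuples
user_idx = 0  # internal row index used to build train_sparse, not the raw user_id
item_idx, item_scores = algo.recommend(user_idx, train_sparse[user_idx], N=30,
                                        filter_already_liked_items=False)
# item_idx holds internal item indices; map back to item_id with a dict like id2item in predict()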
k=30
# films watched in [val] dataset
val_user_history = val.groupby('user_id')[['item_id']].agg(lambda x: list(x))
# films recommended from algo (on train)
pred_bpr = predict(user_vecs, item_vecs, k)
pred_bpr = val_user_history.merge(pred_bpr, how='left', on='user_id')
pred_bpr = pred_bpr.dropna(subset=['preds'])
pred_bpr.head()
|   | user_id | item_id | preds |
|---|---------|---------|-------|
| 0 | 2 | [242, 3628, 5819, 7106, 7921, 8482, 9164, 1077... | [3166, 9164, 12965, 12299, 11919, 8482, 4072, ... |
| 2 | 21 | [308, 3784, 4495, 5077, 6384, 7102, 7571, 8251... | [849, 1053, 11237, 826, 4382, 11661, 24, 14703... |
| 3 | 30 | [1107, 2346, 2743, 3031, 7250, 9728, 9842, 112... | [10464, 10440, 16447, 7946, 2100, 2303, 9728, ... |
| 4 | 46 | [10440] | [142, 4880, 9996, 6809, 11640, 10440, 2498, 86... |
| 6 | 60 | [1179, 1343, 1590, 3550, 6044, 6606, 8612, 972... | [4880, 13865, 4151, 1083, 7107, 1449, 7571, 11... |
Check the metrics to see how well the model recommends relevant `item_id` values to users.
# metrics for trained bpr prediction & [val] dataset overlap
print('recall',round(recall(pred_bpr),3))
print('precision',round(precision(pred_bpr),3))
print('mrr',round(mrr(pred_bpr),3))
recall 0.116
precision 0.117
mrr 0.132
Prepare the CatBoost model dataset:
- Prepare the dataset for the 2nd stage model by exploding the predicted `item_id` lists and adding the rank order
- The rank is computed per user, starting from 1
candidates = pred_bpr[['user_id', 'preds']]
candidates = candidates.explode('preds').rename(columns={'preds': 'item_id'})
candidates['rank'] = candidates.groupby('user_id').cumcount() + 1
candidates.head()
|   | user_id | item_id | rank |
|---|---------|---------|------|
| 0 | 2 | 3166 | 1 |
| 0 | 2 | 9164 | 2 |
| 0 | 2 | 12965 | 3 |
| 0 | 2 | 12299 | 4 |
| 0 | 2 | 11919 | 5 |
Prepare the positive data samples:
- Use an inner join between `candidates` (the 1st stage model predictions) and `val` (the global validation dataset)
- Each row is an overlapping `user_id`/`item_id` combination, i.e. a recommended item that is also present in the `val` interactions
pos = candidates.merge(val,
on=['user_id', 'item_id'],
how='inner')
pos['target'] = 1
print('number of positive samples',pos.shape)
pos.head()
number of positive samples (64588, 7)
|   | user_id | item_id | rank | last_watch_dt | total_dur | watched_pct | target |
|---|---------|---------|------|---------------|-----------|-------------|--------|
| 0 | 2 | 9164 | 2 | 2021-06-23 | 6650 | 100.0 | 1 |
| 1 | 2 | 8482 | 6 | 2021-06-18 | 5886 | 100.0 | 1 |
| 2 | 30 | 9728 | 7 | 2021-06-21 | 8436 | 100.0 | 1 |
| 3 | 46 | 10440 | 6 | 2021-07-05 | 7449 | 20.0 | 1 |
| 4 | 60 | 9728 | 22 | 2021-06-23 | 8066 | 100.0 | 1 |
Prepare the negative data samples:
- There will be far more negative samples than positive ones
- The validation dataset is joined to the predictions with a left join, so we get candidate (user/item) combinations that don't exist in the `val` dataset
- We'll keep the negative-to-positive sample ratio at roughly 2:1 (the realized ratio is checked right after the sample below)
# predicted films that the user did not watch in the validation set
# join() defaults to a left join
neg = candidates.set_index(['user_id', 'item_id'])\
    .join(val.set_index(['user_id', 'item_id']))
neg = neg[neg['watched_pct'].isnull()].reset_index()
print(neg.shape)
neg = neg.sample(frac=0.07)
print(neg.shape)
neg['target'] = 0
neg.head()
(1814012, 6)
(126981, 6)
|         | user_id | item_id | rank | last_watch_dt | total_dur | watched_pct | target |
|---------|---------|---------|------|---------------|-----------|-------------|--------|
| 130488  | 80687   | 6588  | 10 | NaN | NaN | NaN | 0 |
| 1450066 | 875976  | 16356 | 21 | NaN | NaN | NaN | 0 |
| 1135561 | 685640  | 9157  | 26 | NaN | NaN | NaN | 0 |
| 736980  | 445205  | 10770 | 14 | NaN | NaN | NaN | 0 |
| 1766269 | 1069796 | 6006  | 13 | NaN | NaN | NaN | 0 |
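A quick check of the realized negative-to-positive ratio after sampling (an added cell; with the shapes printed above it comes out to roughly 126981 / 64588 ≈ 2):
# Negative-to-positive ratio after the frac=0.07 sample
print(round(len(neg) / len(pos), 2))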
Split the `val` users:
- Split the unique `val` users into train (`ctb_train_users`), validation (`ctb_eval_users`) and test (`ctb_test_users`) groups; these are used only to build the subsets for the CatBoost 2nd stage model
# divide the users into 3 subgroups
ctb_train_users, ctb_test_users = train_test_split(val['user_id'].unique(),
random_state=1,
test_size=0.2)
ctb_train_users, ctb_eval_users = train_test_split(ctb_train_users,
random_state=1,
test_size=0.1)
print('users in ctb train', ctb_train_users)
print('users in ctb eval', ctb_eval_users)
print('users in ctb test', ctb_test_users)
users in ctb train [790260 678092 74663 ... 239212 914265 167935]
users in ctb eval [633541 825546 531429 ... 102166 686044 55175]
users in ctb test [1069931 871670 834200 ... 178739 463296 1091209]
Define the subsets for the user groups we just created:
- `ctb_train`: used for training
- `ctb_eval`: used for evaluation (early stopping) during training
- `ctb_test`: used for evaluation on unseen users
select_col = ['user_id', 'item_id', 'rank', 'target']
# Training dataset
ctb_train = shuffle(
pd.concat([
pos[pos['user_id'].isin(ctb_train_users)],
neg[neg['user_id'].isin(ctb_train_users)]
])[select_col]
)
display(ctb_train.head())
# Test subset (used to check metrics on unseen users)
ctb_test = shuffle(
pd.concat([
pos[pos['user_id'].isin(ctb_test_users)],
neg[neg['user_id'].isin(ctb_test_users)]
])[select_col]
)
# Evaluation train subset
ctb_eval = shuffle(
pd.concat([
pos[pos['user_id'].isin(ctb_eval_users)],
neg[neg['user_id'].isin(ctb_eval_users)]
])[select_col]
)
|         | user_id | item_id | rank | target |
|---------|---------|---------|------|--------|
| 1755444 | 1063468 | 2498  | 1  | 0 |
| 13617   | 232693  | 10440 | 29 | 1 |
| 682599  | 412208  | 8101  | 15 | 0 |
| 1008013 | 609311  | 11275 | 12 | 0 |
| 1588295 | 960986  | 15695 | 24 | 0 |
Add the user and item features (metadata joined on `user_id` and `item_id`) to the train and evaluation subsets.
user_col = ['user_id', 'age', 'income', 'sex', 'kids_flg']
item_col = ['item_id', 'content_type', 'countries', 'for_kids', 'age_rating', 'studios']
# train
train_feat = (ctb_train
.merge(users[user_col], on=['user_id'], how='left')
.merge(items[item_col], on=['item_id'], how='left'))
# evaluation dataset with train for early stopping
eval_feat = (ctb_eval
.merge(users[user_col], on=['user_id'], how='left')
.merge(items[item_col], on=['item_id'], how='left'))
train_feat.head()
|   | user_id | item_id | rank | target | age | income | sex | kids_flg | content_type | countries | for_kids | age_rating | studios |
|---|---------|---------|------|--------|-----|--------|-----|----------|--------------|-----------|----------|------------|---------|
| 0 | 1063468 | 2498  | 1  | 0 | age_45_54 | income_20_40 | М | 0.0 | film   | Россия, Армения | NaN | 16.0 | NaN |
| 1 | 232693  | 10440 | 29 | 1 | age_25_34 | income_40_60 | Ж | 0.0 | series | Россия | NaN | 18.0 | NaN |
| 2 | 412208  | 8101  | 15 | 0 | age_25_34 | income_20_40 | Ж | 0.0 | series | Россия | NaN | 0.0  | NaN |
| 3 | 609311  | 11275 | 12 | 0 | age_35_44 | income_20_40 | М | 1.0 | film   | Россия | NaN | 16.0 | NaN |
| 4 | 960986  | 15695 | 24 | 0 | age_35_44 | income_60_90 | Ж | 1.0 | series | США    | NaN | 12.0 | NaN |
Separate the `target` from the dataset and define the feature matrices for the training and evaluation subsets.
'''
Define column information for model
- drop columns [user_id], [item_id]
- target column: [target]
- categorical columns: [age] [income] [sex] [content_type] [countries] [studios]
'''
# drop pointless columns and separate target
drop_col = ['user_id', 'item_id']
target_col = ['target']
# we will define the categorical columns in catboost
cat_col = ['age', 'income', 'sex', 'content_type', 'countries', 'studios']
X_train, y_train = train_feat.drop(drop_col + target_col, axis=1), train_feat[target_col]
X_val, y_val = eval_feat.drop(drop_col + target_col, axis=1), eval_feat[target_col]
X_train.shape, y_train.shape, X_val.shape, y_val.shape
X_train.head()
|   | rank | age | income | sex | kids_flg | content_type | countries | for_kids | age_rating | studios |
|---|------|-----|--------|-----|----------|--------------|-----------|----------|------------|---------|
| 0 | 1  | age_45_54 | income_20_40 | М | 0.0 | film   | Россия, Армения | NaN | 16.0 | NaN |
| 1 | 29 | age_25_34 | income_40_60 | Ж | 0.0 | series | Россия | NaN | 18.0 | NaN |
| 2 | 15 | age_25_34 | income_20_40 | Ж | 0.0 | series | Россия | NaN | 0.0  | NaN |
| 3 | 12 | age_35_44 | income_20_40 | М | 1.0 | film   | Россия | NaN | 16.0 | NaN |
| 4 | 24 | age_35_44 | income_60_90 | Ж | 1.0 | series | США    | NaN | 12.0 | NaN |
X_train.isna().sum()
rank                 0
age              25220
income           25080
sex              25294
kids_flg         24161
content_type         0
countries            0
for_kids        134693
age_rating           0
studios         137203
dtype: int64
Fill missing values with the most frequent value in each column.
# fillna for catboost with the most frequent value
X_train = X_train.fillna(X_train.mode().iloc[0])
# fillna for catboost with the most frequent value
X_val = X_val.fillna(X_train.mode().iloc[0])
Model settings:
- We define some standard hyperparameters, which haven't been tuned
- We'll compare model performance using the ROC AUC metric
'''
Define Hyperparameters for Classifier
'''
# model hyperparameters
est_params = {
'subsample': 0.9,
'max_depth': 5,
'n_estimators': 2000,
'learning_rate': 0.01,
'thread_count': 20,
'random_state': 42,
'verbose': 200,
}
ctb_model = CatBoostClassifier(**est_params)
import warnings; warnings.filterwarnings('ignore')
ctb_model.fit(X_train,
y_train,
eval_set=(X_val, y_val),
early_stopping_rounds=100,
cat_features=cat_col)
0:    learn: 0.6902088  test: 0.6903005  best: 0.6903005 (0)     total: 150ms   remaining: 4m 59s
200:  learn: 0.5318929  test: 0.5413494  best: 0.5413494 (200)   total: 13.4s   remaining: 1m 59s
400:  learn: 0.5237664  test: 0.5347690  best: 0.5347690 (400)   total: 28.2s   remaining: 1m 52s
600:  learn: 0.5206398  test: 0.5321545  best: 0.5321545 (600)   total: 42.8s   remaining: 1m 39s
800:  learn: 0.5188275  test: 0.5307083  best: 0.5307083 (800)   total: 57.8s   remaining: 1m 26s
1000: learn: 0.5178101  test: 0.5299766  best: 0.5299766 (1000)  total: 1m 11s  remaining: 1m 11s
1200: learn: 0.5167082  test: 0.5291069  best: 0.5291069 (1200)  total: 1m 26s  remaining: 57.6s
1400: learn: 0.5157984  test: 0.5284334  best: 0.5284334 (1400)  total: 1m 40s  remaining: 43.1s
1600: learn: 0.5151791  test: 0.5280492  best: 0.5280479 (1598)  total: 1m 54s  remaining: 28.6s
1800: learn: 0.5147084  test: 0.5277376  best: 0.5277376 (1800)  total: 2m 8s   remaining: 14.2s
1999: learn: 0.5141424  test: 0.5273592  best: 0.5273592 (1999)  total: 2m 22s  remaining: 0us
bestTest = 0.5273591755
bestIteration = 1999
<catboost.core.CatBoostClassifier at 0x78e1015b6980>
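Before evaluating, it can be useful to check which features the classifier actually leans on; the BPR `rank` is expected to be a strong signal. A small inspection sketch (an added cell, not part of the original run; setting `eval_metric='AUC'` in `est_params` would also let early stopping track the metric reported below):
# Inspect feature importances of the trained classifier
importances = ctb_model.get_feature_importance(prettified=True)
print(importances)  # DataFrame listing each feature with its importance score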
Metric evaluation:
- Score the model on the training subset and evaluate the ROC AUC metric
# prediction on train subset of users
y_pred = ctb_model.predict_proba(X_train)
f"ROC AUC score = {roc_auc_score(y_train, y_pred[:, 1]):.2f}"
'ROC AUC score = 0.78'
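For completeness, the same score can be computed on the eval subset used for early stopping (an added cell; its output is not shown in the original run):
# ROC AUC on the early-stopping eval subset
y_pred_eval = ctb_model.predict_proba(X_val)
print(f"eval ROC AUC = {roc_auc_score(y_val, y_pred_eval[:, 1]):.2f}")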
- Score the model on unseen test data and evaluate the ROC AUC metric
'''
Prepare Test Set
'''
test_feat = (ctb_test
.merge(users[user_col], on=['user_id'], how='left')
.merge(items[item_col], on=['item_id'], how='left'))
# fillna for catboost with the most frequent value
test_feat = test_feat.fillna(X_train.mode().iloc[0])
X_test, y_test = test_feat.drop(drop_col + target_col, axis=1), test_feat['target']
'''
Make prediction on test set
'''
y_pred = ctb_model.predict_proba(X_test)
f"ROC AUC score = {roc_auc_score(y_test, y_pred[:, 1]):.2f}"
'ROC AUC score = 0.77'
Time to make recommendations for the period after `val`
- We return to the global `test` dataset (i.e. the last 7 days of interaction data)
- These last 7 days of interactions are used to confirm how well the models work on unseen data
Start by grouping the `item_id` values for each user and storing them in a list, as before.
# group [item_id] for each [user_id] in test (main test)
test = test[test['user_id'].isin(val['user_id'].unique())]  # keep only test users who also appear in val
test_user_history = test.groupby('user_id')[['item_id']].agg(lambda x: list(x))
display(test_user_history.head())
| user_id | item_id |
|---------|---------|
| 3  | [47, 965, 2025, 2722, 9438, 10240] |
| 21 | [13787, 14488] |
| 30 | [4181, 8584, 8636] |
| 53 | [1445, 15629, 15810, 16426] |
| 98 | [89, 512] |
Now generate 100 candidates per user from the 1st stage model:
- We also evaluate the overlap metrics, so we know how well the first stage model alone performs on `test`
# first model prediction for k=100
pred_bpr = predict(user_vecs, item_vecs, k=100)
pred_bpr = test_user_history.merge(pred_bpr, how='left', on='user_id')
pred_bpr = pred_bpr.dropna(subset=['preds'])
display(pred_bpr.head()) # overlap b/w [train] (user_vecs/item_vecs) prediction and test set (item_id)
# determine overlapping metrics b/w test and train
print('recall',round(recall(pred_bpr, k=20),3))
print('precision',round(precision(pred_bpr, k=20),3))
print('mrr',round(mrr(pred_bpr, k=20),3))
|   | user_id | item_id | preds |
|---|---------|---------|-------|
| 1 | 21  | [13787, 14488] | [849, 1053, 11237, 826, 4382, 11661, 24, 14703... |
| 2 | 30  | [4181, 8584, 8636] | [10464, 10440, 16447, 7946, 2100, 2303, 9728, ... |
| 4 | 98  | [89, 512] | [15410, 9713, 14378, 14053, 12604, 11402, 1049... |
| 5 | 106 | [337, 1439, 2808, 2836, 5411, 6267, 10544, 128... | [16166, 3182, 9506, 15224, 4718, 16270, 10732,... |
| 8 | 241 | [6162, 8986, 10440, 12138] | [13915, 5894, 6588, 10083, 13935, 13913, 16166... |
recall 0.044
precision 0.044
mrr 0.021
Now let's rearrange the candidates into user/item rows and define their rank order, as before.
pred_bpr = pred_bpr[['user_id', 'preds']]
pred_bpr = pred_bpr.explode('preds').rename(columns={'preds': 'item_id'})
pred_bpr['rank'] = pred_bpr.groupby('user_id').cumcount() + 1 # give rank to each item_id order
pred_bpr.head()
|   | user_id | item_id | rank |
|---|---------|---------|------|
| 1 | 21 | 849   | 1 |
| 1 | 21 | 1053  | 2 |
| 1 | 21 | 11237 | 3 |
| 1 | 21 | 826   | 4 |
| 1 | 21 | 4382  | 5 |
Add the user and item features to the dataset.
This gives us the dataset on which we will make predictions with the 2nd stage classifier.
pred_bpr_ctb = pred_bpr.copy()
# features for the test candidates
score_feat = (pred_bpr_ctb
.merge(users[user_col], on=['user_id'], how='left')
.merge(items[item_col], on=['item_id'], how='left'))
# fillna for catboost with the most frequent value
score_feat = score_feat.fillna(X_train.mode().iloc[0])
score_feat.head()
|   | user_id | item_id | rank | age | income | sex | kids_flg | content_type | countries | for_kids | age_rating | studios |
|---|---------|---------|------|-----|--------|-----|----------|--------------|-----------|----------|------------|---------|
| 0 | 21 | 849   | 1 | age_45_54 | income_20_40 | Ж | 0.0 | film | США | 0.0 | 18.0 | BBC |
| 1 | 21 | 1053  | 2 | age_45_54 | income_20_40 | Ж | 0.0 | film | США | 0.0 | 18.0 | BBC |
| 2 | 21 | 11237 | 3 | age_45_54 | income_20_40 | Ж | 0.0 | film | Россия | 0.0 | 16.0 | BBC |
| 3 | 21 | 826   | 4 | age_45_54 | income_20_40 | Ж | 0.0 | film | Великобритания | 0.0 | 16.0 | BBC |
| 4 | 21 | 4382  | 5 | age_45_54 | income_20_40 | Ж | 0.0 | film | США | 0.0 | 16.0 | BBC |
Using the 2nd stage model, we create a new sorting order `rank_ctb`, which should improve on the 1st stage model metrics.
# predict and sort by the positive-class probability
ctb_prediction = ctb_model.predict_proba(score_feat.drop(drop_col, axis=1, errors='ignore'))
pred_bpr_ctb['ctb_pred'] = ctb_prediction[:, 1] # prob for positive class
pred_bpr_ctb = pred_bpr_ctb.sort_values(
by=['user_id', 'ctb_pred'],
ascending=[True, False])
pred_bpr_ctb['rank_ctb'] = pred_bpr_ctb.groupby('user_id').cumcount() + 1
pred_bpr_ctb.head()
|   | user_id | item_id | rank | ctb_pred | rank_ctb |
|---|---------|---------|------|----------|----------|
| 1 | 21 | 11237 | 3  | 0.509732 | 1 |
| 1 | 21 | 11661 | 6  | 0.453971 | 2 |
| 1 | 21 | 15464 | 13 | 0.420558 | 3 |
| 1 | 21 | 10824 | 19 | 0.322406 | 4 |
| 1 | 21 | 12659 | 12 | 0.302753 | 5 |
true_items = test.groupby('user_id').agg(lambda x: list(x))[['item_id']].reset_index()
pred_items = pred_bpr_ctb.groupby('user_id').agg(lambda x: list(x))[['item_id']].reset_index().rename(columns={'item_id': 'preds'})
true_pred_items = true_items.merge(pred_items, how='left')
true_pred_items = true_pred_items.dropna(subset=['preds'])
true_pred_items.head()
|   | user_id | item_id | preds |
|---|---------|---------|-------|
| 1 | 21  | [13787, 14488] | [11237, 11661, 15464, 10824, 12659, 4382, 8447... |
| 2 | 30  | [4181, 8584, 8636] | [10440, 1465, 15297, 9728, 676, 12346, 12995, ... |
| 4 | 98  | [89, 512] | [1204, 9728, 12346, 15410, 14378, 6447, 9653, ... |
| 5 | 106 | [337, 1439, 2808, 2836, 5411, 6267, 10544, 128... | [3182, 16166, 5894, 9506, 11919, 4718, 5411, 1... |
| 8 | 241 | [6162, 8986, 10440, 12138] | [5894, 13915, 13913, 16166, 3182, 13018, 10761... |
Confirm how well the CatBoost reranking performs on the user recommendations.
# evaluate metrics
print('recall',round(recall(true_pred_items, k=20),3))
print('precision',round(precision(true_pred_items, k=20),3))
print('mrr',round(mrr(true_pred_items, k=20),3))
recall 0.056
precision 0.056
mrr 0.034
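Putting the two test-set evaluations side by side (both at k=20), the CatBoost reranking improves every metric over the BPR candidates alone:

| metric | BPR candidates only | BPR + CatBoost rerank |
|--------|---------------------|-----------------------|
| recall@20 | 0.044 | 0.056 |
| precision@20 | 0.044 | 0.056 |
| MRR@20 | 0.021 | 0.034 |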