DSSM Towers
!pip install pytorch-lightning -qqq
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import warnings
warnings.filterwarnings('ignore')
from collections import Counter
from random import randint, random
from scipy.sparse import coo_matrix, hstack
import torch
from pathlib import Path
from tqdm import tqdm
Deep Structured Semantic Model (DSSM)
Notebook contents:
In this notebook, we'll implement a simple DSSM neural network model for recommendation systems.
We will create an embedding for each categorical user and item feature, concatenate them, and pass them through linear layers; this forms the forward method of each individual 'tower'. The value of these embeddings is that they can later be used as features in a downstream ranker model.
The outputs of the two 'towers' are combined with a dot (scalar) product to get the scores, similar to how it is done in matrix factorisation approaches. These scores can be evaluated for all items and sorted to get the top-k recommendations for each user, based on the score value (section 9).
To train the network, we cast the problem as binary classification: positive samples (items the user interacted with) are labelled 1 and negative samples 0, and the embedding layers are trained so the model learns to differentiate between the two.
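Concretely, writing $\mathbf{u}$ for the user-tower output and $\mathbf{v}_i$ for the item-tower output, the setup above amounts to the following (a compact restatement, with $i^+$ a positive and $i^-$ a sampled negative item; the sigmoid is applied inside the BCE-with-logits loss):

$$\text{score}(u, i) = \mathbf{u}^\top \mathbf{v}_i, \qquad \mathcal{L} = \mathrm{BCE}\big(\sigma(\text{score}(u, i^+)),\, 1\big) + \mathrm{BCE}\big(\sigma(\text{score}(u, i^-)),\, 0\big)$$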
Data
- The dataset contains three dataframes: user features, item features, and the interactions between users and items
- Items in this dataset are movies and series
users_df = pd.read_csv('/kaggle/input/kion-dataset/users.csv') # user dataframe
items_df = pd.read_csv('/kaggle/input/kion-dataset/items.csv')
interactions_df = pd.read_csv('/kaggle/input/kion-dataset/interactions.csv') # user/item interaction dataframe
# 5476251 user/item interactions
interactions_df.shape
(5476251, 5)
Filtering
We need to filter out users and items with low interaction counts, as they probably won't be useful
# keep only interactions with more than 35% watched
# users must have more than 10 interactions
# items must have been watched more than 10 times
print(f"N users before: {interactions_df.user_id.nunique()}")
print(f"N items before: {interactions_df.item_id.nunique()}\n")

# (1) drop interactions where the user watched less than 35 percent of the film
interactions_df = interactions_df[interactions_df.watched_pct > 35]

# collect all users who watched
# more than 10 films (you can pick a different threshold)
valid_users = []
c = Counter(interactions_df.user_id)
for user_id, entries in c.most_common():
    if entries > 10:
        valid_users.append(user_id)

# and collect all films watched by more than 10 users
valid_items = []
c = Counter(interactions_df.item_id)
for item_id, entries in c.most_common():
    if entries > 10:
        valid_items.append(item_id)

# drop unpopular films and inactive users
interactions_df = interactions_df[interactions_df.user_id.isin(valid_users)]
interactions_df = interactions_df[interactions_df.item_id.isin(valid_items)]
print(f"Number of users after filtration: {interactions_df.user_id.nunique()}")
print(f"Number of items after filtration: {interactions_df.item_id.nunique()}")
N users before: 962179
N items before: 15706

Number of users after filtration: 54213
Number of items after filtration: 6140
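For reference, the same filtering can be written in one pass with pandas value_counts. This is a sketch under the same > 10 thresholds, using its own output name so it doesn't clobber the loops above:

# vectorized alternative to the Counter loops (same thresholds, for reference only)
user_counts = interactions_df.user_id.value_counts()
item_counts = interactions_df.item_id.value_counts()
alt_filtered = interactions_df[
    interactions_df.user_id.isin(user_counts[user_counts > 10].index)
    & interactions_df.item_id.isin(item_counts[item_counts > 10].index)
]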
# common users & items between the interactions and the user/item dataframes after filtration
# intersection between interactions & users; intersection between interactions & items
common_users = set(interactions_df.user_id.unique()).intersection(set(users_df.user_id.unique()))
print(len(common_users))
common_items = set(interactions_df.item_id.unique()).intersection(set(items_df.item_id.unique()))
print(len(common_items))
interactions_df = interactions_df[interactions_df.item_id.isin(common_items)]
interactions_df = interactions_df[interactions_df.user_id.isin(common_users)]
# filtered items & users (we keep only users/items in interactions)
items_df_filtered = items_df[items_df.item_id.isin(interactions_df['item_id'].unique())].copy()
users_df_filtered = users_df[users_df.user_id.isin(interactions_df['user_id'].unique())].copy()
44959
6140
Feature encoding
We will use standard category-code encoding for the item & user features:

- `item_cat_feats`: item features to be used in the model
- `user_cat_feats`: user features to be used in the model
'''
Item features which need to be converted
'''
item_cat_feats = ['content_type', 'release_year',
'for_kids', 'age_rating',
'studios', 'countries', 'directors']
display(items_df_filtered[item_cat_feats].head())
for col in item_cat_feats:
    items_df_filtered[col] = items_df_filtered[col].fillna('unknown')
    items_df_filtered[f'{col}_encoded'] = items_df_filtered[col].astype('category').cat.codes
| | content_type | release_year | for_kids | age_rating | studios | countries | directors |
|---|---|---|---|---|---|---|---|
| 8 | film | 2018.0 | NaN | 16.0 | NaN | Испания | Асгар Фархади |
| 10 | film | 2018.0 | NaN | 18.0 | NaN | Великобритания | Тревор Нанн |
| 16 | film | 2019.0 | NaN | 18.0 | NaN | США | Бретт Пирс, Дрю Т. Пирс |
| 20 | film | 2019.0 | NaN | 6.0 | NaN | США | NaN |
| 38 | film | 2004.0 | NaN | 12.0 | NaN | США | Дэвид Кепп |
user_cat_feats = ["age", "income", "sex", "kids_flg"]
for col in user_cat_feats:
    users_df_filtered[col] = users_df_filtered[col].fillna('unknown')
    users_df_filtered[f'{col}_encoded'] = users_df_filtered[col].astype('category').cat.codes
users_df_filtered[user_cat_feats].head()
| | age | income | sex | kids_flg |
|---|---|---|---|---|
| 24 | age_35_44 | income_20_40 | Ж | 1 |
| 27 | age_25_34 | income_40_60 | М | 1 |
| 66 | age_25_34 | income_20_40 | М | 0 |
| 81 | age_25_34 | income_20_40 | М | 0 |
| 136 | age_65_inf | income_20_40 | М | 0 |
ID mapping
`user_id` & `item_id` are just arbitrary numbers; let's create a mapper to new ids that start from 0
'''
Convert item / user ids to new normalised numbering from 0...
'''
# converted in order from 0 for user/items
interactions_df["uid"] = interactions_df["user_id"].astype("category")
interactions_df["uid"] = interactions_df["uid"].cat.codes
interactions_df["iid"] = interactions_df["item_id"].astype("category")
interactions_df["iid"] = interactions_df["iid"].cat.codes
# lets confirm they start from 0
print(sorted(interactions_df.iid.unique())[:5])
print(sorted(interactions_df.uid.unique())[:5])
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
interactions_df.head()
| | user_id | item_id | last_watch_dt | total_dur | watched_pct | uid | iid |
|---|---|---|---|---|---|---|---|
| 0 | 176549 | 9506 | 2021-05-11 | 4250 | 72.0 | 7234 | 3500 |
| 1 | 699317 | 1659 | 2021-05-29 | 8317 | 100.0 | 28657 | 594 |
| 14 | 5324 | 8437 | 2021-04-18 | 6598 | 92.0 | 208 | 3084 |
| 18 | 927973 | 9617 | 2021-06-19 | 8422 | 100.0 | 37990 | 3536 |
| 20 | 896751 | 8081 | 2021-05-17 | 6358 | 100.0 | 36735 | 2953 |
# build the iid <-> item_id mappers from the interactions
iid_to_item_id = interactions_df[["iid", "item_id"]].drop_duplicates().set_index("iid").to_dict()["item_id"]
item_id_to_iid = interactions_df[["iid", "item_id"]].drop_duplicates().set_index("item_id").to_dict()["iid"]
# build the uid <-> user_id mappers from the interactions
uid_to_user_id = interactions_df[["uid", "user_id"]].drop_duplicates().set_index("uid").to_dict()["user_id"]
user_id_to_uid = interactions_df[["uid", "user_id"]].drop_duplicates().set_index("user_id").to_dict()["uid"]
# add iid to item dataframe
items_df_filtered["iid"] = items_df_filtered["item_id"].apply(lambda x: item_id_to_iid[x])
items_df_filtered = items_df_filtered.set_index("iid")
# add uid to user dataframe
users_df_filtered["uid"] = users_df_filtered["user_id"].apply(lambda x: user_id_to_uid[x])
users_df_filtered = users_df_filtered.set_index("uid")
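A quick sanity check that the two mapper pairs are mutual inverses; a small illustrative snippet (not from the original notebook):

# round-trip check: uid -> user_id -> uid and iid -> item_id -> iid
some_uid = interactions_df.uid.iloc[0]
some_iid = interactions_df.iid.iloc[0]
assert user_id_to_uid[uid_to_user_id[some_uid]] == some_uid
assert item_id_to_iid[iid_to_item_id[some_iid]] == some_iid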
Dataset and DataModule
Using the interactions data, each row holds a (user_id, item_id) pair. From these ids the dataset returns the user's features, the features of the film they interacted with (the positive sample), and the features of a randomly chosen item (the negative sample).
from torch.utils.data import Dataset, DataLoader
SEED = 42
class TupleDataset(Dataset):
    def __init__(self,
                 user_pos_pairs: np.ndarray,    # two-dimensional array of user/item interactions
                 user_features: pd.DataFrame,   # encoded user features
                 item_features: pd.DataFrame,   # encoded item features
                 n_negatives: int = 1) -> None:
        self.user_pos_pairs = user_pos_pairs  # (user, item) pair numpy matrix
        self.user_features = user_features    # user feature dataframe
        self.item_features = item_features    # item feature dataframe
        self.all_items = item_features.index.values
        self.rng = np.random.default_rng(SEED)
        self.n_negatives = n_negatives

    def __len__(self):
        return len(self.user_pos_pairs)

    def __getitem__(self, index):
        # uid and the iid of the positive item
        user, pos = self.user_pos_pairs[index]
        # pick a random item for the user; it will be our negative
        # (note: .item() assumes n_negatives == 1)
        negative = self.rng.choice(self.all_items, size=self.n_negatives).item()
        user_features = self.user_features.loc[user].to_dict()     # user features
        pos_features = self.item_features.loc[pos].to_dict()       # positive item features
        neg_features = self.item_features.loc[negative].to_dict()  # randomly sampled negative item features
        return {
            'user_features': user_features,
            'pos_features': pos_features,
            'neg_features': neg_features
        }
from pytorch_lightning import LightningDataModule
class DssmDataModule(LightningDataModule):
    def __init__(self,
                 train_ds: TupleDataset,
                 train_batch_size: int = 1):
        super().__init__()
        self.train_ds = train_ds
        self.train_batch_size = train_batch_size

    def train_dataloader(self):
        return DataLoader(self.train_ds,
                          batch_size=self.train_batch_size)
Feature matrices
Select subsets of the user and item dataframes, keeping only the encoded columns
# user / item features dataframe
user_feature_cols = [f'{col}_encoded' for col in user_cat_feats] # user features column names
item_feature_cols = [f'{col}_encoded' for col in item_cat_feats] # item feature column names
user_features = users_df_filtered[user_feature_cols]
item_features = items_df_filtered[item_feature_cols]
user_features.head()
| uid | age_encoded | income_encoded | sex_encoded | kids_flg_encoded |
|---|---|---|---|---|
| 11047 | 2 | 2 | 1 | 1 |
| 15756 | 1 | 3 | 2 | 1 |
| 8852 | 1 | 2 | 2 | 0 |
| 21134 | 1 | 2 | 2 | 0 |
| 33752 | 5 | 2 | 2 | 0 |
User/item interaction pairs
# user / item interactions
pairs = interactions_df[['uid', 'iid']].values # user uid interacted with iid
pairs
array([[ 7234, 3500],
       [28657,  594],
       [  208, 3084],
       ...,
       [17861, 4980],
       [24986, 2581],
       [15738, 6017]], dtype=int32)
Building the training data
a) Dataset containing grouped positive item features, negative item features, and the corresponding user features
# create the dataset that yields dictionaries of
# user, positive-item and negative-item features
train_ds = TupleDataset(user_pos_pairs=pairs,
                        user_features=user_features,
                        item_features=item_features)
# example of data from TupleDataset
import pprint; pprint.pprint(train_ds[0])
{'neg_features': {'age_rating_encoded': 2,
                  'content_type_encoded': 1,
                  'countries_encoded': 459,
                  'directors_encoded': 1626,
                  'for_kids_encoded': 2,
                  'release_year_encoded': 87,
                  'studios_encoded': 24},
 'pos_features': {'age_rating_encoded': 0,
                  'content_type_encoded': 0,
                  'countries_encoded': 322,
                  'directors_encoded': 1946,
                  'for_kids_encoded': 2,
                  'release_year_encoded': 83,
                  'studios_encoded': 24},
 'user_features': {'age_encoded': 2,
                   'income_encoded': 3,
                   'kids_flg_encoded': 0,
                   'sex_encoded': 2}}
b) Create the batched data module
batch_size = 4
dm = DssmDataModule(train_ds=train_ds,            # the dataset created above
                    train_batch_size=batch_size)  # number of samples per batch
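Note that DssmDataModule keeps the DataLoader default of shuffle=False. For a real training run you would usually reshuffle every epoch; an illustrative variant (shuffle=True is an assumption here, not the notebook's original setting):

# illustrative: a shuffled dataloader built from the same dataset
shuffled_loader = DataLoader(train_ds,
                             batch_size=batch_size,
                             shuffle=True)  # reshuffle samples every epoch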
Sample of batch data
# next batch data
# next(iter(dm.train_dataloader())).keys()
batch = next(iter(dm.train_dataloader()))
pprint.pprint(batch)
{'neg_features': {'age_rating_encoded': tensor([2, 2, 4, 0]),
                  'content_type_encoded': tensor([1, 0, 0, 0]),
                  'countries_encoded': tensor([294, 332, 294, 322]),
                  'directors_encoded': tensor([ 618, 2630,  208, 3926]),
                  'for_kids_encoded': tensor([2, 2, 2, 2]),
                  'release_year_encoded': tensor([54, 90, 82, 73]),
                  'studios_encoded': tensor([24, 24, 24, 24])},
 'pos_features': {'age_rating_encoded': tensor([0, 1, 3, 2]),
                  'content_type_encoded': tensor([0, 0, 0, 0]),
                  'countries_encoded': tensor([322, 294, 323, 322]),
                  'directors_encoded': tensor([1946, 1749,  371,  559]),
                  'for_kids_encoded': tensor([2, 2, 2, 2]),
                  'release_year_encoded': tensor([83, 84, 89, 86]),
                  'studios_encoded': tensor([24, 24, 24, 24])},
 'user_features': {'age_encoded': tensor([2, 2, 0, 1]),
                   'income_encoded': tensor([3, 3, 2, 2]),
                   'kids_flg_encoded': tensor([0, 0, 0, 0]),
                   'sex_encoded': tensor([2, 2, 1, 2])}}
Set parameters for embedding matrices
N_FACTORS = 64 # number of factors in each embedding
CAT_EMBEDDING_DIM = 16
# the dataframes contain a user_id/item_id column; remember it is not a training feature!
ITEM_MODEL_SHAPE = len(item_feature_cols) * CAT_EMBEDDING_DIM
USER_META_MODEL_SHAPE = len(user_feature_cols) * CAT_EMBEDDING_DIM
# USER_INTERACTION_MODEL_SHAPE = (interactions_vec.shape[1], )
print(f"N_FACTORS: {N_FACTORS}")
print(f"ITEM_MODEL_SHAPE: {ITEM_MODEL_SHAPE}")
print(f"USER_META_MODEL_SHAPE: {USER_META_MODEL_SHAPE}")
N_FACTORS: 64
ITEM_MODEL_SHAPE: 112
USER_META_MODEL_SHAPE: 64
import torch.nn as nn
# individual tower part of the model
class DssmTower(nn.Module):
    def __init__(self,
                 feat_vocab_sizes: dict[str, int],
                 cat_feat_emb_dim: int,
                 hidden_dim: int,
                 out_dim: int):
        super().__init__()
        # one embedding table per categorical feature
        self.embedding = nn.ModuleDict({
            cat_feat: nn.Embedding(cat_feat_vocab_size, cat_feat_emb_dim)
            for cat_feat, cat_feat_vocab_size in feat_vocab_sizes.items()
        })
        self.features = list(feat_vocab_sizes.keys())
        self.layer_1 = nn.Linear(cat_feat_emb_dim * len(feat_vocab_sizes), hidden_dim)
        self.layer_2 = nn.Linear(hidden_dim, hidden_dim)
        self.layer_3 = nn.Linear(hidden_dim, out_dim)

    def forward(self, batch):
        # concatenate all embeddings (they share the same embedding dimension)
        embeddings = []
        for feature in self.features:
            feature_embedding = self.embedding[feature](batch[feature])
            embeddings.append(feature_embedding)
        embedding = torch.cat(embeddings, dim=1)
        layer_1 = self.layer_1(embedding)
        layer_2 = self.layer_2(layer_1)
        layer_2 += layer_1  # residual connection
        output = self.layer_3(layer_2)
        return output
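A quick shape check on a freshly constructed tower; a minimal sketch with a hypothetical two-feature vocabulary and a dummy batch of zero-indices:

# dummy tower: 2 categorical features, batch of 2, expect output (2, out_dim)
_tower = DssmTower(feat_vocab_sizes={'f1': 5, 'f2': 3},
                   cat_feat_emb_dim=4, hidden_dim=8, out_dim=6)
_dummy = {'f1': torch.zeros(2, dtype=torch.long),
          'f2': torch.zeros(2, dtype=torch.long)}
print(_tower(_dummy).shape)  # torch.Size([2, 6])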
# number of unique elements in column
user_feat_vocab_size = dict()
for col in user_features.columns:
    user_feat_vocab_size[col] = user_features[col].nunique()
user_feat_vocab_size
{'age_encoded': 7, 'income_encoded': 7, 'sex_encoded': 3, 'kids_flg_encoded': 2}
# number of unique elements in column
item_feat_vocab_size = dict()
for col in item_features.columns:
    item_feat_vocab_size[col] = item_features[col].nunique()
item_feat_vocab_size
{'content_type_encoded': 2, 'release_year_encoded': 92, 'for_kids_encoded': 3, 'age_rating_encoded': 6, 'studios_encoded': 28, 'countries_encoded': 535, 'directors_encoded': 4041}
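Since cat.codes produces values 0..n-1 and nn.Embedding indexes rows 0..vocab_size-1, sizing the tables with nunique() works here. A defensive check worth keeping if the encoding ever changes (an added sketch, not original code):

# every embedding index must stay below its table size, or lookup will raise
for col, size in item_feat_vocab_size.items():
    assert item_features[col].max() < size
for col, size in user_feat_vocab_size.items():
    assert user_features[col].max() < size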
from pytorch_lightning import LightningModule
from torch.optim import Adam
# Lightning Module
class DssmLitModule(LightningModule):
    def __init__(self, user_tower: DssmTower,
                 item_tower: DssmTower,
                 optim_hparams: dict):
        super().__init__()
        self.user_tower = user_tower  # neural network for users
        self.item_tower = item_tower  # neural network for items
        self.optim_hparams = optim_hparams
        self.loss = torch.nn.BCEWithLogitsLoss()

    def training_step(self, batch, batch_idx):
        # extract the feature dictionaries from the batch
        user_feats = batch['user_features']  # feature name -> tensor of batch values
        pos_feats = batch['pos_features']    # ''
        neg_feats = batch['neg_features']    # ''
        # pass each feature dictionary through its tower (forward method)
        user_embs = self.user_tower(user_feats)  # (batch_size x out_dim)
        pos_embs = self.item_tower(pos_feats)    # (batch_size x out_dim)
        neg_embs = self.item_tower(neg_feats)    # (batch_size x out_dim)
        # stack positive and negative item embeddings: (batch_size x 2 x out_dim)
        item_embs = torch.cat((pos_embs.unsqueeze(1),
                               neg_embs.unsqueeze(1)), dim=1)
        # dot scores against both the positive and the negative item: (batch_size x 2)
        dot_scores = (user_embs.unsqueeze(1) @ item_embs.transpose(1, 2)).squeeze()
        # column 1 holds the negative samples (label = 0)
        labels = torch.zeros_like(dot_scores)  # zeros of shape (batch_size x 2)
        labels[:, 0] = 1  # column 0 holds the positive samples (label = 1)
        loss = self.loss(dot_scores, labels)  # loss per batch
        self.log('train_loss',
                 loss.item(),
                 on_epoch=True,
                 on_step=True,
                 prog_bar=True)
        return loss

    def configure_optimizers(self):
        return Adam(self.parameters(), **self.optim_hparams)
Define the tower parts of the neural network. We will train two towers: one for the user features and another for the item features.
# user tower of nn (input into dssm_module)
user_tower = DssmTower(
    feat_vocab_sizes=user_feat_vocab_size,
    cat_feat_emb_dim=CAT_EMBEDDING_DIM,
    hidden_dim=64,
    out_dim=64  # output dimension for each tower
)
# item tower of nn (input into dssm_module)
item_tower = DssmTower(
    feat_vocab_sizes=item_feat_vocab_size,
    cat_feat_emb_dim=CAT_EMBEDDING_DIM,
    hidden_dim=64,
    out_dim=64  # output dimension for each tower
)
# entire network using the Lightning Module
dssm_module = DssmLitModule(user_tower,
                            item_tower,
                            optim_hparams={'lr': 1e-3})
# (user_tower) 16 x 4 -> 64
# (item_tower) 16 x 7 -> 112 features
# one batch prediction
with torch.no_grad():
    print(dssm_module.training_step(batch, 0))
# # dot scores for batch 4
# tensor([[-0.1367, -0.7055],
# [ 0.8072, 1.3774],
# [ 0.4891, 0.5485],
# [ 1.1119, 1.1324]])
# # labels
# tensor([[1., 0.],
# [1., 0.],
# [1., 0.],
# [1., 0.]])
# # loss per batch
# tensor(0.7894)
tensor(1.1566)
Training
In this example, we would train for only one epoch, just as a demonstration (the trainer call is left commented out)
from pytorch_lightning import Trainer
# trainer = Trainer(max_epochs=1, enable_checkpointing=False)
# trainer.fit(dssm_module, dm)
Inference
- Once we have a trained model, the `dssm_module` instance contains the updated model weights
- For a particular user and film combination, we can evaluate the score. Repeating this for all items and sorting by score gives the top-k films to recommend to the user (see the sketch at the end of this section)
# select a random user
rand_uid = np.random.choice(list(user_features.index))
# get the user's features
rand_uid_feats = user_features.loc[rand_uid].to_dict()

# select a random item
rand_iid = np.random.choice(list(item_features.index))
# get the item's features
rand_iid_feats = item_features.loc[rand_iid].to_dict()
print('random user',rand_uid)
print('random user features')
pprint.pprint(rand_uid_feats)
print('')
print('random item',rand_iid)
print('random item features')
pprint.pprint(rand_iid_feats)
random user 40981
random user features
{'age_encoded': 1,
 'income_encoded': 3,
 'kids_flg_encoded': 0,
 'sex_encoded': 1}

random item 5654
random item features
{'age_rating_encoded': 3,
 'content_type_encoded': 0,
 'countries_encoded': 322,
 'directors_encoded': 3012,
 'for_kids_encoded': 2,
 'release_year_encoded': 70,
 'studios_encoded': 24}
# convert the feature values to tensors
for key in rand_uid_feats:
    rand_uid_feats[key] = torch.tensor([rand_uid_feats[key]])
for key in rand_iid_feats:
    rand_iid_feats[key] = torch.tensor([rand_iid_feats[key]])
pprint.pprint(rand_uid_feats)
{'age_encoded': tensor([1]),
 'income_encoded': tensor([3]),
 'kids_flg_encoded': tensor([0]),
 'sex_encoded': tensor([1])}
Using the towers, run a forward pass with the current model weights
dssm_module.user_tower(rand_uid_feats).size()
torch.Size([1, 64])
dssm_module.item_tower(rand_iid_feats).size()
torch.Size([1, 64])
Calculate the dot score for the user/item combination; this can be done in a couple of ways
(dssm_module.user_tower(rand_uid_feats) * dssm_module.item_tower(rand_iid_feats)).sum()
tensor(-2.4099, grad_fn=<SumBackward0>)
dssm_module.user_tower(rand_uid_feats) @ dssm_module.item_tower(rand_iid_feats).transpose(0,1)
tensor([[-2.4099]], grad_fn=<MmBackward0>)
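Finally, to get the top-k recommendations described at the start of this section, we can score every item for the chosen user and sort. A minimal sketch (tensorizing the full item_features table this way is an assumption about how you would serve the model, not code from the original notebook; it reuses the tensorized rand_uid_feats from above):

# score all items for one user and take the top-k iids
K = 10
with torch.no_grad():
    # every item's encoded features as long tensors, one entry per column
    all_item_feats = {col: torch.tensor(item_features[col].values, dtype=torch.long)
                      for col in item_features.columns}
    item_vecs = dssm_module.item_tower(all_item_feats)  # (n_items x 64)
    user_vec = dssm_module.user_tower(rand_uid_feats)   # (1 x 64)
    scores = (user_vec @ item_vecs.T).squeeze(0)        # (n_items,)
    top_scores, top_idx = torch.topk(scores, k=K)
# rows of item_vecs follow item_features order, so map positions back to iids
top_iids = item_features.index.values[top_idx.numpy()]
print(top_iids)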