CatBoost Ranker
!pip install catboost category_encoders sentence_transformers -qqq
%matplotlib inline
import pandas as pd
from catboost import CatBoostRanker, Pool
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import random
from sklearn.metrics import ndcg_score
from category_encoders.cat_boost import CatBoostEncoder
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer
1 | The ranking task
▎Ranking Books
The ranking task in recommender systems is to order items (for example, products, movies, articles, and so on) by how relevant they are to a specific user. The main goal is to show the user the most suitable and interesting recommendations and thereby improve their experience with the system.
In this example we use a dataset of book ratings. Unlike the previous notebook, this dataset contains about a million user ratings of books.
2 | Loading the data
▎User interactions
- User-ID: the user
- ISBN: the book
- Book-Rating: the rating the user gave
In this dataset there is only one feature describing user-book interactions: the rating a user gave to a book, Book-Rating. It will also be our target variable.
ratings_df = pd.read_csv('/kaggle/input/book-recommendation-dataset/Ratings.csv')
ratings_df.head()
|   | User-ID | ISBN | Book-Rating |
|---|---------|------|-------------|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |
The interactions table contains about a million rows:
ratings_df.shape
(1149780, 3)
▎Users
We also have several features describing the users who gave the ratings:
users_df = pd.read_csv('/kaggle/input/book-recommendation-dataset/Users.csv')
users_df.head()
|   | User-ID | Location | Age |
|---|---------|----------|-----|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | NaN |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | NaN |
▎Items
As last time, our items are books:
books_df = pd.read_csv('/kaggle/input/book-recommendation-dataset/Books.csv')
books_df.head()
/tmp/ipykernel_30/3442253096.py:1: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
  books_df = pd.read_csv('/kaggle/input/book-recommendation-dataset/Books.csv')
|   | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|------|------------|-------------|---------------------|-----------|-------------|-------------|-------------|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |
▎Merging the data
Let's add the user and book features to the ratings data:
df = pd.merge(ratings_df, users_df, on='User-ID', how='left')
df = pd.merge(books_df, df, on='ISBN', how='left')
df.head()
|   | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L | User-ID | Book-Rating | Location | Age |
|---|------|------------|-------------|---------------------|-----------|-------------|-------------|-------------|---------|-------------|----------|-----|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | 2.0 | 0.0 | stockton, california, usa | 18.0 |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | 8.0 | 5.0 | timmins, ontario, canada | NaN |
| 2 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | 11400.0 | 0.0 | ottawa, ontario, canada | 49.0 |
| 3 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | 11676.0 | 8.0 | n/a, n/a, n/a | NaN |
| 4 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | 41385.0 | 0.0 | sudbury, ontario, canada | NaN |
df['Book-Author'] = df['Book-Author'].fillna('unknown')
df['Publisher'] = df['Publisher'].fillna('unknown')
# All unique publication years
df['Year-Of-Publication'].unique()
array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994, 2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980, 1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974, 1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960, 1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954, 1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011, 1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030, 1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934, 1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901, 2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004', '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993', '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996', '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988', '1963', '1956', '1970', '1985', '1978', '1973', '1980', '1979', '1975', '1969', '1961', '1965', '1939', '1958', '1950', '1953', '1966', '1971', '1959', '1972', '1955', '1957', '1945', '1960', '1967', '1932', '1924', '1964', '2012', '1911', '1927', '1948', '1962', '2006', '1952', '1940', '1951', '1931', '1954', '2005', '1930', '1941', '1944', 'DK Publishing Inc', '1943', '1938', '1900', '1942', '1923', '1920', '1933', 'Gallimard', '1909', '1946', '2008', '1378', '2030', '1936', '1947', '2011', '2020', '1919', '1949', '1922', '1897', '2024', '1376', '1926', '2037'], dtype=object)
df['Year-Of-Publication'].value_counts().iloc[:10]
Year-Of-Publication
2002    87297
2001    75328
1999    70228
2003    69235
2000    67595
1998    59655
1997    55196
1996    54716
1995    49964
1994    42765
Name: count, dtype: int64
print(df.shape)
df['Year-Of-Publication'] = pd.to_numeric(df['Year-Of-Publication'], errors='coerce')
df['Year-Of-Publication'].unique()
(1032345, 12)
array([2002., 2001., 1991., 1999., 2000., 1993., 1996., 1988., 2004., 1998., 1994., 2003., 1997., 1983., 1979., 1995., 1982., 1985., 1992., 1986., 1978., 1980., 1952., 1987., 1990., 1981., 1989., 1984., 0., 1968., 1961., 1958., 1974., 1976., 1971., 1977., 1975., 1965., 1941., 1970., 1962., 1973., 1972., 1960., 1966., 1920., 1956., 1959., 1953., 1951., 1942., 1963., 1964., 1969., 1954., 1950., 1967., 2005., 1957., 1940., 1937., 1955., 1946., 1936., 1930., 2011., 1925., 1948., 1943., 1947., 1945., 1923., 2020., 1939., 1926., 1938., 2030., 1911., 1904., 1949., 1932., 1928., 1929., 1927., 1931., 1914., 2050., 1934., 1910., 1933., 1902., 1924., 1921., 1900., 2038., 2026., 1944., 1917., 1901., 2010., 1908., 1906., 1935., 1806., 2021., 2012., 2006., nan, 1909., 2008., 1378., 1919., 1922., 1897., 2024., 1376., 2037.])
# treat implausible ages (over 100) as missing
df['Age'] = np.where(df['Age'] > 100, None, df['Age'])
# replace non-positive years with the median year, cap future years at 2021,
# and cast to string so the year can later be treated as a categorical feature
df['Year-Of-Publication'] = np.where(
    df['Year-Of-Publication'] <= 0,
    np.nanmedian(df['Year-Of-Publication']),
    df['Year-Of-Publication']
).clip(0, 2021).astype(str)
# keep only explicit ratings (a rating of 0 marks an implicit interaction in this dataset)
df = df[df['Book-Rating'] > 0]
# Location has the form "city, state, country"; split it into three features
df['city'] = df['Location'].apply(lambda x: x.split(',')[0].strip())
df['state'] = df['Location'].apply(lambda x: x.split(',')[1].strip())
df['country'] = df['Location'].apply(lambda x: x.split(',')[2].strip())
# build the list of unique users
users = df['User-ID'].unique()
random.shuffle(users)
# split users into train, validation and test in a 0.7 : 0.1 : 0.2 ratio
train_users = users[:int(0.7*len(users))]
val_users = users[int(0.7*len(users)):int(0.8*len(users))]
test_users = users[int(0.8*len(users)):]
# train, val and test dataframes
train_df = df[df['User-ID'].isin(train_users)]
val_df = df[df['User-ID'].isin(val_users)]
test_df = df[df['User-ID'].isin(test_users)]
▎Creating embeddings
Let's create embedding features from the book title, Book-Title:
train_df['Book-Title']
1                                               Clara Callan
3                                               Clara Callan
5                                               Clara Callan
8                                               Clara Callan
11                                              Clara Callan
                                 ...
1032308                                     You Got an Ology
1032310                    Illustrated Encyclopedia of Cacti
1032314    Lewis Carroll: A Traves Del Espejo Y Lo Que Al...
1032337                                    Cocktail Classics
1032339       Flashpoints: Promise and Peril in a New World
Name: Book-Title, Length: 271753, dtype: object
# initialize the text embedding model
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# compute train, val and test embeddings
train_books = train_df.loc[:, ["ISBN", "Book-Title"]].drop_duplicates()
val_books = val_df.loc[:, ["ISBN", "Book-Title"]].drop_duplicates()
test_books = test_df.loc[:, ["ISBN", "Book-Title"]].drop_duplicates()
train_embeddings = model.encode(train_books["Book-Title"].tolist(), normalize_embeddings=True)
val_embeddings = model.encode(val_books["Book-Title"].tolist(), normalize_embeddings=True)
test_embeddings = model.encode(test_books["Book-Title"].tolist(), normalize_embeddings=True)
train_embeddings
array([[-0.03592037, -0.03050607, -0.00902679, ..., -0.02955211,
        -0.038664  ,  0.0858156 ],
       [-0.02361225,  0.08952442, -0.01865784, ..., -0.01826494,
        -0.01039198,  0.06050469],
       [-0.03252224,  0.00791598, -0.05199765, ..., -0.01043319,
        -0.05725905,  0.02445595],
       ...,
       [ 0.00055834, -0.00024567,  0.00981628, ..., -0.01239159,
        -0.01077607, -0.01399653],
       [-0.00267682, -0.02064378,  0.00749625, ...,  0.03351401,
         0.02761846,  0.02537339],
       [-0.03020654, -0.05159961,  0.04820484, ...,  0.01754397,
        -0.06948989,  0.02908983]], dtype=float32)
train_embeddings.shape
(118618, 384)
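Because we passed normalize_embeddings=True, every row has unit L2 norm, so a plain dot product between two rows is their cosine similarity. A minimal sketch comparing the first two encoded titles:
# the embeddings are L2-normalized, so a dot product is a cosine similarity
sim = float(train_embeddings[0] @ train_embeddings[1])
print(f"cosine similarity of the first two titles: {sim:.3f}")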
The embedding vectors are quite high-dimensional (384), so we apply a dimensionality-reduction method.
# reduce the dimensionality with PCA
pca = PCA(n_components=0.8, random_state=42)
train_embeddings = pca.fit_transform(train_embeddings)
val_embeddings = pca.transform(val_embeddings)
test_embeddings = pca.transform(test_embeddings)
train_embeddings.shape
(118618, 87)
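Here n_components=0.8 asks PCA for the smallest number of components whose cumulative explained variance reaches 80%; in this run that came to 87 components. A quick check:
# number of components kept and the share of variance they explain
print(pca.n_components_)                    # 87
print(pca.explained_variance_ratio_.sum())  # ~0.8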
# add the embeddings as features
def add_embeddings(df, embeddings, books):
embeddings_df = pd.DataFrame(embeddings)
embeddings_df.columns = [f"Book-Title_{i}" for i in embeddings_df.columns]
books = pd.merge(books.reset_index(drop=True), embeddings_df, left_index=True, right_index=True)
return pd.merge(df, books, on=["ISBN", "Book-Title"])
train_df = add_embeddings(train_df, train_embeddings, train_books)
val_df = add_embeddings(val_df, val_embeddings, val_books)
test_df = add_embeddings(test_df, test_embeddings, test_books)
train_df.shape
(271753, 102)
▎Transforming categorical features
We use CatBoostEncoder from the category_encoders library, which implements CatBoost-style ordered target encoding.
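The encoder replaces each category with an ordered running estimate of the target mean for that category, which limits direct target leakage. A toy sketch on made-up data, just to illustrate the output:
# toy illustration (made-up data): categories become ordered target-mean estimates
toy_X = pd.DataFrame({'color': ['red', 'blue', 'red', 'blue', 'red']})
toy_y = pd.Series([1, 0, 1, 1, 0])
print(CatBoostEncoder(cols=['color']).fit_transform(toy_X, toy_y))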
# Features we will not use
EXCLUDE_FEATURES = ['city', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L', 'User-ID', 'ISBN', 'Location','Book-Title','Book-Rating']
# Categorical features to encode
CATEGORICAL_FEATURES = ['Book-Author', 'Year-Of-Publication', 'Publisher', 'state', 'country']
# Target variable
TARGET = ['Book-Rating']
encoder = CatBoostEncoder()
train_df[CATEGORICAL_FEATURES] = encoder.fit_transform(train_df[CATEGORICAL_FEATURES],
train_df[TARGET])
val_df[CATEGORICAL_FEATURES] = encoder.transform(val_df[CATEGORICAL_FEATURES])
test_df[CATEGORICAL_FEATURES] = encoder.transform(test_df[CATEGORICAL_FEATURES])
Warning: No categorical columns found. Calling 'transform' will only return input data.
All the features we will feed to the ranking model:
FEATURES = [feat for feat in train_df.columns if feat not in EXCLUDE_FEATURES]
print('\n',', '.join(map(repr, FEATURES)))
'Book-Author', 'Year-Of-Publication', 'Publisher', 'Age', 'state', 'country', 'Book-Title_0', 'Book-Title_1', 'Book-Title_2', 'Book-Title_3', 'Book-Title_4', 'Book-Title_5', 'Book-Title_6', 'Book-Title_7', 'Book-Title_8', 'Book-Title_9', 'Book-Title_10', 'Book-Title_11', 'Book-Title_12', 'Book-Title_13', 'Book-Title_14', 'Book-Title_15', 'Book-Title_16', 'Book-Title_17', 'Book-Title_18', 'Book-Title_19', 'Book-Title_20', 'Book-Title_21', 'Book-Title_22', 'Book-Title_23', 'Book-Title_24', 'Book-Title_25', 'Book-Title_26', 'Book-Title_27', 'Book-Title_28', 'Book-Title_29', 'Book-Title_30', 'Book-Title_31', 'Book-Title_32', 'Book-Title_33', 'Book-Title_34', 'Book-Title_35', 'Book-Title_36', 'Book-Title_37', 'Book-Title_38', 'Book-Title_39', 'Book-Title_40', 'Book-Title_41', 'Book-Title_42', 'Book-Title_43', 'Book-Title_44', 'Book-Title_45', 'Book-Title_46', 'Book-Title_47', 'Book-Title_48', 'Book-Title_49', 'Book-Title_50', 'Book-Title_51', 'Book-Title_52', 'Book-Title_53', 'Book-Title_54', 'Book-Title_55', 'Book-Title_56', 'Book-Title_57', 'Book-Title_58', 'Book-Title_59', 'Book-Title_60', 'Book-Title_61', 'Book-Title_62', 'Book-Title_63', 'Book-Title_64', 'Book-Title_65', 'Book-Title_66', 'Book-Title_67', 'Book-Title_68', 'Book-Title_69', 'Book-Title_70', 'Book-Title_71', 'Book-Title_72', 'Book-Title_73', 'Book-Title_74', 'Book-Title_75', 'Book-Title_76', 'Book-Title_77', 'Book-Title_78', 'Book-Title_79', 'Book-Title_80', 'Book-Title_81', 'Book-Title_82', 'Book-Title_83', 'Book-Title_84', 'Book-Title_85', 'Book-Title_86'
▎Creating a CatBoost Pool
We will use the Pool() format from the CatBoost library.
# CatBoostRanker requires the rows to be sorted by group (in our case, by user)
train_df = train_df.sort_values(by='User-ID')
val_df = val_df.sort_values(by='User-ID')
test_df = test_df.sort_values(by='User-ID')
train_df['User-ID'] = train_df['User-ID'].astype(str)
val_df['User-ID'] = val_df['User-ID'].astype(str)
test_df['User-ID'] = test_df['User-ID'].astype(str)
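Sorting guarantees that each user's rows form one contiguous block, which is what Pool expects from group_id. An optional sanity check (a minimal sketch):
# optional check: every group id must occupy a single contiguous block
def groups_are_contiguous(ids):
    seen, prev = set(), None
    for g in ids:
        if g != prev:
            if g in seen:
                return False
            seen.add(g)
            prev = g
    return True

assert groups_are_contiguous(train_df['User-ID'].tolist())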
train_pool = Pool(
data=train_df[FEATURES],
label=train_df[TARGET],
group_id=train_df['User-ID'].tolist(),
)
val_pool = Pool(
data=val_df[FEATURES],
label=val_df[TARGET],
group_id=val_df["User-ID"].tolist(),
)
# no label here: the test pool is only used for prediction
test_pool = Pool(
    data=test_df[FEATURES],
    group_id=test_df["User-ID"].tolist(),
)
model = CatBoostRanker(loss_function="YetiRank",
verbose=100)
model.fit(train_pool,
eval_set=val_pool,
early_stopping_rounds=100)
0:	test: 0.9793909	best: 0.9793909 (0)	total: 787ms	remaining: 13m 6s
100:	test: 0.9847290	best: 0.9847870 (88)	total: 1m 19s	remaining: 11m 51s
200:	test: 0.9848414	best: 0.9849120 (194)	total: 2m 37s	remaining: 10m 26s
300:	test: 0.9850033	best: 0.9850304 (237)	total: 3m 55s	remaining: 9m 6s
400:	test: 0.9852593	best: 0.9852885 (396)	total: 5m 13s	remaining: 7m 48s
500:	test: 0.9852816	best: 0.9853377 (410)	total: 6m 30s	remaining: 6m 29s
Stopped by overfitting detector (100 iterations wait)

bestTest = 0.9853376892
bestIteration = 410

Shrink model to first 411 iterations.
<catboost.core.CatBoostRanker at 0x7b1d1172b340>
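YetiRank is a pairwise learning-to-rank objective; the test values in the log are CatBoost's default evaluation metric for this loss. If you would rather monitor NDCG on the validation set directly, CatBoost accepts it as eval_metric. A variant sketch (model_ndcg is a hypothetical name; this was not run above):
# variant: track NDCG@10 on the validation set during training
model_ndcg = CatBoostRanker(loss_function="YetiRank",
                            eval_metric="NDCG:top=10",
                            verbose=100)
model_ndcg.fit(train_pool, eval_set=val_pool, early_stopping_rounds=100)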
▎Ranking on the test set
test_df["score"] = model.predict(test_pool)
test_df[['ISBN','Book-Title','Book-Rating','score']].head(10)
|   | ISBN | Book-Title | Book-Rating | score |
|---|------|------------|-------------|-------|
| 194 | 0345402871 | Airframe | 9.0 | -0.190028 |
| 34957 | 0891075275 | Piercing the Darkness | 6.0 | 0.156112 |
| 30245 | 0891076182 | Prophet | 3.0 | -0.064853 |
| 45547 | 0553264990 | Bant/Spec.Last of the Breed | 5.0 | 0.038086 |
| 32009 | 0425099148 | Death in the Clouds | 7.0 | 0.082281 |
| 246 | 0375759778 | Prague : A Novel | 7.0 | -0.347915 |
| 333 | 0553582747 | From the Corner of His Eye | 7.0 | 0.064243 |
| 517 | 0375410538 | Anil's Ghost | 5.0 | -0.156534 |
| 520 | 0966986105 | Prescription for Terror | 10.0 | -0.253021 |
| 525 | 0553062042 | Daybreakers Louis Lamour Collection | 7.0 | -0.227025 |
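To turn the scores into recommendations, sort each user's candidate books by predicted score and keep the head of the list. A minimal sketch for top-5 per user:
# top-5 books per user by predicted score
top5 = (test_df.sort_values(['User-ID', 'score'], ascending=[True, False])
               .groupby('User-ID')
               .head(5))
top5[['User-ID', 'ISBN', 'Book-Title', 'score']].head(10)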
▎Evaluating feature importance
feature_importance = model.get_feature_importance(data=train_pool, verbose=0)
feature_importance_df = (
pd.DataFrame(
feature_importance,
index=FEATURES,
columns=["Importance"],
)
.sort_values(by="Importance", ascending=False)
.reset_index()
)
feature_importance_df
|   | index | Importance |
|---|-------|------------|
| 0 | state | 9.617470e-05 |
| 1 | Book-Title_3 | 9.159373e-05 |
| 2 | Book-Title_56 | 5.298438e-05 |
| 3 | country | 5.202056e-05 |
| 4 | Book-Title_24 | 5.140638e-05 |
| ... | ... | ... |
| 88 | Book-Title_52 | 1.224288e-07 |
| 89 | Age | -1.621962e-06 |
| 90 | Book-Title_25 | -3.205877e-06 |
| 91 | Book-Title_10 | -7.567529e-06 |
| 92 | Book-Author | -4.965173e-04 |

93 rows × 2 columns
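The importance values are tiny, and some are slightly negative: for grouped data CatBoost falls back to loss-based (LossFunctionChange) importances, which can drop below zero for features that hurt the ranking. Since seaborn and matplotlib are already imported, a quick sketch to eyeball the top of the list:
# bar chart of the 15 most important features
plt.figure(figsize=(8, 6))
sns.barplot(data=feature_importance_df.head(15), x="Importance", y="index")
plt.title("Top-15 feature importances")
plt.tight_layout()
plt.show()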
▎A metric for evaluating ranking
NDCG (Normalized Discounted Cumulative Gain) is a metric for evaluating ranking quality in recommender systems and information retrieval. It measures how well the system returns relevant results in the right order.
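With sklearn's defaults, DCG uses the raw relevance as the gain and a log2 positional discount, and NDCG divides by the DCG of the ideal ordering. A tiny worked example, hand-rolled and then checked against ndcg_score (the toy numbers are made up):
# hand-rolled NDCG matching sklearn's defaults (linear gains, log2 discount)
def dcg(rels):
    return sum(rel / np.log2(i + 2) for i, rel in enumerate(rels))

true_rel = np.array([3, 1, 2])        # ground-truth relevance of three items
scores = np.array([0.2, 0.9, 0.5])    # model scores
order = np.argsort(scores)[::-1]      # ranking induced by the scores
manual = dcg(true_rel[order]) / dcg(np.sort(true_rel)[::-1])
print(manual, ndcg_score([true_rel], [scores]))  # the two numbers should match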
# NDCG is only defined for more than one item, so keep users with at least two rated books
users_count = test_df.groupby('User-ID')['ISBN'].count()
users = users_count[users_count.values > 1].index
def get_user_ndcg(user_id):
true_relevance = np.asarray([test_df[test_df['User-ID'] == user_id][TARGET[0]].tolist()])
y_relevance = np.asarray([test_df[test_df['User-ID'] == user_id]['score'].tolist()])
return ndcg_score(true_relevance, y_relevance)
ndcg_scores = []
for user_id in users:
ndcg_scores.append(get_user_ndcg(user_id))
np.mean(ndcg_scores)
0.9651886257470081
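The per-user loop filters the whole test frame once per user, which is quadratic; an equivalent single groupby pass should reproduce the same mean (a sketch):
# equivalent one-pass version of the loop above
ndcg_by_user = (
    test_df[test_df['User-ID'].isin(users)]
    .groupby('User-ID')
    .apply(lambda g: ndcg_score([g[TARGET[0]].to_numpy()], [g['score'].to_numpy()]))
)
print(ndcg_by_user.mean())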