Look-a-Like Task¶
1 | Introduction¶
The look-a-like task¶
The goal of a look-a-like task is to find users similar to a given target group.
- This is needed so that we can segment users and take follow-up actions based on that information
Churn prediction task¶
We will build a model that identifies clients who are likely to leave
- We will solve the look-a-like task on the example of finding the segment of bank clients prone to churn.
- The dataset contains retrospective data on clients who have churned from the bank - the target segment. Likewise, there is data on those who have not churned.
- For any other client from the test set, we need to estimate the probability of churn (the propensity to leave).
Task: build a classification model with the highest possible ROC-AUC
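A quick reminder of the target metric before we start: ROC-AUC can be read as the probability that the model ranks a randomly chosen positive object above a randomly chosen negative one. A toy sketch with made-up labels and scores (not from the dataset):
from sklearn.metrics import roc_auc_score

# 4 objects, 2 positive; 3 of the 4 positive/negative pairs are ranked correctly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))  # 0.75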
import pandas as pd
import numpy as np
import seaborn as sns
import random
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.utils import shuffle
import warnings
warnings.filterwarnings("ignore")
# Display all columns and rows when printing DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Fix random_state as a constant up front
random_state = 47
sns.set(style="whitegrid")
2 | Reading the Data¶
churn = pd.read_csv('/content/Churn_Modelling.csv')
print(churn.shape)
churn.head()
(10000, 14)
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
Features¶
- RowNumber - row index in the data
- CustomerId - unique client identifier
- Surname - last name
- CreditScore - credit score
- Geography - country of residence
- Gender - gender
- Age - age
- Tenure - number of years the person has been a client of the bank
- Balance - account balance
- NumOfProducts - number of bank products the client uses
- HasCrCard - whether the client has a credit card
- IsActiveMember - whether the client is active
- EstimatedSalary - estimated salary
Target feature¶
Exited - whether the client has left (1 = churn)
Let's immediately drop the columns we don't need, so that the models don't overfit to individual customers:
churn = churn.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
churn.head()
| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
3 | Data Visualization¶
# Create the subplots
fig, axes = plt.subplots(4, 2, figsize=(10, 13))
# Histogram of credit scores
sns.histplot(data=churn, x='CreditScore', kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Credit Scores')
# Histogram of account balance
sns.histplot(data=churn, x='Balance', kde=True, ax=axes[0, 1])
axes[0, 1].set_title('Distribution of Balance')
# Histogram of salaries
sns.histplot(data=churn, x='EstimatedSalary', kde=True, ax=axes[1, 0])
axes[1, 0].set_title('Distribution of EstimatedSalary')
# Histogram of age
sns.histplot(data=churn, x='Age', kde=True, ax=axes[1, 1])
axes[1, 1].set_title('Distribution of Age')
# Boxplot of balance by country
sns.boxplot(x='Geography', y='Balance', data=churn, ax=axes[2, 0])
axes[2, 0].set_title('Balance by Geography')
# Boxplot of salary by gender
sns.boxplot(x='Gender', y='EstimatedSalary', data=churn, ax=axes[2, 1])
axes[2, 1].set_title('Estimated Salary by Gender')
# Pie chart for the Exited column (churners)
exited_counts = churn['Exited'].value_counts()
axes[3, 0].pie(exited_counts, labels=exited_counts.index, autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff'])
axes[3, 0].set_title('Exited Distribution')
# Distribution of active members by country
sns.countplot(x='Geography', hue='IsActiveMember', data=churn, ax=axes[3, 1])
axes[3, 1].set_title('Active Members by Geography')
# Adjust the layout and show the plots
plt.tight_layout()
plt.show()
4 | One-Hot Encoding¶
# One-Hot encoding for the logistic regression (it suits the tree-based models too)
churn = pd.get_dummies(churn, drop_first=True)
print(churn.shape)
churn.head()
(10000, 12)
| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 | False | False | False |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | False | True | False |
| 2 | 502 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | False | False | False |
| 3 | 699 | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 | False | False | False |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 | False | True | False |
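Why drop_first=True: without it, the dummy columns of a categorical feature always sum to one, which duplicates the intercept and creates perfect multicollinearity for linear models. A toy illustration on made-up data:
# Toy example (made-up data): the three Geography dummies would always sum to 1
demo = pd.DataFrame({'Geography': ['France', 'Spain', 'Germany']})
print(pd.get_dummies(demo))                   # columns: France, Germany, Spain
print(pd.get_dummies(demo, drop_first=True))  # columns: Germany, Spain (France is the baseline)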
5 | Train/Test Splits¶
Split the data:
features = churn.drop(['Exited'], axis=1)
target = churn['Exited']
# class counts
target.value_counts()
| Exited | count |
|---|---|
| 0 | 7963 |
| 1 | 2037 |
target.mean()
0.2037
# set aside 20% - a fifth of the whole dataset - as the test set
X_train_valid, X_test, y_train_valid, y_test = train_test_split(features, target,
test_size=0.2,
random_state=random_state)
# set aside 25% of train+valid - a quarter of it - as the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid,
test_size=0.25,
random_state=random_state)
s1 = y_train.size
s2 = y_valid.size
s3 = y_test.size
print('Split into train:valid:test sets in the ratio '
      + str(round(s1/s3)) + ':' + str(round(s2/s3)) + ':' + str(round(s3/s3)))
Split into train:valid:test sets in the ratio 3:1:1
targets = [y_train, y_train_valid, y_valid, y_test]
names = ['train:', 'train+valid:', 'valid:', 'test:']
print('Class balance across the splits:\n')
# use a loop variable that does not shadow the `target` Series defined above
for name, split_target in zip(names, targets):
    print(name, split_target.mean().round(3))
Class balance across the splits:

train: 0.201
train+valid: 0.202
valid: 0.204
test: 0.212
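The class shares drift slightly across the splits (from 0.201 to 0.212). If identical proportions matter, train_test_split accepts a stratify argument; a hedged sketch of the stratified variant (not the splits used above):
# optional stratified variant: the 0/1 ratio is preserved exactly in both parts
X_tr, X_te, y_tr, y_te = train_test_split(features, target,
                                          test_size=0.2,
                                          stratify=target,
                                          random_state=random_state)
print(y_tr.mean().round(3), y_te.mean().round(3))  # both ≈ 0.204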
6 | Scaling¶
- Some of the models chosen for this study are linear, and the quality of linear algorithms depends on the scale of the data, so the features must be normalized.
- If the scale of one feature greatly exceeds the scale of the others, quality can drop sharply.
- For normalization we use feature standardization: take the values of a feature over all objects and compute their mean and standard deviation.
- Then subtract the mean from every value of the feature and divide the resulting difference by the standard deviation. StandardScaler() does exactly that, as the sketch below shows...
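A sanity check of the formula (illustrative; the statistics are taken from the training data only, and note that StandardScaler uses the biased standard deviation, ddof=0):
# manual standardization of one column is equivalent to StandardScaler on it
age = X_train['Age']
z = (age - age.mean()) / age.std(ddof=0)
print(z.mean().round(3), z.std(ddof=0).round(3))  # ~0.0 and 1.0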
# Select the numeric features to standardize
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary']
# Fit the standardization parameters on the training set
scaler = StandardScaler()
scaler.fit(X_train[numeric])
# Transform all the splits using the parameters obtained above
X_train[numeric] = scaler.transform(X_train[numeric])
X_valid[numeric] = scaler.transform(X_valid[numeric])
X_test[numeric] = scaler.transform(X_test[numeric])
# X_train_valid feeds the cross-validated searches below, so it gets its own fit
X_train_valid[numeric] = scaler.fit_transform(X_train_valid[numeric])
X_train[numeric].describe().round(3)
| | CreditScore | Age | Tenure | Balance | EstimatedSalary |
|---|---|---|---|---|---|
| count | 6000.000 | 6000.000 | 6000.000 | 6000.000 | 6000.000 |
| mean | 0.000 | 0.000 | 0.000 | -0.000 | -0.000 |
| std | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| min | -3.136 | -1.994 | -1.730 | -1.236 | -1.753 |
| 25% | -0.690 | -0.663 | -0.695 | -1.236 | -0.860 |
| 50% | 0.007 | -0.187 | -0.004 | 0.335 | 0.009 |
| 75% | 0.693 | 0.478 | 1.031 | 0.821 | 0.855 |
| max | 2.067 | 5.042 | 1.721 | 2.807 | 1.729 |
- The values of the selected numeric features now look a little unnatural,
- but they have zero mean and unit standard deviation.
# Model evaluation helper
def calc_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test):
    # Predictions on the training set
    y_train_pred = model.predict(X_train)
    y_train_proba = model.predict_proba(X_train)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_train)
    # Predictions on the validation set
    y_valid_pred = model.predict(X_valid)
    y_valid_proba = model.predict_proba(X_valid)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_valid)
    # Predictions on the test set
    y_test_pred = model.predict(X_test)
    y_test_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_test)
    train_metrics = {
        'precision': precision_score(y_train, y_train_pred),
        'recall': recall_score(y_train, y_train_pred),
        'f1': f1_score(y_train, y_train_pred),
        'roc_auc': roc_auc_score(y_train, y_train_proba)
    }
    valid_metrics = {
        'precision': precision_score(y_valid, y_valid_pred),
        'recall': recall_score(y_valid, y_valid_pred),
        'f1': f1_score(y_valid, y_valid_pred),
        'roc_auc': roc_auc_score(y_valid, y_valid_proba)
    }
    test_metrics = {
        'precision': precision_score(y_test, y_test_pred),
        'recall': recall_score(y_test, y_test_pred),
        'f1': f1_score(y_test, y_test_pred),
        'roc_auc': roc_auc_score(y_test, y_test_proba)
    }
    return train_metrics, valid_metrics, test_metrics

def print_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test):
    # Assemble the three metric dicts into a readable table
    res = calc_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test)
    metrics = pd.DataFrame(res, index=['train', 'valid', 'test']).round(3)
    return metrics
%%time
"""
Manual search over all parameter combinations
- the best model is selected by the ROC-AUC metric
"""
param_grid = {
    'penalty': ['l2', 'l1', 'elasticnet'],  # regularization type (note: l1/elasticnet need a compatible solver, see the note below)
    # 'solver': ['lbfgs', 'liblinear', 'saga'],
    'C': np.linspace(0.001, 2, 50)  # inverse of the regularization strength, i.e. the weight of these penalty terms
}
grid_search = GridSearchCV(LogisticRegression(random_state=42),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1)
grid_search.fit(X_train_valid, y_train_valid)
best_logreg = grid_search.best_estimator_
print(best_logreg.get_params())
{'C': 0.04179591836734694, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
CPU times: user 632 ms, sys: 143 ms, total: 774 ms
Wall time: 7.9 s
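A side note on the grid above: with the default lbfgs solver, the 'l1' and 'elasticnet' entries cannot actually be fitted (those combinations fail and are skipped), so effectively only 'l2' was searched, which is what the best estimator shows. A sketch of a grid pairing each penalty with a compatible solver (an illustrative assumption, not the search that was run):
# a list-of-dicts grid: each dict combines only compatible settings
param_grid_compat = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': np.linspace(0.001, 2, 10)},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.2, 0.5, 0.8],
     'C': np.linspace(0.001, 2, 10)},
]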
res_logreg = print_metrics(best_logreg, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_logreg
| | precision | recall | f1 | roc_auc |
|---|---|---|---|---|
| train | 0.615 | 0.201 | 0.303 | 0.774 |
| valid | 0.617 | 0.174 | 0.272 | 0.765 |
| test | 0.644 | 0.201 | 0.306 | 0.754 |
# Extract the model coefficients
coef = best_logreg.coef_[0]
# Build a DataFrame of features and their coefficients
coef_df = pd.DataFrame({
    'Feature': X_train_valid.columns,
    'Coefficient_logreg': coef.round(3)
})
# Add a column with the sign-based interpretation
coef_df['Interpretation_logreg'] = coef_df['Coefficient_logreg'].apply(lambda x: 'Positive' if x > 0 else 'Negative')
Interpreting Feature Importance
For a linear model, feature importance can be judged from the weights
- Positive: a positive coefficient means that increasing the feature's value increases the probability of the positive outcome.
- Negative: a negative coefficient means that increasing the feature's value decreases the probability of the positive outcome.
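Because logistic regression is linear in the log-odds, exponentiating a coefficient gives the multiplicative change in the odds of churn per unit of the feature (for the scaled numeric features a unit is one standard deviation). A small sketch:
# exp(coef): e.g. exp(0.738) ≈ 2.09, so one std of Age roughly doubles the churn odds
odds_ratios = np.exp(coef_df.set_index('Feature')['Coefficient_logreg'])
print(odds_ratios.round(2))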
coef_df
| | Feature | Coefficient_logreg | Interpretation_logreg |
|---|---|---|---|
| 0 | CreditScore | -0.064 | Negative |
| 1 | Age | 0.738 | Positive |
| 2 | Tenure | -0.032 | Negative |
| 3 | Balance | 0.169 | Positive |
| 4 | NumOfProducts | -0.128 | Negative |
| 5 | HasCrCard | -0.019 | Negative |
| 6 | IsActiveMember | -0.975 | Negative |
| 7 | EstimatedSalary | 0.047 | Positive |
| 8 | Geography_Germany | 0.707 | Positive |
| 9 | Geography_Spain | 0.021 | Positive |
| 10 | Gender_Male | -0.505 | Negative |
Support Vector Machine¶
%%time
"""
- We assume the data may contain nonlinearity, so we compare the linear kernel against a polynomial one
- probability=True : the output is a probability distribution over the classes
'kernel': 'poly' winning the search tells us that the polynomial kernel gives a better result,
which suggests there is some nonlinearity in the data
"""
param_grid = {
    # 'C': np.linspace(0.001, 2, 50),
    # 'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'kernel': ['linear', 'poly']
}
grid_search = GridSearchCV(SVC(random_state=42, probability=True, gamma='scale'),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1)
grid_search.fit(X_train_valid, y_train_valid)
best_SVC = grid_search.best_estimator_
print(best_SVC.get_params())
{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'poly', 'max_iter': -1, 'probability': True, 'random_state': 42, 'shrinking': True, 'tol': 0.001, 'verbose': False}
CPU times: user 9.52 s, sys: 305 ms, total: 9.82 s
Wall time: 52.1 s
res_SVC = print_metrics(best_SVC, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_SVC
| | precision | recall | f1 | roc_auc |
|---|---|---|---|---|
| train | 0.856 | 0.247 | 0.383 | 0.825 |
| valid | 0.885 | 0.246 | 0.385 | 0.841 |
| test | 0.850 | 0.241 | 0.376 | 0.812 |
Interpreting Feature Importance
Permutation Importance
- We cannot interpret the weights of the polynomial model the way we did for the linear one
- So we use a different approach: permutation importance
- It evaluates a feature's importance in context
- Take some feature h
- PI shuffles all the values in that column (effectively corrupting it)
- The already trained model is then evaluated with the corrupted feature (no retraining is needed)
- The resulting quality is recorded
- The same is done for all the remaining columns
- The more the quality drops when a column is shuffled, the larger that feature's contribution to the model (a hand-rolled sketch of this loop follows)
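A minimal hand-rolled version of the procedure (one shuffle per column, ROC-AUC as the score); sklearn's permutation_importance, used below, does the same with several repeats:
# hand-rolled permutation importance: quality drop after shuffling each column
def manual_permutation_importance(model, X, y, seed=0):
    rng = np.random.default_rng(seed)
    base = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = {}
    for col in X.columns:
        X_perm = X.copy()
        X_perm[col] = rng.permutation(X_perm[col].values)  # corrupt one column
        drops[col] = base - roc_auc_score(y, model.predict_proba(X_perm)[:, 1])
    return pd.Series(drops).sort_values(ascending=False)
# e.g. manual_permutation_importance(best_SVC, X_valid, y_valid)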
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(best_SVC, X_test, y_test)
features = np.array(X_test.columns)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(features[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance");
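By default permutation_importance scores with the estimator's .score (accuracy for classifiers) and uses 5 random shuffles. A variant aligned with the project metric and seeded for reproducibility (not the call above) might look like:
perm_auc = permutation_importance(best_SVC, X_test, y_test,
                                  scoring='roc_auc',
                                  n_repeats=5,
                                  random_state=random_state)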
Decision Tree¶
Parameters
- criterion : the criterion used to split a node into two classes; it governs the nature of the splits
- max_depth : the maximum depth of the tree
- min_samples_leaf : the minimum number of samples required in a leaf (a quick depth sketch follows this list)
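Before the full grid search, an illustrative sketch of why max_depth is the main regularizer here: tracing validation ROC-AUC over a few depths shows quality degrading once the tree starts to overfit (separate from the search below):
# validation ROC-AUC as a function of tree depth
for depth in [2, 4, 6, 8, 10]:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    auc = roc_auc_score(y_valid, dt.predict_proba(X_valid)[:, 1])
    print(f'max_depth={depth}: valid ROC-AUC={auc:.3f}')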
%%time
param_grid = {
'criterion': ['gini', 'entropy', 'log_loss'],
# 'splitter': ['best', 'random'],
'max_depth': range(1, 11),
# 'min_samples_split': range(2, 10),
'min_samples_leaf': range(2, 10)
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1)
grid_search.fit(X_train_valid, y_train_valid)
best_tree = grid_search.best_estimator_
print(best_tree.get_params())
{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 6, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 9, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': 42, 'splitter': 'best'}
CPU times: user 1.17 s, sys: 127 ms, total: 1.29 s
Wall time: 31.1 s
res_tree = print_metrics(best_tree, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_tree
| | precision | recall | f1 | roc_auc |
|---|---|---|---|---|
| train | 0.767 | 0.485 | 0.594 | 0.868 |
| valid | 0.813 | 0.459 | 0.587 | 0.869 |
| test | 0.744 | 0.440 | 0.553 | 0.841 |
def plot_importance(model, X):
    # Extract the feature importances
    feature_importances = model.feature_importances_
    # Build a DataFrame of features and their importances
    importance_df = pd.DataFrame({
        'Feature': X.columns,
        'Importance': feature_importances
    })
    # Sort the DataFrame by importance
    importance_df = importance_df.sort_values(by='Importance', ascending=False)
    # Plot the feature importances
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Importance', y='Feature', data=importance_df)
    plt.title('Feature Importances in Decision Tree')
    plt.show()
plot_importance(best_tree, X_train_valid)
Random Forest¶
Hyperparameter Optimization
- we use the Optuna optimizer to tune the parameters
- the optimizer relies on Bayesian optimization (see the seeding sketch right after the import below)
!pip install optuna -qqq
import optuna
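Optuna's default sampler is TPE, a form of Bayesian optimization; for fully reproducible studies the sampler can be seeded explicitly (an optional variant, not used in the run below):
# optional: a seeded TPE sampler makes repeated runs explore the same trials
seeded_study = optuna.create_study(direction="maximize",
                                   sampler=optuna.samplers.TPESampler(seed=random_state))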
# the objective to optimize
def objective(trial):
    # random forest hyperparameters
    param = {
        'criterion': trial.suggest_categorical("criterion", ['gini', 'entropy', 'log_loss']),
        'n_estimators': trial.suggest_int("n_estimators", 10, 100),
        "max_depth": trial.suggest_int("max_depth", 2, 10),  # keep the trees shallow
        'min_samples_split': trial.suggest_int("min_samples_split", 2, 10),
        'min_samples_leaf': trial.suggest_int("min_samples_leaf", 2, 10),
    }
    model = RandomForestClassifier(**param, random_state=random_state)
    model.fit(X_train, y_train)
    preds = model.predict_proba(X_valid)[:, 1]
    auc = roc_auc_score(y_valid, preds)
    return auc
# maximize ROC-AUC
study = optuna.create_study(direction="maximize",
                            study_name='RandomForestClassifier')
study.optimize(objective, n_trials=10)  # try increasing this
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial
print("Value: {}".format(trial.value))
print("Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
[I 2024-08-14 03:46:14,901] A new study created in memory with name: RandomForestClassifier
[I 2024-08-14 03:46:15,008] Trial 0 finished with value: 0.8334212486754861 and parameters: {'criterion': 'log_loss', 'n_estimators': 15, 'max_depth': 5, 'min_samples_split': 8, 'min_samples_leaf': 8}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:15,213] Trial 1 finished with value: 0.8172085799204443 and parameters: {'criterion': 'gini', 'n_estimators': 38, 'max_depth': 2, 'min_samples_split': 8, 'min_samples_leaf': 10}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:15,408] Trial 2 finished with value: 0.8193139210088363 and parameters: {'criterion': 'log_loss', 'n_estimators': 17, 'max_depth': 3, 'min_samples_split': 5, 'min_samples_leaf': 9}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:16,070] Trial 3 finished with value: 0.8264767078326399 and parameters: {'criterion': 'log_loss', 'n_estimators': 61, 'max_depth': 2, 'min_samples_split': 3, 'min_samples_leaf': 6}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:16,818] Trial 4 finished with value: 0.8437150555794624 and parameters: {'criterion': 'log_loss', 'n_estimators': 53, 'max_depth': 5, 'min_samples_split': 2, 'min_samples_leaf': 7}. Best is trial 4 with value: 0.8437150555794624.
[I 2024-08-14 03:46:18,630] Trial 5 finished with value: 0.8498282566079176 and parameters: {'criterion': 'gini', 'n_estimators': 68, 'max_depth': 9, 'min_samples_split': 6, 'min_samples_leaf': 9}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:18,932] Trial 6 finished with value: 0.8170481729803764 and parameters: {'criterion': 'log_loss', 'n_estimators': 13, 'max_depth': 3, 'min_samples_split': 9, 'min_samples_leaf': 7}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:19,959] Trial 7 finished with value: 0.8418210197871214 and parameters: {'criterion': 'log_loss', 'n_estimators': 65, 'max_depth': 4, 'min_samples_split': 9, 'min_samples_leaf': 5}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:20,197] Trial 8 finished with value: 0.8422798761781811 and parameters: {'criterion': 'log_loss', 'n_estimators': 15, 'max_depth': 6, 'min_samples_split': 8, 'min_samples_leaf': 4}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:21,125] Trial 9 finished with value: 0.8426600714736308 and parameters: {'criterion': 'log_loss', 'n_estimators': 96, 'max_depth': 4, 'min_samples_split': 5, 'min_samples_leaf': 10}. Best is trial 5 with value: 0.8498282566079176.
Number of finished trials: 10
Best trial:
Value: 0.8498282566079176
Params:
criterion: gini
n_estimators: 68
max_depth: 9
min_samples_split: 6
min_samples_leaf: 9
# Train the model with the optimal parameters
best_forest = RandomForestClassifier(**study.best_params, random_state=random_state)
best_forest.fit(X_train_valid, y_train_valid)
res_forest = print_metrics(best_forest, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_forest
| | precision | recall | f1 | roc_auc |
|---|---|---|---|---|
| train | 0.863 | 0.463 | 0.603 | 0.914 |
| valid | 0.894 | 0.477 | 0.622 | 0.908 |
| test | 0.795 | 0.411 | 0.542 | 0.857 |
Interpreting Feature Importance
plot_importance(best_forest, X_train_valid)
Gradient Boosting¶
Choosing hyperparameters
learning_rate : the learning rate, which determines how strongly the model is updated at each boosting step
max_depth : the maximum depth of the decision trees
l2_leaf_reg : L2 regularization on the tree leaves
subsample : the fraction of the sample used to train each tree
random_strength
- A parameter controlling the amount of randomness in tree construction
- It determines how strongly random perturbations affect how the data is split at each node of a tree
min_data_in_leaf : the minimum number of samples required to create a leaf in a tree.
Defining the objective function
We need to define the function that we are going to optimize.
from catboost import CatBoostClassifier
def objective(trial):
    # CatBoost hyperparameters to search over
    param = {
        "learning_rate": trial.suggest_float('learning_rate', 0.01, 0.9),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "l2_leaf_reg": trial.suggest_float('l2_leaf_reg', 0.01, 2),
        "subsample": trial.suggest_float('subsample', 0.01, 1),
        "random_strength": trial.suggest_float('random_strength', 1, 200),
        "min_data_in_leaf": trial.suggest_float('min_data_in_leaf', 1, 500)
    }
    cat = CatBoostClassifier(
        logging_level="Silent",
        eval_metric="AUC",
        grow_policy="Lossguide",  # min_data_in_leaf requires the Lossguide or Depthwise policy
        random_seed=42,
        **param)
    # early stopping on the validation set keeps the trials fast
    cat.fit(X_train, y_train,
            eval_set=(X_valid, y_valid),
            verbose=False,
            early_stopping_rounds=10
            )
    preds = cat.predict_proba(X_valid)[:, 1]
    auc = roc_auc_score(y_valid, preds)
    return auc
Create the study object and run the optimization
study = optuna.create_study(direction="maximize", study_name='CatBoostClassifier')
study.optimize(objective, n_trials=100)  # try increasing this
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial
print("Value: {}".format(trial.value))
print("Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
[I 2024-08-14 03:46:44,732] A new study created in memory with name: CatBoostClassifier
[I 2024-08-14 03:46:44,948] Trial 0 finished with value: 0.5 and parameters: {'learning_rate': 0.2165534023121395, 'max_depth': 2, 'l2_leaf_reg': 1.5477322039068058, 'subsample': 0.029345419918240072, 'random_strength': 6.5809964961892495, 'min_data_in_leaf': 477.52416517082804}. Best is trial 0 with value: 0.5.
[I 2024-08-14 03:46:45,339] Trial 1 finished with value: 0.8372625321777863 and parameters: {'learning_rate': 0.789728112113311, 'max_depth': 4, 'l2_leaf_reg': 0.6004269201561995, 'subsample': 0.55359191980997, 'random_strength': 46.57861106355321, 'min_data_in_leaf': 425.7303912933811}. Best is trial 1 with value: 0.8372625321777863.
[I 2024-08-14 03:46:45,992] Trial 2 finished with value: 0.8485835604479672 and parameters: {'learning_rate': 0.4634724182719374, 'max_depth': 6, 'l2_leaf_reg': 0.575436612277608, 'subsample': 0.42186545862300934, 'random_strength': 123.97404853733563, 'min_data_in_leaf': 98.4208597017056}. Best is trial 2 with value: 0.8485835604479672.
[I 2024-08-14 03:46:48,201] Trial 3 finished with value: 0.8515032752320888 and parameters: {'learning_rate': 0.09543005782339105, 'max_depth': 9, 'l2_leaf_reg': 1.2167151901623765, 'subsample': 0.06245537706270228, 'random_strength': 61.228218388468974, 'min_data_in_leaf': 38.39023875452006}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:48,708] Trial 4 finished with value: 0.8406711796542305 and parameters: {'learning_rate': 0.7331013466248556, 'max_depth': 6, 'l2_leaf_reg': 0.012765794151882687, 'subsample': 0.14335624784813372, 'random_strength': 73.50485090264708, 'min_data_in_leaf': 380.00905654802966}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:50,233] Trial 5 finished with value: 0.8435615893243013 and parameters: {'learning_rate': 0.23370405202769556, 'max_depth': 9, 'l2_leaf_reg': 0.16497446248632272, 'subsample': 0.8752850397650734, 'random_strength': 111.64581392121151, 'min_data_in_leaf': 388.6064442254268}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:51,271] Trial 6 finished with value: 0.8226369667047633 and parameters: {'learning_rate': 0.04122993099237723, 'max_depth': 2, 'l2_leaf_reg': 0.20284874078763926, 'subsample': 0.6367717313128627, 'random_strength': 161.4305343264782, 'min_data_in_leaf': 18.271195536437006}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:51,778] Trial 7 finished with value: 0.8371075235482015 and parameters: {'learning_rate': 0.8582564268296371, 'max_depth': 7, 'l2_leaf_reg': 0.6159108475992469, 'subsample': 0.6430359063610354, 'random_strength': 194.77206434441388, 'min_data_in_leaf': 395.81326570058985}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:52,106] Trial 8 finished with value: 0.8514030208945464 and parameters: {'learning_rate': 0.6220500822136842, 'max_depth': 5, 'l2_leaf_reg': 0.6614202111677746, 'subsample': 0.3763622594650606, 'random_strength': 1.5372606636099502, 'min_data_in_leaf': 210.81684346685665}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:53,023] Trial 9 finished with value: 0.8379396345498039 and parameters: {'learning_rate': 0.7670675853970256, 'max_depth': 8, 'l2_leaf_reg': 1.8403233110251145, 'subsample': 0.33221049473597575, 'random_strength': 76.59908186125371, 'min_data_in_leaf': 220.27667189607422}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:53,795] Trial 10 finished with value: 0.8352975471619539 and parameters: {'learning_rate': 0.03939269976022153, 'max_depth': 10, 'l2_leaf_reg': 1.2239191289705484, 'subsample': 0.1971957977790358, 'random_strength': 45.97984119612045, 'min_data_in_leaf': 117.68547624545266}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:54,182] Trial 11 finished with value: 0.8534204466407855 and parameters: {'learning_rate': 0.5292947899032613, 'max_depth': 4, 'l2_leaf_reg': 1.0562715822887339, 'subsample': 0.29238467023057874, 'random_strength': 1.0080155926863474, 'min_data_in_leaf': 217.05912068009718}. Best is trial 11 with value: 0.8534204466407855.
[I 2024-08-14 03:46:54,925] Trial 12 finished with value: 0.854156159240905 and parameters: {'learning_rate': 0.46667965752333035, 'max_depth': 4, 'l2_leaf_reg': 1.1034539045748595, 'subsample': 0.2177545543283828, 'random_strength': 33.69191424519235, 'min_data_in_leaf': 287.6354438748095}. Best is trial 12 with value: 0.854156159240905.
[I 2024-08-14 03:46:55,261] Trial 13 finished with value: 0.8481239328696957 and parameters: {'learning_rate': 0.47388030842524104, 'max_depth': 4, 'l2_leaf_reg': 0.9841376098092054, 'subsample': 0.23796767505443062, 'random_strength': 30.559857725987595, 'min_data_in_leaf': 304.8772947881302}. Best is trial 12 with value: 0.854156159240905.
[I 2024-08-14 03:46:55,774] Trial 14 finished with value: 0.8512302749590884 and parameters: {'learning_rate': 0.5677435867325714, 'max_depth': 4, 'l2_leaf_reg': 0.9858935858858828, 'subsample': 0.2880787724849706, 'random_strength': 19.641914653188316, 'min_data_in_leaf': 296.50997464128113}. Best is trial 12 with value: 0.854156159240905.
[I 2024-08-14 03:46:56,213] Trial 15 finished with value: 0.8494503748741036 and parameters: {'learning_rate': 0.37166572152717614, 'max_depth': 3, 'l2_leaf_reg': 1.4583249930757494, 'subsample': 0.9955315242420071, 'random_strength': 88.02174705220291, 'min_data_in_leaf': 179.3738772355934}. Best is trial 12 with value: 0.854156159240905.
[I 2024-08-14 03:46:57,055] Trial 16 finished with value: 0.8559885000562966 and parameters: {'learning_rate': 0.3210310795870268, 'max_depth': 5, 'l2_leaf_reg': 1.9094658324753917, 'subsample': 0.4804851220395184, 'random_strength': 33.49522734292147, 'min_data_in_leaf': 287.9109110478402}. Best is trial 16 with value: 0.8559885000562966.
[I 2024-08-14 03:46:57,763] Trial 17 finished with value: 0.847795407117441 and parameters: {'learning_rate': 0.35378097903239786, 'max_depth': 5, 'l2_leaf_reg': 1.9401989755988598, 'subsample': 0.48416569735771703, 'random_strength': 40.07318375520298, 'min_data_in_leaf': 303.24367032588754}. Best is trial 16 with value: 0.8559885000562966.
[I 2024-08-14 03:46:58,649] Trial 18 finished with value: 0.8543381594229051 and parameters: {'learning_rate': 0.3270310046948177, 'max_depth': 5, 'l2_leaf_reg': 1.6932697259348661, 'subsample': 0.79447929672279, 'random_strength': 136.12105973617452, 'min_data_in_leaf': 329.93271902824506}. Best is trial 16 with value: 0.8559885000562966.
[I 2024-08-14 03:46:59,576] Trial 19 finished with value: 0.8500765788901381 and parameters: {'learning_rate': 0.2708829352692093, 'max_depth': 7, 'l2_leaf_reg': 1.7038356958216352, 'subsample': 0.7632729569486497, 'random_strength': 138.4597489310068, 'min_data_in_leaf': 343.8985733074137}. Best is trial 16 with value: 0.8559885000562966.
[I 2024-08-14 03:47:00,116] Trial 20 finished with value: 0.8531150565048871 and parameters: {'learning_rate': 0.3626267613161781, 'max_depth': 5, 'l2_leaf_reg': 1.9795846165177058, 'subsample': 0.7827032985899299, 'random_strength': 142.73839861496384, 'min_data_in_leaf': 493.0493076589869}. Best is trial 16 with value: 0.8559885000562966.
[I 2024-08-14 03:47:00,940] Trial 21 finished with value: 0.8563910597808904 and parameters: {'learning_rate': 0.16961344589594285, 'max_depth': 3, 'l2_leaf_reg': 1.649834982836516, 'subsample': 0.5749517198100278, 'random_strength': 99.89102990362149, 'min_data_in_leaf': 271.7012421453372}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:02,167] Trial 22 finished with value: 0.8344091394938851 and parameters: {'learning_rate': 0.14552950442407062, 'max_depth': 3, 'l2_leaf_reg': 1.6649232359966202, 'subsample': 0.5985514802507333, 'random_strength': 169.99871764933434, 'min_data_in_leaf': 261.3080549135861}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:04,114] Trial 23 finished with value: 0.8545186172304817 and parameters: {'learning_rate': 0.15793176071004453, 'max_depth': 3, 'l2_leaf_reg': 1.3526112935873842, 'subsample': 0.7175761381145475, 'random_strength': 113.99285178004443, 'min_data_in_leaf': 352.6566495237683}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:06,560] Trial 24 finished with value: 0.8553638384146858 and parameters: {'learning_rate': 0.1556921373541092, 'max_depth': 3, 'l2_leaf_reg': 1.4115008286364423, 'subsample': 0.480391200689659, 'random_strength': 100.39425793454086, 'min_data_in_leaf': 350.01711545220127}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:08,806] Trial 25 finished with value: 0.8480560683950514 and parameters: {'learning_rate': 0.16409141193428203, 'max_depth': 2, 'l2_leaf_reg': 1.496137374339839, 'subsample': 0.46620575811523923, 'random_strength': 91.45570082665543, 'min_data_in_leaf': 163.57983997102036}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:10,275] Trial 26 finished with value: 0.8544893121164308 and parameters: {'learning_rate': 0.2837819547192468, 'max_depth': 3, 'l2_leaf_reg': 1.8020953679362315, 'subsample': 0.5189916467893968, 'random_strength': 101.08298523619007, 'min_data_in_leaf': 253.22233056776207}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:10,680] Trial 27 finished with value: 0.8211246685822957 and parameters: {'learning_rate': 0.1044556356752651, 'max_depth': 2, 'l2_leaf_reg': 1.3459091803909486, 'subsample': 0.431619565523395, 'random_strength': 59.294142417791434, 'min_data_in_leaf': 449.8387981624195}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:10,968] Trial 28 finished with value: 0.8035146085993543 and parameters: {'learning_rate': 0.21613547632724206, 'max_depth': 3, 'l2_leaf_reg': 1.6092822334057308, 'subsample': 0.5676045993284018, 'random_strength': 72.80957980792742, 'min_data_in_leaf': 263.0893593280764}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:11,494] Trial 29 finished with value: 0.8547792785080921 and parameters: {'learning_rate': 0.2086592133839792, 'max_depth': 2, 'l2_leaf_reg': 0.819697698152623, 'subsample': 0.6834571224386156, 'random_strength': 19.984681853813314, 'min_data_in_leaf': 453.6832846244588}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:12,284] Trial 30 finished with value: 0.8367967351018197 and parameters: {'learning_rate': 0.3944960136703665, 'max_depth': 7, 'l2_leaf_reg': 1.8676872612497761, 'subsample': 0.38308326997649067, 'random_strength': 154.4491135123389, 'min_data_in_leaf': 332.6615450443436}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:12,895] Trial 31 finished with value: 0.8546836512938208 and parameters: {'learning_rate': 0.21596851649554094, 'max_depth': 2, 'l2_leaf_reg': 0.8664350791512772, 'subsample': 0.6869620770295259, 'random_strength': 24.469562538665006, 'min_data_in_leaf': 444.9760164258388}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:13,411] Trial 32 finished with value: 0.8548571684164904 and parameters: {'learning_rate': 0.28016951231760356, 'max_depth': 2, 'l2_leaf_reg': 0.806914526399829, 'subsample': 0.5400321582582661, 'random_strength': 16.595337724789662, 'min_data_in_leaf': 425.5676845159874}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:14,332] Trial 33 finished with value: 0.8465121515968972 and parameters: {'learning_rate': 0.2943832841669326, 'max_depth': 3, 'l2_leaf_reg': 1.4002865433211302, 'subsample': 0.5244965679173971, 'random_strength': 55.18009770759243, 'min_data_in_leaf': 411.0478307080825}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:15,095] Trial 34 finished with value: 0.8532785481938023 and parameters: {'learning_rate': 0.41204275970364945, 'max_depth': 4, 'l2_leaf_reg': 1.5580897723724074, 'subsample': 0.588510970323773, 'random_strength': 119.79281757796326, 'min_data_in_leaf': 388.1140383915803}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:15,717] Trial 35 finished with value: 0.8286869303818456 and parameters: {'learning_rate': 0.08975166153572325, 'max_depth': 2, 'l2_leaf_reg': 1.2419086491253597, 'subsample': 0.44346103753859356, 'random_strength': 12.29947978876556, 'min_data_in_leaf': 364.58824440904306}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:16,582] Trial 36 finished with value: 0.852669310296429 and parameters: {'learning_rate': 0.250033922012678, 'max_depth': 6, 'l2_leaf_reg': 0.42864672108139, 'subsample': 0.5268262007033628, 'random_strength': 95.42973443090546, 'min_data_in_leaf': 429.3365607018824}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:17,024] Trial 37 finished with value: 0.8127665415801009 and parameters: {'learning_rate': 0.1691762432733271, 'max_depth': 3, 'l2_leaf_reg': 1.7829743976227164, 'subsample': 0.3765836616592204, 'random_strength': 105.88486237007761, 'min_data_in_leaf': 475.9380950106968}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:17,801] Trial 38 finished with value: 0.852903751208836 and parameters: {'learning_rate': 0.3144852057758997, 'max_depth': 5, 'l2_leaf_reg': 0.7468205513730214, 'subsample': 0.6288168983269199, 'random_strength': 81.03884687055434, 'min_data_in_leaf': 278.02630322886773}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:18,104] Trial 39 finished with value: 0.7429039208700227 and parameters: {'learning_rate': 0.0916001703115347, 'max_depth': 2, 'l2_leaf_reg': 0.3209407231862438, 'subsample': 0.48947279248425846, 'random_strength': 63.425336876240344, 'min_data_in_leaf': 312.50209443952326}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:18,299] Trial 40 finished with value: 0.5 and parameters: {'learning_rate': 0.6819438096729318, 'max_depth': 6, 'l2_leaf_reg': 1.1696390615306496, 'subsample': 0.02690876704491979, 'random_strength': 186.05663001448391, 'min_data_in_leaf': 237.24363431981527}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:18,628] Trial 41 finished with value: 0.8275224376919292 and parameters: {'learning_rate': 0.20539423781999722, 'max_depth': 2, 'l2_leaf_reg': 0.7875365906927447, 'subsample': 0.7197475219510165, 'random_strength': 14.4847764719263, 'min_data_in_leaf': 469.9772033903035}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:19,489] Trial 42 finished with value: 0.8467774399977791 and parameters: {'learning_rate': 0.11258151909406194, 'max_depth': 2, 'l2_leaf_reg': 0.8721010222525916, 'subsample': 0.6594836798807467, 'random_strength': 48.89095313702967, 'min_data_in_leaf': 402.73822377455843}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:20,194] Trial 43 finished with value: 0.8518981230845636 and parameters: {'learning_rate': 0.19221299265782907, 'max_depth': 3, 'l2_leaf_reg': 0.4867586865668939, 'subsample': 0.5571688603077394, 'random_strength': 31.93422189548409, 'min_data_in_leaf': 363.31928598127024}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:22,996] Trial 44 finished with value: 0.8487424250136115 and parameters: {'learning_rate': 0.039144856136337974, 'max_depth': 4, 'l2_leaf_reg': 0.9009428480096298, 'subsample': 0.8375833862722262, 'random_strength': 6.814579833825066, 'min_data_in_leaf': 432.94909658576285}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:23,939] Trial 45 finished with value: 0.8486205774341369 and parameters: {'learning_rate': 0.2510132838894006, 'max_depth': 2, 'l2_leaf_reg': 0.6907636418579387, 'subsample': 0.6280995182686069, 'random_strength': 24.31951969212401, 'min_data_in_leaf': 374.4772844339267}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:25,057] Trial 46 finished with value: 0.8486591367947299 and parameters: {'learning_rate': 0.4177823887952824, 'max_depth': 3, 'l2_leaf_reg': 0.5837206207668852, 'subsample': 0.10938321642670351, 'random_strength': 41.931702968776605, 'min_data_in_leaf': 178.84495282770186}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:25,696] Trial 47 finished with value: 0.823857756061146 and parameters: {'learning_rate': 0.019782576463977575, 'max_depth': 4, 'l2_leaf_reg': 1.9024552651741735, 'subsample': 0.40297843167975866, 'random_strength': 66.82247683667727, 'min_data_in_leaf': 316.8543796002272}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:26,796] Trial 48 finished with value: 0.8527418018943443 and parameters: {'learning_rate': 0.12284365746995335, 'max_depth': 8, 'l2_leaf_reg': 1.5884344189532063, 'subsample': 0.3276266788784489, 'random_strength': 51.29989880997462, 'min_data_in_leaf': 459.0054658582284}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:27,241] Trial 49 finished with value: 0.8540805828941422 and parameters: {'learning_rate': 0.3299775712117529, 'max_depth': 3, 'l2_leaf_reg': 1.0878543895513753, 'subsample': 0.6942586059012404, 'random_strength': 11.713670742817778, 'min_data_in_leaf': 495.2165248684829}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:27,758] Trial 50 finished with value: 0.8472201014573897 and parameters: {'learning_rate': 0.25324978223894085, 'max_depth': 4, 'l2_leaf_reg': 1.7363552106408617, 'subsample': 0.5877810779665728, 'random_strength': 37.972565443197155, 'min_data_in_leaf': 412.6914975963677}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:28,249] Trial 51 finished with value: 0.8474005592649662 and parameters: {'learning_rate': 0.19612243767630727, 'max_depth': 2, 'l2_leaf_reg': 0.8917975193826132, 'subsample': 0.6848753748976462, 'random_strength': 24.324814512268816, 'min_data_in_leaf': 436.96437349015264}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:28,631] Trial 52 finished with value: 0.8247153162407398 and parameters: {'learning_rate': 0.06857413116836153, 'max_depth': 2, 'l2_leaf_reg': 0.8093899327079714, 'subsample': 0.8856644877809534, 'random_strength': 22.15656724840708, 'min_data_in_leaf': 454.16888064861433}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:28,958] Trial 53 finished with value: 0.8552512450817537 and parameters: {'learning_rate': 0.2277359380751674, 'max_depth': 2, 'l2_leaf_reg': 1.0044730574652196, 'subsample': 0.5426082432903984, 'random_strength': 2.092346231014936, 'min_data_in_leaf': 283.86020680504697}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:29,458] Trial 54 finished with value: 0.8545217019793291 and parameters: {'learning_rate': 0.3128193984950219, 'max_depth': 3, 'l2_leaf_reg': 1.008665858224686, 'subsample': 0.47370734259106645, 'random_strength': 6.662500258650947, 'min_data_in_leaf': 283.74741280019464}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:29,754] Trial 55 finished with value: 0.8421464607905287 and parameters: {'learning_rate': 0.8773525297785807, 'max_depth': 2, 'l2_leaf_reg': 1.149059206077042, 'subsample': 0.5453409140242778, 'random_strength': 128.55684672230558, 'min_data_in_leaf': 233.31803747702202}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:30,145] Trial 56 finished with value: 0.8550931517033213 and parameters: {'learning_rate': 0.13656066626789937, 'max_depth': 2, 'l2_leaf_reg': 1.2919909470689186, 'subsample': 0.43933745123796614, 'random_strength': 1.7054069646826093, 'min_data_in_leaf': 62.17289685684315}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:30,493] Trial 57 finished with value: 0.8527880731270562 and parameters: {'learning_rate': 0.13041630646518754, 'max_depth': 10, 'l2_leaf_reg': 1.274378826982479, 'subsample': 0.34334490211186264, 'random_strength': 4.2565201685199305, 'min_data_in_leaf': 71.55126556167352}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:30,682] Trial 58 finished with value: 0.8514184446387836 and parameters: {'learning_rate': 0.16594689210860503, 'max_depth': 3, 'l2_leaf_reg': 1.4888320186002333, 'subsample': 0.4376508602553443, 'random_strength': 1.3885552941461277, 'min_data_in_leaf': 152.70173531249966}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:31,007] Trial 59 finished with value: 0.8540620744010574 and parameters: {'learning_rate': 0.2750010284884532, 'max_depth': 5, 'l2_leaf_reg': 1.2746287691142977, 'subsample': 0.5071138630239854, 'random_strength': 16.021246309606525, 'min_data_in_leaf': 8.92644330176222}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:31,275] Trial 60 finished with value: 0.8236294846464337 and parameters: {'learning_rate': 0.06580145985259861, 'max_depth': 4, 'l2_leaf_reg': 1.4207580932038155, 'subsample': 0.2559732452953547, 'random_strength': 34.10123336167225, 'min_data_in_leaf': 35.04352061528167}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:31,513] Trial 61 finished with value: 0.8484570857452214 and parameters: {'learning_rate': 0.23125001722207209, 'max_depth': 2, 'l2_leaf_reg': 0.951125702424557, 'subsample': 0.45236883411477885, 'random_strength': 16.37629361091796, 'min_data_in_leaf': 276.32032283099494}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:31,848] Trial 62 finished with value: 0.8491604084824422 and parameters: {'learning_rate': 0.18521171522349617, 'max_depth': 2, 'l2_leaf_reg': 1.0233833614681076, 'subsample': 0.6028311345426446, 'random_strength': 27.84478190364073, 'min_data_in_leaf': 212.63167127467798}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:32,139] Trial 63 finished with value: 0.8579966715559936 and parameters: {'learning_rate': 0.2275828875803787, 'max_depth': 2, 'l2_leaf_reg': 1.649969257251929, 'subsample': 0.41432801712540723, 'random_strength': 9.871440937029153, 'min_data_in_leaf': 339.4206240957627}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:32,353] Trial 64 finished with value: 0.8410583156345868 and parameters: {'learning_rate': 0.15443576863058872, 'max_depth': 3, 'l2_leaf_reg': 1.6356148561301187, 'subsample': 0.40829684886366496, 'random_strength': 8.920803659606767, 'min_data_in_leaf': 329.25270044657947}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:32,559] Trial 65 finished with value: 0.8475794746981188 and parameters: {'learning_rate': 0.52262191640972, 'max_depth': 2, 'l2_leaf_reg': 1.7570515967211897, 'subsample': 0.35209745591260677, 'random_strength': 79.31033248169595, 'min_data_in_leaf': 298.8152094762548}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:32,817] Trial 66 finished with value: 0.8545409816596259 and parameters: {'learning_rate': 0.3397594287720047, 'max_depth': 3, 'l2_leaf_reg': 1.51439591232287, 'subsample': 0.48690121670146136, 'random_strength': 113.64174712891179, 'min_data_in_leaf': 344.9705847573752}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:33,180] Trial 67 finished with value: 0.8527741917572427 and parameters: {'learning_rate': 0.2872741864314234, 'max_depth': 6, 'l2_leaf_reg': 1.9758241623536021, 'subsample': 0.30706088302917706, 'random_strength': 86.69768770901611, 'min_data_in_leaf': 195.60397604666105}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:33,542] Trial 68 finished with value: 0.8388542625830762 and parameters: {'learning_rate': 0.37734226459157394, 'max_depth': 7, 'l2_leaf_reg': 1.84386701582687, 'subsample': 0.5416787382022203, 'random_strength': 102.93491264871817, 'min_data_in_leaf': 100.73573695821503}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:33,798] Trial 69 finished with value: 0.8491079677520357 and parameters: {'learning_rate': 0.2396158685166121, 'max_depth': 3, 'l2_leaf_reg': 1.3497075412335753, 'subsample': 0.41533858440585675, 'random_strength': 9.011802604306009, 'min_data_in_leaf': 244.2984153117823}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:34,249] Trial 70 finished with value: 0.8551000923882279 and parameters: {'learning_rate': 0.1405771071007593, 'max_depth': 2, 'l2_leaf_reg': 1.6670425816135785, 'subsample': 0.5628243022632802, 'random_strength': 43.63893059165331, 'min_data_in_leaf': 265.0039335083252}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:34,463] Trial 71 finished with value: 0.8369964725896929 and parameters: {'learning_rate': 0.1346261793235442, 'max_depth': 2, 'l2_leaf_reg': 1.69505852020515, 'subsample': 0.5662727341567123, 'random_strength': 18.17730751314747, 'min_data_in_leaf': 261.9534789983721}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:34,721] Trial 72 finished with value: 0.8321156287257981 and parameters: {'learning_rate': 0.07724194746378436, 'max_depth': 2, 'l2_leaf_reg': 1.6529774456888673, 'subsample': 0.5130735404504633, 'random_strength': 43.04440541458727, 'min_data_in_leaf': 271.8748406602916}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:34,936] Trial 73 finished with value: 0.8534474381932009 and parameters: {'learning_rate': 0.2193541378472138, 'max_depth': 2, 'l2_leaf_reg': 1.55127358733355, 'subsample': 0.45844426802461946, 'random_strength': 1.2277571480166767, 'min_data_in_leaf': 323.54337938841485}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:35,129] Trial 74 finished with value: 0.8245140363784432 and parameters: {'learning_rate': 0.17096273713989574, 'max_depth': 3, 'l2_leaf_reg': 1.4299055079718532, 'subsample': 0.6103423491223143, 'random_strength': 35.52778769358011, 'min_data_in_leaf': 309.4874282846581}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:35,471] Trial 75 finished with value: 0.849687900535358 and parameters: {'learning_rate': 0.1424757474479632, 'max_depth': 2, 'l2_leaf_reg': 1.9040100631862928, 'subsample': 0.49178296476749883, 'random_strength': 29.861178645449915, 'min_data_in_leaf': 288.4169959906686}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:35,653] Trial 76 finished with value: 0.8166471556302065 and parameters: {'learning_rate': 0.10900450958876559, 'max_depth': 3, 'l2_leaf_reg': 1.1584526059635818, 'subsample': 0.5436945071994185, 'random_strength': 97.11608635264365, 'min_data_in_leaf': 252.08827881313073}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:36,111] Trial 77 finished with value: 0.8529130054553784 and parameters: {'learning_rate': 0.26344618455418334, 'max_depth': 2, 'l2_leaf_reg': 1.8036914133857063, 'subsample': 0.5759920935180337, 'random_strength': 149.97345964123704, 'min_data_in_leaf': 230.53437749650683}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:36,464] Trial 78 finished with value: 0.8554301605149063 and parameters: {'learning_rate': 0.30386516642561456, 'max_depth': 2, 'l2_leaf_reg': 1.5941173743706698, 'subsample': 0.388487385094279, 'random_strength': 12.04964504414562, 'min_data_in_leaf': 340.15331434948547}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:36,925] Trial 79 finished with value: 0.8556607454912538 and parameters: {'learning_rate': 0.3043185361985488, 'max_depth': 3, 'l2_leaf_reg': 1.6062543081685263, 'subsample': 0.38163010145401816, 'random_strength': 10.801512721309756, 'min_data_in_leaf': 336.9094072537885}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:37,511] Trial 80 finished with value: 0.8507459693900372 and parameters: {'learning_rate': 0.2967690808832415, 'max_depth': 4, 'l2_leaf_reg': 1.5769102540118054, 'subsample': 0.3683396074299298, 'random_strength': 119.50189785234694, 'min_data_in_leaf': 352.7530072728084}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:37,788] Trial 81 finished with value: 0.8495382902162564 and parameters: {'learning_rate': 0.3519982264605916, 'max_depth': 3, 'l2_leaf_reg': 1.491074642500498, 'subsample': 0.39339814284208235, 'random_strength': 9.046452450187513, 'min_data_in_leaf': 338.6181638900527}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:38,153] Trial 82 finished with value: 0.8492305865187221 and parameters: {'learning_rate': 0.442607098792248, 'max_depth': 2, 'l2_leaf_reg': 1.7262120187355963, 'subsample': 0.27234819741871114, 'random_strength': 12.955991756657548, 'min_data_in_leaf': 381.46339177847165}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:38,943] Trial 83 finished with value: 0.8509557323116645 and parameters: {'learning_rate': 0.18048848769077014, 'max_depth': 9, 'l2_leaf_reg': 1.6096921762481307, 'subsample': 0.31232826961375537, 'random_strength': 27.6479977169267, 'min_data_in_leaf': 362.9959768769905}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:39,536] Trial 84 finished with value: 0.8530302259115817 and parameters: {'learning_rate': 0.22708933829872321, 'max_depth': 3, 'l2_leaf_reg': 1.6632316572246608, 'subsample': 0.4206405376875545, 'random_strength': 20.498316514413297, 'min_data_in_leaf': 297.40518856402844}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:40,270] Trial 85 finished with value: 0.844975946670862 and parameters: {'learning_rate': 0.3134824967461136, 'max_depth': 8, 'l2_leaf_reg': 1.458604426862953, 'subsample': 0.46636325980991755, 'random_strength': 172.65958600142233, 'min_data_in_leaf': 308.2224094733233}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:40,626] Trial 86 finished with value: 0.853733548648803 and parameters: {'learning_rate': 0.20516007929387697, 'max_depth': 4, 'l2_leaf_reg': 1.8081881256255257, 'subsample': 0.3669610320292627, 'random_strength': 71.7353570023819, 'min_data_in_leaf': 349.71334841544535}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:40,844] Trial 87 finished with value: 0.847503898351356 and parameters: {'learning_rate': 0.3850725922826087, 'max_depth': 2, 'l2_leaf_reg': 1.3738184718048245, 'subsample': 0.4394316413220218, 'random_strength': 13.073971750314525, 'min_data_in_leaf': 319.76257608276416}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:41,130] Trial 88 finished with value: 0.8509834950512916 and parameters: {'learning_rate': 0.2486495279955683, 'max_depth': 3, 'l2_leaf_reg': 1.3060721511513962, 'subsample': 0.5036611916707044, 'random_strength': 6.582515968741947, 'min_data_in_leaf': 288.24916069068973}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:41,464] Trial 89 finished with value: 0.8411153834882648 and parameters: {'learning_rate': 0.05111799564446984, 'max_depth': 2, 'l2_leaf_reg': 1.537356157859457, 'subsample': 0.3880066956376303, 'random_strength': 4.295324207554229, 'min_data_in_leaf': 266.43818936484587}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:41,615] Trial 90 finished with value: 0.809689504604759 and parameters: {'learning_rate': 0.14849530714748765, 'max_depth': 2, 'l2_leaf_reg': 1.604654407357084, 'subsample': 0.6488514855233791, 'random_strength': 109.07773151115863, 'min_data_in_leaf': 325.72222318586404}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:41,889] Trial 91 finished with value: 0.8535808535808537 and parameters: {'learning_rate': 0.2768348084629538, 'max_depth': 2, 'l2_leaf_reg': 1.7020206569000091, 'subsample': 0.5270113119832779, 'random_strength': 19.279543505451198, 'min_data_in_leaf': 390.8245718531585}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:42,108] Trial 92 finished with value: 0.8445248021519208 and parameters: {'learning_rate': 0.2980320322130359, 'max_depth': 2, 'l2_leaf_reg': 0.9413144021105265, 'subsample': 0.18491093410744158, 'random_strength': 23.462444073545914, 'min_data_in_leaf': 371.68996169817217}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:42,275] Trial 93 finished with value: 0.8438677506474116 and parameters: {'learning_rate': 0.8018049022359041, 'max_depth': 3, 'l2_leaf_reg': 0.7207307279562393, 'subsample': 0.4817377653790264, 'random_strength': 54.11215711265974, 'min_data_in_leaf': 247.83798275985313}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:42,629] Trial 94 finished with value: 0.8526831916662426 and parameters: {'learning_rate': 0.18976503763926902, 'max_depth': 5, 'l2_leaf_reg': 1.0778005788536056, 'subsample': 0.42051153997605584, 'random_strength': 15.934790667198342, 'min_data_in_leaf': 418.39615211072214}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:42,894] Trial 95 finished with value: 0.8465136939713211 and parameters: {'learning_rate': 0.3330384372789757, 'max_depth': 2, 'l2_leaf_reg': 1.9197824427277657, 'subsample': 0.5366007428124753, 'random_strength': 44.67016583545405, 'min_data_in_leaf': 200.40708572900584}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:43,205] Trial 96 finished with value: 0.8505886471988167 and parameters: {'learning_rate': 0.26241859653774113, 'max_depth': 3, 'l2_leaf_reg': 0.6304145123151321, 'subsample': 0.4542962823423417, 'random_strength': 37.17578672829132, 'min_data_in_leaf': 399.3636999400437}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:43,443] Trial 97 finished with value: 0.8340196899518935 and parameters: {'learning_rate': 0.09242477857682523, 'max_depth': 2, 'l2_leaf_reg': 0.0645769669423184, 'subsample': 0.5820960690872515, 'random_strength': 11.65162235424116, 'min_data_in_leaf': 358.5938499475734}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:43,772] Trial 98 finished with value: 0.8560116356726526 and parameters: {'learning_rate': 0.2238295228087367, 'max_depth': 2, 'l2_leaf_reg': 1.4538002550468356, 'subsample': 0.6177874774515071, 'random_strength': 26.070466696158935, 'min_data_in_leaf': 336.943504306358}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:44,246] Trial 99 finished with value: 0.8533078533078534 and parameters: {'learning_rate': 0.12342253771037937, 'max_depth': 3, 'l2_leaf_reg': 1.446914784821623, 'subsample': 0.6193773627486576, 'random_strength': 32.4812721291412, 'min_data_in_leaf': 336.8229861132391}. Best is trial 63 with value: 0.8579966715559936.
Number of finished trials: 100
Best trial:
Value: 0.8579966715559936
Params:
learning_rate: 0.2275828875803787
max_depth: 2
l2_leaf_reg: 1.649969257251929
subsample: 0.41432801712540723
random_strength: 9.871440937029153
min_data_in_leaf: 339.4206240957627
best_cat = CatBoostClassifier(**study.best_params, random_state=random_state)
best_cat.fit(X_train, y_train,
eval_set=(X_valid, y_valid),
verbose=False,
early_stopping_rounds=10
)
res_cat = print_metrics(best_cat, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_cat
| | precision | recall | f1 | roc_auc |
|---|---|---|---|---|
| train | 0.786 | 0.468 | 0.587 | 0.877 |
| valid | 0.825 | 0.464 | 0.594 | 0.855 |
| test | 0.754 | 0.442 | 0.557 | 0.856 |
Interpreting Feature Importance
For gradient boosting we can use SHAP values to interpret feature importance
- The higher a feature appears in the plot, the more important it is
- The redder a point, the higher the value of the feature
SHAP value
- SHAP values represent the contribution of each feature to the model's prediction.
- Positive SHAP values indicate that the feature pushes the prediction up (towards the positive class, i.e. client churn)
Spread of the points
- The spread of the points for each feature shows how variable that feature's influence on the predictions is across instances.
- A wider spread indicates that the feature's effect varies depending on the other factors.
import shap
explainer = shap.TreeExplainer(best_cat)
shap_values = explainer(X_train_valid)
shap.plots.beeswarm(shap_values)
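Besides the beeswarm, the mean absolute SHAP value per feature gives a compact global importance ranking (an optional companion plot):
# global importance: mean |SHAP| per feature
shap.plots.bar(shap_values)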