Look-alike Segmentation
The Look-alike Task¶
Customer Churn Prediction¶
- We will solve the look-alike task by finding the segment of customers who are likely to churn from a bank.
- The dataset contains historical data on customers who left the bank (the target segment), along with data on those who stayed.
- For every other customer in the test set, we need to estimate the probability of churn.
Goal: build a classification model with the highest possible ROC-AUC.
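ROC-AUC, the target metric, can be read as the probability that a randomly chosen churned customer gets a higher score than a randomly chosen retained one. The sketch below checks this reading on made-up labels and scores. The arrays and the helper `pairwise_auc` are illustrative and are not part of the notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative labels (1 = churned) and model scores; not taken from the dataset
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.90, 0.20, 0.60])

def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive against every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

auc_manual = pairwise_auc(y_true, y_score)
auc_sklearn = roc_auc_score(y_true, y_score)
print(auc_manual, auc_sklearn)
```

Because only the ordering of the scores matters, ROC-AUC does not depend on any particular classification threshold.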
import pandas as pd
import numpy as np
import seaborn as sns
import random
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.utils import shuffle
import warnings
warnings.filterwarnings("ignore")
# Show all columns and rows when printing
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# fix random_state as a constant up front
random_state = 47
sns.set(style="whitegrid")
churn = pd.read_csv('/content/Churn_Modelling.csv')
print(churn.shape)
churn.head()
(10000, 14)
RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
Features
- RowNumber: row index in the data
- CustomerId: unique customer identifier
- Surname: surname
- CreditScore: credit score
- Geography: country of residence
- Gender: gender
- Age: age
- Tenure: number of years the person has been a customer of the bank
- Balance: account balance
- NumOfProducts: number of bank products the customer uses
- HasCrCard: whether the customer has a credit card
- IsActiveMember: whether the customer is active
- EstimatedSalary: estimated salary
Target
- Exited: whether the customer left (1 = churned)
First, drop the columns that only identify a customer, so that the models do not overfit to individual users:
churn = churn.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
churn.head()
CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
2 | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
3 | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
Data Visualization¶
# Create the subplots
fig, axes = plt.subplots(4, 2, figsize=(10, 13))
# Histogram of credit scores
sns.histplot(data=churn, x='CreditScore', kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Credit Scores')
# Histogram of account balance
sns.histplot(data=churn, x='Balance', kde=True, ax=axes[0, 1])
axes[0, 1].set_title('Distribution of Balance')
# Histogram of salaries
sns.histplot(data=churn, x='EstimatedSalary', kde=True, ax=axes[1, 0])
axes[1, 0].set_title('Distribution of EstimatedSalary')
# Histogram of age
sns.histplot(data=churn, x='Age', kde=True, ax=axes[1, 1])
axes[1, 1].set_title('Distribution of Age')
# Box plot of balance by country
sns.boxplot(x='Geography', y='Balance', data=churn, ax=axes[2, 0])
axes[2, 0].set_title('Balance by Geography')
# Box plot of salary by gender
sns.boxplot(x='Gender', y='EstimatedSalary', data=churn, ax=axes[2, 1])
axes[2, 1].set_title('Estimated Salary by Gender')
# Pie chart of the Exited column (churned customers)
exited_counts = churn['Exited'].value_counts()
axes[3, 0].pie(exited_counts, labels=exited_counts.index, autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff'])
axes[3, 0].set_title('Exited Distribution')
# Number of active members by country
sns.countplot(x='Geography', hue='IsActiveMember', data=churn, ax=axes[3, 1])
axes[3, 1].set_title('Active Members by Geography')
# Adjust the layout and show the plots
plt.tight_layout()
plt.show()
Encoding¶
The data contain both categorical and numerical features, so an encoder is required. We apply one-hot encoding to the whole dataset and use the result for all models.
Next we will train the following models:
- Logistic regression
- SVM
- Decision tree
- Random forest
- Boosting
# One-hot encoding for logistic regression (it also works for the tree-based models)
churn = pd.get_dummies(churn, drop_first=True)
print(churn.shape)
churn.head()
(10000, 12)
CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 619 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 | False | False | False |
1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | False | True | False |
2 | 502 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | False | False | False |
3 | 699 | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 | False | False | False |
4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 | False | True | False |
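With `drop_first=True`, `pd.get_dummies` drops the first level of every categorical column, which is why France and Female have no column of their own in the table above: they become the baseline that the remaining indicator columns are compared against. A small sketch on a toy frame (the rows are illustrative):

```python
import pandas as pd

# Toy frame with the same two categorical columns as the dataset (rows are illustrative)
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Age': [42, 41, 39, 43],
})

# Levels are ordered alphabetically, so 'France' and 'Female' are the ones dropped
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Dropping one level avoids perfectly collinear indicator columns, which matters most for the linear model.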
Sampling¶
Split the data into subsets:
features = churn.drop(['Exited'], axis=1)
target = churn['Exited']
# class sizes
target.value_counts()
Exited | count |
---|---|
0 | 7963 |
1 | 2037 |
target.mean()
0.2037
# hold out 20% (one fifth of the data) as the test set
X_train_valid, X_test, y_train_valid, y_test = train_test_split(features, target,
test_size=0.2,
random_state=random_state)
# hold out 25% of train+valid (one quarter) as the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid,
test_size=0.25,
random_state=random_state)
s1 = y_train.size
s2 = y_valid.size
s3 = y_test.size
print('Split into train:valid:test in the ratio '
+ str(round(s1/s3)) + ':' + str(round(s2/s3)) + ':' + str(round(s3/s3)))
Split into train:valid:test in the ratio 3:1:1
targets = [y_train, y_train_valid, y_valid, y_test]
names = ['train:', 'train+valid:', 'valid:', 'test:']
print('Class balance across the splits:\n')
# a separate loop variable keeps the `target` Series from being overwritten
for name, y in zip(names, targets):
    pc = y.mean()
    print(name, pc.round(3))
Class balance across the splits:
train: 0.201
train+valid: 0.202
valid: 0.204
test: 0.212
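The share of churned customers drifts a little between the splits above (0.201 on train, 0.212 on test). One possible way to keep it almost identical is the `stratify` argument of `train_test_split`. The sketch below applies the same 3:1:1 scheme to synthetic data; the arrays `X` and `y` are generated here and only mimic the class balance of the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(47)
X = rng.normal(size=(10_000, 3))             # synthetic features
y = (rng.random(10_000) < 0.2).astype(int)   # roughly 20% positives, like Exited

# Same two-step 3:1:1 scheme as above, with stratify= added at both steps
X_tv, X_te, y_tv, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=47)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, stratify=y_tv, random_state=47)

for name, part in [('train', y_tr), ('valid', y_va), ('test', y_te)]:
    print(name, len(part), round(part.mean(), 3))
```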
Scaling¶
- Some of the models chosen for this study are linear, and the quality of linear algorithms depends on the scale of the data, so the features must be normalized.
- If the scale of one feature greatly exceeds the scale of the others, quality can drop sharply.
- For normalization we use feature standardization: take the values of a feature over all objects and compute their mean and standard deviation.
- Then subtract the mean from every value of the feature and divide the difference by the standard deviation. StandardScaler() does exactly this.
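The standardization described above is z = (x - mean) / std, with the mean and standard deviation taken from the training data only. A minimal sketch comparing the manual formula with `StandardScaler` on a few Balance values; the variable names and the "new" customer are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few Balance values used as a tiny "training set", plus one new customer (illustrative)
train = np.array([[0.0], [83807.86], [159660.80], [125510.82]])
new = np.array([[100000.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)            # population std (ddof=0), which StandardScaler also uses
z_manual = (new - mean) / std

scaler = StandardScaler().fit(train)   # parameters come from the training data only
z_sklearn = scaler.transform(new)
print(z_manual, z_sklearn)
```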
# Select the numerical features to standardize
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary']
# Fit the standardization parameters on the training set
scaler = StandardScaler()
scaler.fit(X_train[numeric])
# Transform the samples using the parameters fitted above
X_train[numeric] = scaler.transform(X_train[numeric])
X_valid[numeric] = scaler.transform(X_valid[numeric])
X_test[numeric] = scaler.transform(X_test[numeric])
# Note: fit_transform refits the scaler on train+valid, so this subset is scaled with its own parameters rather than those fitted on train
X_train_valid[numeric] = scaler.fit_transform(X_train_valid[numeric])
X_train[numeric].describe().round(3)
CreditScore | Age | Tenure | Balance | EstimatedSalary | |
---|---|---|---|---|---|
count | 6000.000 | 6000.000 | 6000.000 | 6000.000 | 6000.000 |
mean | 0.000 | 0.000 | 0.000 | -0.000 | -0.000 |
std | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
min | -3.136 | -1.994 | -1.730 | -1.236 | -1.753 |
25% | -0.690 | -0.663 | -0.695 | -1.236 | -0.860 |
50% | 0.007 | -0.187 | -0.004 | 0.335 | 0.009 |
75% | 0.693 | 0.478 | 1.031 | 0.821 | 0.855 |
max | 2.067 | 5.042 | 1.721 | 2.807 | 1.729 |
- The values of the selected numerical features are no longer in their natural units,
- but they now have zero mean and a standard deviation of 1.
# Function for evaluating a model
def calc_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test):
    # Training set
    y_train_pred = model.predict(X_train)
    y_train_proba = model.predict_proba(X_train)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_train)
    # Validation set
    y_valid_pred = model.predict(X_valid)
    y_valid_proba = model.predict_proba(X_valid)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_valid)
    # Test set
    y_test_pred = model.predict(X_test)
    y_test_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_test)
    train_metrics = {
        'precision': precision_score(y_train, y_train_pred),
        'recall': recall_score(y_train, y_train_pred),
        'f1': f1_score(y_train, y_train_pred),
        'roc_auc': roc_auc_score(y_train, y_train_proba)
    }
    valid_metrics = {
        'precision': precision_score(y_valid, y_valid_pred),
        'recall': recall_score(y_valid, y_valid_pred),
        'f1': f1_score(y_valid, y_valid_pred),
        'roc_auc': roc_auc_score(y_valid, y_valid_proba)
    }
    test_metrics = {
        'precision': precision_score(y_test, y_test_pred),
        'recall': recall_score(y_test, y_test_pred),
        'f1': f1_score(y_test, y_test_pred),
        'roc_auc': roc_auc_score(y_test, y_test_proba)
    }
    return train_metrics, valid_metrics, test_metrics
def print_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test):
    res = calc_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test)
    metrics = pd.DataFrame(res, index=['train', 'valid', 'test']).round(3)
    return metrics
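calc_metrics falls back to `decision_function` when a model has no `predict_proba`. This works for ROC-AUC because the metric depends only on the ordering of the scores, not on whether they are probabilities. The sketch below illustrates this on synthetic data; `LinearSVC` is used only as an example of a model without `predict_proba` and is not one of the models in this notebook, and the helper `scores` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=5, random_state=47)

def scores(model, X):
    # Same fallback as in calc_metrics
    if hasattr(model, 'predict_proba'):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)

logreg = LogisticRegression().fit(X, y)
svm = LinearSVC().fit(X, y)   # exposes decision_function but not predict_proba

# Probabilities and raw margins of the same model rank the objects identically
auc_proba = roc_auc_score(y, scores(logreg, X))
auc_margin_logreg = roc_auc_score(y, logreg.decision_function(X))
auc_margin_svm = roc_auc_score(y, scores(svm, X))
print(auc_proba, auc_margin_logreg, auc_margin_svm)
```

Precision, recall, and F1, in contrast, are computed from `predict`, so they do depend on the decision threshold the model applies.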
%%time
"""
Exhaustive search over all parameter combinations
- the best model is selected by ROC-AUC
"""
param_grid = {
    # Regularization type; the default lbfgs solver fits only 'l2', so the other two candidates fail and are scored as NaN
    'penalty': ['l2', 'l1', 'elasticnet'],
    # 'solver': ['lbfgs', 'liblinear', 'saga'],
    'C': np.linspace(0.001, 2, 50) # Inverse regularization strength: a smaller C means a stronger penalty
}
grid_search = GridSearchCV(LogisticRegression(random_state=42),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1)
grid_search.fit(X_train_valid, y_train_valid)
best_logreg = grid_search.best_estimator_
print(best_logreg.get_params())
{'C': 0.04179591836734694, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
CPU times: user 632 ms, sys: 143 ms, total: 774 ms
Wall time: 7.9 s
res_logreg = print_metrics(best_logreg, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_logreg
precision | recall | f1 | roc_auc | |
---|---|---|---|---|
train | 0.615 | 0.201 | 0.303 | 0.774 |
valid | 0.617 | 0.174 | 0.272 | 0.765 |
test | 0.644 | 0.201 | 0.306 | 0.754 |
# Get the model coefficients
coef = best_logreg.coef_[0]
# Build a DataFrame of features and their coefficients
coef_df = pd.DataFrame({
    'Feature': X_train_valid.columns,
    'Coefficient_logreg': coef.round(3)
})
# Add a column with the interpretation of the sign
coef_df['Interpretation_logreg'] = coef_df['Coefficient_logreg'].apply(lambda x: 'Positive' if x > 0 else 'Negative')
Interpreting Feature Importance¶
For a linear model, feature importance can be estimated from the weights: the sign gives the direction of the effect, and for standardized features the absolute value indicates its strength.
- Positive: a positive coefficient means that increasing the feature increases the probability of the positive outcome (churn).
- Negative: a negative coefficient means that increasing the feature decreases the probability of the positive outcome.
coef_df
Feature | Coefficient_logreg | Interpretation_logreg | |
---|---|---|---|
0 | CreditScore | -0.064 | Negative |
1 | Age | 0.738 | Positive |
2 | Tenure | -0.032 | Negative |
3 | Balance | 0.169 | Positive |
4 | NumOfProducts | -0.128 | Negative |
5 | HasCrCard | -0.019 | Negative |
6 | IsActiveMember | -0.975 | Negative |
7 | EstimatedSalary | 0.047 | Positive |
8 | Geography_Germany | 0.707 | Positive |
9 | Geography_Spain | 0.021 | Positive |
10 | Gender_Male | -0.505 | Negative |
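The coefficients in the table above can also be read as odds ratios: in logistic regression, exp(coefficient) is the factor by which the odds of churn change when the feature grows by one unit, with the other features held fixed. For the standardized features one unit is one standard deviation; for the binary features it is the switch from 0 to 1. A short sketch using four coefficients copied from the table:

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above
coef = pd.Series({
    'Age': 0.738,
    'IsActiveMember': -0.975,
    'Geography_Germany': 0.707,
    'Gender_Male': -0.505,
})

# exp(coefficient) = multiplicative change in the odds of churn
odds_ratio = np.exp(coef).round(2)
print(odds_ratio.sort_values(ascending=False))
```

For example, exp(0.738) is about 2.1, so one extra standard deviation of age roughly doubles the odds of churn, while being an active member multiplies them by about 0.38.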
Support Vector Machine¶
%%time
"""
- We assume the data may contain nonlinearity, so we compare a linear kernel with a polynomial one
- probability=True: the model outputs class probabilities
'kernel': 'poly' tells us that the polynomial kernel gives the better result,
which indicates that the data contain nonlinearity
"""
param_grid = {
# 'C': np.linspace(0.001, 2, 50),
# 'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'kernel': ['linear', 'poly']
}
grid_search = GridSearchCV(SVC(random_state=42, probability=True, gamma='scale'),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1)
grid_search.fit(X_train_valid, y_train_valid)
best_SVC = grid_search.best_estimator_
print(best_SVC.get_params())
{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'poly', 'max_iter': -1, 'probability': True, 'random_state': 42, 'shrinking': True, 'tol': 0.001, 'verbose': False}
CPU times: user 9.52 s, sys: 305 ms, total: 9.82 s
Wall time: 52.1 s
res_SVC = print_metrics(best_SVC, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_SVC
precision | recall | f1 | roc_auc | |
---|---|---|---|---|
train | 0.856 | 0.247 | 0.383 | 0.825 |
valid | 0.885 | 0.246 | 0.385 | 0.841 |
test | 0.850 | 0.241 | 0.376 | 0.812 |
Interpreting Feature Importance¶
Permutation Importance
- We cannot interpret the weights of a polynomial-kernel model the way we did for the linear model
- Instead we use a different approach, permutation importance
- It measures the importance of a feature in the context of the fitted model
- Take a feature h
- PI shuffles all the values in that column (in effect, it corrupts the column)
- The already fitted model is scored on the data with the corrupted feature; it is not retrained
- The resulting quality is recorded
- The same is done for every other column
- The more the quality drops when a column is shuffled, the larger that feature's contribution to the model
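The steps above can be sketched by hand. This is a minimal illustration on synthetic data with a logistic regression standing in for the SVM; the function `manual_permutation_importance` and its parameters are hypothetical, and ROC-AUC is chosen here as the quality measure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the 2 informative features come first, the other 3 are noise
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=47)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=47)
model = LogisticRegression().fit(X_tr, y_tr)   # fitted once, never retrained below

def manual_permutation_importance(model, X, y, n_repeats=5, seed=47):
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # corrupt one column
            score = roc_auc_score(y, model.predict_proba(X_perm)[:, 1])
            drops[j] += (baseline - score) / n_repeats      # average drop in quality
    return drops

importances = manual_permutation_importance(model, X_te, y_te)
print(importances.round(3))
```

Averaging over several shuffles reduces the noise that a single random permutation would introduce.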
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(best_SVC, X_test, y_test)
features = np.array(X_test.columns)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(features[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance");
Decision Tree¶
Parameters
- criterion: the criterion used to choose each split; it determines how the quality of a split is measured
- max_depth: maximum depth of the tree
- min_samples_leaf: minimum number of samples required in a leaf
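To see what these constraints do, the sketch below fits a tree on synthetic data and inspects the fitted structure. The parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=47)

tree = DecisionTreeClassifier(criterion='entropy', max_depth=6,
                              min_samples_leaf=9, random_state=42).fit(X, y)

# Leaves are the nodes without children (children_left == -1 in the fitted tree_)
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print('depth:', tree.get_depth())           # never exceeds max_depth
print('smallest leaf:', leaf_sizes.min())   # never below min_samples_leaf
```

Both parameters limit how finely the tree can partition the training data, which is why they are the usual first choice for controlling overfitting.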
%%time
param_grid = {
'criterion': ['gini', 'entropy', 'log_loss'],
# 'splitter': ['best', 'random'],
'max_depth': range(1, 11),
# 'min_samples_split': range(2, 10),
'min_samples_leaf': range(2, 10)
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1)
grid_search.fit(X_train_valid, y_train_valid)
best_tree = grid_search.best_estimator_
print(best_tree.get_params())
{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 6, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 9, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': 42, 'splitter': 'best'}
CPU times: user 1.17 s, sys: 127 ms, total: 1.29 s
Wall time: 31.1 s
res_tree = print_metrics(best_tree, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_tree
precision | recall | f1 | roc_auc | |
---|---|---|---|---|
train | 0.767 | 0.485 | 0.594 | 0.868 |
valid | 0.813 | 0.459 | 0.587 | 0.869 |
test | 0.744 | 0.440 | 0.553 | 0.841 |
def plot_importance(model, X):
    # Get the feature importances
    feature_importances = model.feature_importances_
    # Build a DataFrame of features and their importances
    importance_df = pd.DataFrame({
        'Feature': X.columns,
        'Importance': feature_importances
    })
    # Sort the DataFrame by importance
    importance_df = importance_df.sort_values(by='Importance', ascending=False)
    # Plot the feature importances
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Importance', y='Feature', data=importance_df)
    # Use the model's class name so the title is also correct for the random forest
    plt.title(f'Feature Importances in {type(model).__name__}')
    plt.show()
plot_importance(best_tree, X_train_valid)
!pip install optuna -qqq
import optuna
def objective(trial):
    param = {
        'criterion': trial.suggest_categorical("criterion", ['gini', 'entropy', 'log_loss']),
        'n_estimators': trial.suggest_int("n_estimators", 10, 100), # note: the number of trees could be reduced
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        'min_samples_split': trial.suggest_int("min_samples_split", 2, 10),
        'min_samples_leaf': trial.suggest_int("min_samples_leaf", 2, 10),
    }
    model = RandomForestClassifier(**param, random_state=random_state)
    model.fit(X_train, y_train)
    preds = model.predict_proba(X_valid)[:,1]
    auc = roc_auc_score(y_valid, preds)
    return auc
study = optuna.create_study(direction="maximize", study_name='RandomForestClassifier')
study.optimize(objective, n_trials=10) # try increasing
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial
print("Value: {}".format(trial.value))
print("Params: ")
for key, value in trial.params.items():
    print(" {}: {}".format(key, value))
[I 2024-08-14 03:46:14,901] A new study created in memory with name: RandomForestClassifier
[I 2024-08-14 03:46:15,008] Trial 0 finished with value: 0.8334212486754861 and parameters: {'criterion': 'log_loss', 'n_estimators': 15, 'max_depth': 5, 'min_samples_split': 8, 'min_samples_leaf': 8}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:15,213] Trial 1 finished with value: 0.8172085799204443 and parameters: {'criterion': 'gini', 'n_estimators': 38, 'max_depth': 2, 'min_samples_split': 8, 'min_samples_leaf': 10}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:15,408] Trial 2 finished with value: 0.8193139210088363 and parameters: {'criterion': 'log_loss', 'n_estimators': 17, 'max_depth': 3, 'min_samples_split': 5, 'min_samples_leaf': 9}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:16,070] Trial 3 finished with value: 0.8264767078326399 and parameters: {'criterion': 'log_loss', 'n_estimators': 61, 'max_depth': 2, 'min_samples_split': 3, 'min_samples_leaf': 6}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:16,818] Trial 4 finished with value: 0.8437150555794624 and parameters: {'criterion': 'log_loss', 'n_estimators': 53, 'max_depth': 5, 'min_samples_split': 2, 'min_samples_leaf': 7}. Best is trial 4 with value: 0.8437150555794624.
[I 2024-08-14 03:46:18,630] Trial 5 finished with value: 0.8498282566079176 and parameters: {'criterion': 'gini', 'n_estimators': 68, 'max_depth': 9, 'min_samples_split': 6, 'min_samples_leaf': 9}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:18,932] Trial 6 finished with value: 0.8170481729803764 and parameters: {'criterion': 'log_loss', 'n_estimators': 13, 'max_depth': 3, 'min_samples_split': 9, 'min_samples_leaf': 7}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:19,959] Trial 7 finished with value: 0.8418210197871214 and parameters: {'criterion': 'log_loss', 'n_estimators': 65, 'max_depth': 4, 'min_samples_split': 9, 'min_samples_leaf': 5}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:20,197] Trial 8 finished with value: 0.8422798761781811 and parameters: {'criterion': 'log_loss', 'n_estimators': 15, 'max_depth': 6, 'min_samples_split': 8, 'min_samples_leaf': 4}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:21,125] Trial 9 finished with value: 0.8426600714736308 and parameters: {'criterion': 'log_loss', 'n_estimators': 96, 'max_depth': 4, 'min_samples_split': 5, 'min_samples_leaf': 10}. Best is trial 5 with value: 0.8498282566079176.
Number of finished trials: 10
Best trial:
Value: 0.8498282566079176
Params:
 criterion: gini
 n_estimators: 68
 max_depth: 9
 min_samples_split: 6
 min_samples_leaf: 9
best_forest = RandomForestClassifier(**study.best_params, random_state=random_state)
best_forest.fit(X_train_valid, y_train_valid)
res_forest = print_metrics(best_forest, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_forest
precision | recall | f1 | roc_auc | |
---|---|---|---|---|
train | 0.863 | 0.463 | 0.603 | 0.914 |
valid | 0.894 | 0.477 | 0.622 | 0.908 |
test | 0.795 | 0.411 | 0.542 | 0.857 |
Interpreting Feature Importance¶
plot_importance(best_forest, X_train_valid)
plot_tree(best_forest.estimators_[2], 'random_forest_tree', X_train_valid)
Gradient Boosting¶
Choosing hyperparameters
learning_rate : the learning rate, which controls how strongly the model is updated at each step
max_depth : maximum depth of the decision trees
l2_leaf_reg : L2 regularization applied to the leaf values
subsample : fraction of the sample used to train each tree
random_strength
- The parameter that controls randomness while the trees are built
- It determines how strongly random perturbations affect the choice of split at each node of a tree
min_data_in_leaf : minimum number of samples required to form a leaf in a tree.
from catboost import CatBoostClassifier
Defining the Objective Function¶
We need to define the function that we are going to optimize
def objective(trial):
    param = {
        "learning_rate": trial.suggest_float('learning_rate', 0.01, 0.9),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "l2_leaf_reg":trial.suggest_float('l2_leaf_reg', 0.01, 2),
        "subsample": trial.suggest_float('subsample', 0.01, 1),
        "random_strength": trial.suggest_float('random_strength', 1, 200),
        # sampled as a float here, although CatBoost documents min_data_in_leaf as an integer count
        "min_data_in_leaf":trial.suggest_float('min_data_in_leaf', 1, 500)
    }
    cat = CatBoostClassifier(
        logging_level="Silent",
        eval_metric="AUC",
        grow_policy="Lossguide",
        random_seed=42,
        **param)
    # Stop when validation AUC has not improved for 10 rounds
    cat.fit(X_train, y_train,
            eval_set=(X_valid, y_valid),
            verbose=False,
            early_stopping_rounds=10
            )
    preds = cat.predict_proba(X_valid)[:,1]
    auc = roc_auc_score(y_valid, preds)
    return auc
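The objective above relies on early stopping: training halts once the validation AUC has not improved for 10 rounds. A rough sketch of that idea follows, using scikit-learn's `GradientBoostingClassifier` as a stand-in for CatBoost and synthetic data. For simplicity the model here is trained in full and its stages are scanned afterwards, whereas CatBoost stops during training. The variable `patience` plays the role of `early_stopping_rounds`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=47)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=47)

# Stand-in for CatBoost: scikit-learn boosting with comparable knobs
gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.3, max_depth=3,
                                subsample=0.5, random_state=42).fit(X_tr, y_tr)

patience = 10   # plays the role of early_stopping_rounds
best_auc, best_iter = -np.inf, 0
# staged_predict_proba yields validation predictions after each boosting round
for i, proba in enumerate(gb.staged_predict_proba(X_va), start=1):
    auc = roc_auc_score(y_va, proba[:, 1])
    if auc > best_auc:
        best_auc, best_iter = auc, i
    elif i - best_iter >= patience:
        break   # no improvement for `patience` rounds

print(best_iter, round(best_auc, 4))
```

With a large learning rate the best round tends to arrive early, which is one reason the learning rate and the number of rounds are usually tuned together.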
Create the study object and run the optimization
study = optuna.create_study(direction="maximize", study_name='CatBoostClassifier')
study.optimize(objective, n_trials=100) # try increasing
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial
print("Value: {}".format(trial.value))
print("Params: ")
for key, value in trial.params.items():
    print(" {}: {}".format(key, value))
[I 2024-08-14 03:46:44,732] A new study created in memory with name: CatBoostClassifier [I 2024-08-14 03:46:44,948] Trial 0 finished with value: 0.5 and parameters: {'learning_rate': 0.2165534023121395, 'max_depth': 2, 'l2_leaf_reg': 1.5477322039068058, 'subsample': 0.029345419918240072, 'random_strength': 6.5809964961892495, 'min_data_in_leaf': 477.52416517082804}. Best is trial 0 with value: 0.5. [I 2024-08-14 03:46:45,339] Trial 1 finished with value: 0.8372625321777863 and parameters: {'learning_rate': 0.789728112113311, 'max_depth': 4, 'l2_leaf_reg': 0.6004269201561995, 'subsample': 0.55359191980997, 'random_strength': 46.57861106355321, 'min_data_in_leaf': 425.7303912933811}. Best is trial 1 with value: 0.8372625321777863. [I 2024-08-14 03:46:45,992] Trial 2 finished with value: 0.8485835604479672 and parameters: {'learning_rate': 0.4634724182719374, 'max_depth': 6, 'l2_leaf_reg': 0.575436612277608, 'subsample': 0.42186545862300934, 'random_strength': 123.97404853733563, 'min_data_in_leaf': 98.4208597017056}. Best is trial 2 with value: 0.8485835604479672. [I 2024-08-14 03:46:48,201] Trial 3 finished with value: 0.8515032752320888 and parameters: {'learning_rate': 0.09543005782339105, 'max_depth': 9, 'l2_leaf_reg': 1.2167151901623765, 'subsample': 0.06245537706270228, 'random_strength': 61.228218388468974, 'min_data_in_leaf': 38.39023875452006}. Best is trial 3 with value: 0.8515032752320888. [I 2024-08-14 03:46:48,708] Trial 4 finished with value: 0.8406711796542305 and parameters: {'learning_rate': 0.7331013466248556, 'max_depth': 6, 'l2_leaf_reg': 0.012765794151882687, 'subsample': 0.14335624784813372, 'random_strength': 73.50485090264708, 'min_data_in_leaf': 380.00905654802966}. Best is trial 3 with value: 0.8515032752320888. 
[I 2024-08-14 03:46:50,233] Trial 5 finished with value: 0.8435615893243013 and parameters: {'learning_rate': 0.23370405202769556, 'max_depth': 9, 'l2_leaf_reg': 0.16497446248632272, 'subsample': 0.8752850397650734, 'random_strength': 111.64581392121151, 'min_data_in_leaf': 388.6064442254268}. Best is trial 3 with value: 0.8515032752320888. [I 2024-08-14 03:46:51,271] Trial 6 finished with value: 0.8226369667047633 and parameters: {'learning_rate': 0.04122993099237723, 'max_depth': 2, 'l2_leaf_reg': 0.20284874078763926, 'subsample': 0.6367717313128627, 'random_strength': 161.4305343264782, 'min_data_in_leaf': 18.271195536437006}. Best is trial 3 with value: 0.8515032752320888. [I 2024-08-14 03:46:51,778] Trial 7 finished with value: 0.8371075235482015 and parameters: {'learning_rate': 0.8582564268296371, 'max_depth': 7, 'l2_leaf_reg': 0.6159108475992469, 'subsample': 0.6430359063610354, 'random_strength': 194.77206434441388, 'min_data_in_leaf': 395.81326570058985}. Best is trial 3 with value: 0.8515032752320888. [I 2024-08-14 03:46:52,106] Trial 8 finished with value: 0.8514030208945464 and parameters: {'learning_rate': 0.6220500822136842, 'max_depth': 5, 'l2_leaf_reg': 0.6614202111677746, 'subsample': 0.3763622594650606, 'random_strength': 1.5372606636099502, 'min_data_in_leaf': 210.81684346685665}. Best is trial 3 with value: 0.8515032752320888. [I 2024-08-14 03:46:53,023] Trial 9 finished with value: 0.8379396345498039 and parameters: {'learning_rate': 0.7670675853970256, 'max_depth': 8, 'l2_leaf_reg': 1.8403233110251145, 'subsample': 0.33221049473597575, 'random_strength': 76.59908186125371, 'min_data_in_leaf': 220.27667189607422}. Best is trial 3 with value: 0.8515032752320888. 
[I 2024-08-14 03:46:53,795] Trial 10 finished with value: 0.8352975471619539 and parameters: {'learning_rate': 0.03939269976022153, 'max_depth': 10, 'l2_leaf_reg': 1.2239191289705484, 'subsample': 0.1971957977790358, 'random_strength': 45.97984119612045, 'min_data_in_leaf': 117.68547624545266}. Best is trial 3 with value: 0.8515032752320888. [I 2024-08-14 03:46:54,182] Trial 11 finished with value: 0.8534204466407855 and parameters: {'learning_rate': 0.5292947899032613, 'max_depth': 4, 'l2_leaf_reg': 1.0562715822887339, 'subsample': 0.29238467023057874, 'random_strength': 1.0080155926863474, 'min_data_in_leaf': 217.05912068009718}. Best is trial 11 with value: 0.8534204466407855. [I 2024-08-14 03:46:54,925] Trial 12 finished with value: 0.854156159240905 and parameters: {'learning_rate': 0.46667965752333035, 'max_depth': 4, 'l2_leaf_reg': 1.1034539045748595, 'subsample': 0.2177545543283828, 'random_strength': 33.69191424519235, 'min_data_in_leaf': 287.6354438748095}. Best is trial 12 with value: 0.854156159240905. [I 2024-08-14 03:46:55,261] Trial 13 finished with value: 0.8481239328696957 and parameters: {'learning_rate': 0.47388030842524104, 'max_depth': 4, 'l2_leaf_reg': 0.9841376098092054, 'subsample': 0.23796767505443062, 'random_strength': 30.559857725987595, 'min_data_in_leaf': 304.8772947881302}. Best is trial 12 with value: 0.854156159240905. [I 2024-08-14 03:46:55,774] Trial 14 finished with value: 0.8512302749590884 and parameters: {'learning_rate': 0.5677435867325714, 'max_depth': 4, 'l2_leaf_reg': 0.9858935858858828, 'subsample': 0.2880787724849706, 'random_strength': 19.641914653188316, 'min_data_in_leaf': 296.50997464128113}. Best is trial 12 with value: 0.854156159240905. 
[I 2024-08-14 03:46:56,213] Trial 15 finished with value: 0.8494503748741036 and parameters: {'learning_rate': 0.37166572152717614, 'max_depth': 3, 'l2_leaf_reg': 1.4583249930757494, 'subsample': 0.9955315242420071, 'random_strength': 88.02174705220291, 'min_data_in_leaf': 179.3738772355934}. Best is trial 12 with value: 0.854156159240905. [I 2024-08-14 03:46:57,055] Trial 16 finished with value: 0.8559885000562966 and parameters: {'learning_rate': 0.3210310795870268, 'max_depth': 5, 'l2_leaf_reg': 1.9094658324753917, 'subsample': 0.4804851220395184, 'random_strength': 33.49522734292147, 'min_data_in_leaf': 287.9109110478402}. Best is trial 16 with value: 0.8559885000562966. [I 2024-08-14 03:46:57,763] Trial 17 finished with value: 0.847795407117441 and parameters: {'learning_rate': 0.35378097903239786, 'max_depth': 5, 'l2_leaf_reg': 1.9401989755988598, 'subsample': 0.48416569735771703, 'random_strength': 40.07318375520298, 'min_data_in_leaf': 303.24367032588754}. Best is trial 16 with value: 0.8559885000562966. [I 2024-08-14 03:46:58,649] Trial 18 finished with value: 0.8543381594229051 and parameters: {'learning_rate': 0.3270310046948177, 'max_depth': 5, 'l2_leaf_reg': 1.6932697259348661, 'subsample': 0.79447929672279, 'random_strength': 136.12105973617452, 'min_data_in_leaf': 329.93271902824506}. Best is trial 16 with value: 0.8559885000562966. [I 2024-08-14 03:46:59,576] Trial 19 finished with value: 0.8500765788901381 and parameters: {'learning_rate': 0.2708829352692093, 'max_depth': 7, 'l2_leaf_reg': 1.7038356958216352, 'subsample': 0.7632729569486497, 'random_strength': 138.4597489310068, 'min_data_in_leaf': 343.8985733074137}. Best is trial 16 with value: 0.8559885000562966. 
[I 2024-08-14 03:47:00,116] Trial 20 finished with value: 0.8531150565048871 and parameters: {'learning_rate': 0.3626267613161781, 'max_depth': 5, 'l2_leaf_reg': 1.9795846165177058, 'subsample': 0.7827032985899299, 'random_strength': 142.73839861496384, 'min_data_in_leaf': 493.0493076589869}. Best is trial 16 with value: 0.8559885000562966. [I 2024-08-14 03:47:00,940] Trial 21 finished with value: 0.8563910597808904 and parameters: {'learning_rate': 0.16961344589594285, 'max_depth': 3, 'l2_leaf_reg': 1.649834982836516, 'subsample': 0.5749517198100278, 'random_strength': 99.89102990362149, 'min_data_in_leaf': 271.7012421453372}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:02,167] Trial 22 finished with value: 0.8344091394938851 and parameters: {'learning_rate': 0.14552950442407062, 'max_depth': 3, 'l2_leaf_reg': 1.6649232359966202, 'subsample': 0.5985514802507333, 'random_strength': 169.99871764933434, 'min_data_in_leaf': 261.3080549135861}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:04,114] Trial 23 finished with value: 0.8545186172304817 and parameters: {'learning_rate': 0.15793176071004453, 'max_depth': 3, 'l2_leaf_reg': 1.3526112935873842, 'subsample': 0.7175761381145475, 'random_strength': 113.99285178004443, 'min_data_in_leaf': 352.6566495237683}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:06,560] Trial 24 finished with value: 0.8553638384146858 and parameters: {'learning_rate': 0.1556921373541092, 'max_depth': 3, 'l2_leaf_reg': 1.4115008286364423, 'subsample': 0.480391200689659, 'random_strength': 100.39425793454086, 'min_data_in_leaf': 350.01711545220127}. Best is trial 21 with value: 0.8563910597808904. 
[I 2024-08-14 03:47:08,806] Trial 25 finished with value: 0.8480560683950514 and parameters: {'learning_rate': 0.16409141193428203, 'max_depth': 2, 'l2_leaf_reg': 1.496137374339839, 'subsample': 0.46620575811523923, 'random_strength': 91.45570082665543, 'min_data_in_leaf': 163.57983997102036}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:10,275] Trial 26 finished with value: 0.8544893121164308 and parameters: {'learning_rate': 0.2837819547192468, 'max_depth': 3, 'l2_leaf_reg': 1.8020953679362315, 'subsample': 0.5189916467893968, 'random_strength': 101.08298523619007, 'min_data_in_leaf': 253.22233056776207}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:10,680] Trial 27 finished with value: 0.8211246685822957 and parameters: {'learning_rate': 0.1044556356752651, 'max_depth': 2, 'l2_leaf_reg': 1.3459091803909486, 'subsample': 0.431619565523395, 'random_strength': 59.294142417791434, 'min_data_in_leaf': 449.8387981624195}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:10,968] Trial 28 finished with value: 0.8035146085993543 and parameters: {'learning_rate': 0.21613547632724206, 'max_depth': 3, 'l2_leaf_reg': 1.6092822334057308, 'subsample': 0.5676045993284018, 'random_strength': 72.80957980792742, 'min_data_in_leaf': 263.0893593280764}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:11,494] Trial 29 finished with value: 0.8547792785080921 and parameters: {'learning_rate': 0.2086592133839792, 'max_depth': 2, 'l2_leaf_reg': 0.819697698152623, 'subsample': 0.6834571224386156, 'random_strength': 19.984681853813314, 'min_data_in_leaf': 453.6832846244588}. Best is trial 21 with value: 0.8563910597808904. 
[I 2024-08-14 03:47:12,284] Trial 30 finished with value: 0.8367967351018197 and parameters: {'learning_rate': 0.3944960136703665, 'max_depth': 7, 'l2_leaf_reg': 1.8676872612497761, 'subsample': 0.38308326997649067, 'random_strength': 154.4491135123389, 'min_data_in_leaf': 332.6615450443436}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:12,895] Trial 31 finished with value: 0.8546836512938208 and parameters: {'learning_rate': 0.21596851649554094, 'max_depth': 2, 'l2_leaf_reg': 0.8664350791512772, 'subsample': 0.6869620770295259, 'random_strength': 24.469562538665006, 'min_data_in_leaf': 444.9760164258388}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:13,411] Trial 32 finished with value: 0.8548571684164904 and parameters: {'learning_rate': 0.28016951231760356, 'max_depth': 2, 'l2_leaf_reg': 0.806914526399829, 'subsample': 0.5400321582582661, 'random_strength': 16.595337724789662, 'min_data_in_leaf': 425.5676845159874}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:14,332] Trial 33 finished with value: 0.8465121515968972 and parameters: {'learning_rate': 0.2943832841669326, 'max_depth': 3, 'l2_leaf_reg': 1.4002865433211302, 'subsample': 0.5244965679173971, 'random_strength': 55.18009770759243, 'min_data_in_leaf': 411.0478307080825}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:15,095] Trial 34 finished with value: 0.8532785481938023 and parameters: {'learning_rate': 0.41204275970364945, 'max_depth': 4, 'l2_leaf_reg': 1.5580897723724074, 'subsample': 0.588510970323773, 'random_strength': 119.79281757796326, 'min_data_in_leaf': 388.1140383915803}. Best is trial 21 with value: 0.8563910597808904. 
[I 2024-08-14 03:47:15,717] Trial 35 finished with value: 0.8286869303818456 and parameters: {'learning_rate': 0.08975166153572325, 'max_depth': 2, 'l2_leaf_reg': 1.2419086491253597, 'subsample': 0.44346103753859356, 'random_strength': 12.29947978876556, 'min_data_in_leaf': 364.58824440904306}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:16,582] Trial 36 finished with value: 0.852669310296429 and parameters: {'learning_rate': 0.250033922012678, 'max_depth': 6, 'l2_leaf_reg': 0.42864672108139, 'subsample': 0.5268262007033628, 'random_strength': 95.42973443090546, 'min_data_in_leaf': 429.3365607018824}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:17,024] Trial 37 finished with value: 0.8127665415801009 and parameters: {'learning_rate': 0.1691762432733271, 'max_depth': 3, 'l2_leaf_reg': 1.7829743976227164, 'subsample': 0.3765836616592204, 'random_strength': 105.88486237007761, 'min_data_in_leaf': 475.9380950106968}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:17,801] Trial 38 finished with value: 0.852903751208836 and parameters: {'learning_rate': 0.3144852057758997, 'max_depth': 5, 'l2_leaf_reg': 0.7468205513730214, 'subsample': 0.6288168983269199, 'random_strength': 81.03884687055434, 'min_data_in_leaf': 278.02630322886773}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:18,104] Trial 39 finished with value: 0.7429039208700227 and parameters: {'learning_rate': 0.0916001703115347, 'max_depth': 2, 'l2_leaf_reg': 0.3209407231862438, 'subsample': 0.48947279248425846, 'random_strength': 63.425336876240344, 'min_data_in_leaf': 312.50209443952326}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:18,299] Trial 40 finished with value: 0.5 and parameters: {'learning_rate': 0.6819438096729318, 'max_depth': 6, 'l2_leaf_reg': 1.1696390615306496, 'subsample': 0.02690876704491979, 'random_strength': 186.05663001448391, 'min_data_in_leaf': 237.24363431981527}. 
Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:18,628] Trial 41 finished with value: 0.8275224376919292 and parameters: {'learning_rate': 0.20539423781999722, 'max_depth': 2, 'l2_leaf_reg': 0.7875365906927447, 'subsample': 0.7197475219510165, 'random_strength': 14.4847764719263, 'min_data_in_leaf': 469.9772033903035}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:19,489] Trial 42 finished with value: 0.8467774399977791 and parameters: {'learning_rate': 0.11258151909406194, 'max_depth': 2, 'l2_leaf_reg': 0.8721010222525916, 'subsample': 0.6594836798807467, 'random_strength': 48.89095313702967, 'min_data_in_leaf': 402.73822377455843}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:20,194] Trial 43 finished with value: 0.8518981230845636 and parameters: {'learning_rate': 0.19221299265782907, 'max_depth': 3, 'l2_leaf_reg': 0.4867586865668939, 'subsample': 0.5571688603077394, 'random_strength': 31.93422189548409, 'min_data_in_leaf': 363.31928598127024}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:22,996] Trial 44 finished with value: 0.8487424250136115 and parameters: {'learning_rate': 0.039144856136337974, 'max_depth': 4, 'l2_leaf_reg': 0.9009428480096298, 'subsample': 0.8375833862722262, 'random_strength': 6.814579833825066, 'min_data_in_leaf': 432.94909658576285}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:23,939] Trial 45 finished with value: 0.8486205774341369 and parameters: {'learning_rate': 0.2510132838894006, 'max_depth': 2, 'l2_leaf_reg': 0.6907636418579387, 'subsample': 0.6280995182686069, 'random_strength': 24.31951969212401, 'min_data_in_leaf': 374.4772844339267}. Best is trial 21 with value: 0.8563910597808904. 
[I 2024-08-14 03:47:25,057] Trial 46 finished with value: 0.8486591367947299 and parameters: {'learning_rate': 0.4177823887952824, 'max_depth': 3, 'l2_leaf_reg': 0.5837206207668852, 'subsample': 0.10938321642670351, 'random_strength': 41.931702968776605, 'min_data_in_leaf': 178.84495282770186}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:25,696] Trial 47 finished with value: 0.823857756061146 and parameters: {'learning_rate': 0.019782576463977575, 'max_depth': 4, 'l2_leaf_reg': 1.9024552651741735, 'subsample': 0.40297843167975866, 'random_strength': 66.82247683667727, 'min_data_in_leaf': 316.8543796002272}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:26,796] Trial 48 finished with value: 0.8527418018943443 and parameters: {'learning_rate': 0.12284365746995335, 'max_depth': 8, 'l2_leaf_reg': 1.5884344189532063, 'subsample': 0.3276266788784489, 'random_strength': 51.29989880997462, 'min_data_in_leaf': 459.0054658582284}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:27,241] Trial 49 finished with value: 0.8540805828941422 and parameters: {'learning_rate': 0.3299775712117529, 'max_depth': 3, 'l2_leaf_reg': 1.0878543895513753, 'subsample': 0.6942586059012404, 'random_strength': 11.713670742817778, 'min_data_in_leaf': 495.2165248684829}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:27,758] Trial 50 finished with value: 0.8472201014573897 and parameters: {'learning_rate': 0.25324978223894085, 'max_depth': 4, 'l2_leaf_reg': 1.7363552106408617, 'subsample': 0.5877810779665728, 'random_strength': 37.972565443197155, 'min_data_in_leaf': 412.6914975963677}. Best is trial 21 with value: 0.8563910597808904. 
[I 2024-08-14 03:47:28,249] Trial 51 finished with value: 0.8474005592649662 and parameters: {'learning_rate': 0.19612243767630727, 'max_depth': 2, 'l2_leaf_reg': 0.8917975193826132, 'subsample': 0.6848753748976462, 'random_strength': 24.324814512268816, 'min_data_in_leaf': 436.96437349015264}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:28,631] Trial 52 finished with value: 0.8247153162407398 and parameters: {'learning_rate': 0.06857413116836153, 'max_depth': 2, 'l2_leaf_reg': 0.8093899327079714, 'subsample': 0.8856644877809534, 'random_strength': 22.15656724840708, 'min_data_in_leaf': 454.16888064861433}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:28,958] Trial 53 finished with value: 0.8552512450817537 and parameters: {'learning_rate': 0.2277359380751674, 'max_depth': 2, 'l2_leaf_reg': 1.0044730574652196, 'subsample': 0.5426082432903984, 'random_strength': 2.092346231014936, 'min_data_in_leaf': 283.86020680504697}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:29,458] Trial 54 finished with value: 0.8545217019793291 and parameters: {'learning_rate': 0.3128193984950219, 'max_depth': 3, 'l2_leaf_reg': 1.008665858224686, 'subsample': 0.47370734259106645, 'random_strength': 6.662500258650947, 'min_data_in_leaf': 283.74741280019464}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:29,754] Trial 55 finished with value: 0.8421464607905287 and parameters: {'learning_rate': 0.8773525297785807, 'max_depth': 2, 'l2_leaf_reg': 1.149059206077042, 'subsample': 0.5453409140242778, 'random_strength': 128.55684672230558, 'min_data_in_leaf': 233.31803747702202}. Best is trial 21 with value: 0.8563910597808904. 
[I 2024-08-14 03:47:30,145] Trial 56 finished with value: 0.8550931517033213 and parameters: {'learning_rate': 0.13656066626789937, 'max_depth': 2, 'l2_leaf_reg': 1.2919909470689186, 'subsample': 0.43933745123796614, 'random_strength': 1.7054069646826093, 'min_data_in_leaf': 62.17289685684315}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:30,493] Trial 57 finished with value: 0.8527880731270562 and parameters: {'learning_rate': 0.13041630646518754, 'max_depth': 10, 'l2_leaf_reg': 1.274378826982479, 'subsample': 0.34334490211186264, 'random_strength': 4.2565201685199305, 'min_data_in_leaf': 71.55126556167352}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:30,682] Trial 58 finished with value: 0.8514184446387836 and parameters: {'learning_rate': 0.16594689210860503, 'max_depth': 3, 'l2_leaf_reg': 1.4888320186002333, 'subsample': 0.4376508602553443, 'random_strength': 1.3885552941461277, 'min_data_in_leaf': 152.70173531249966}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:31,007] Trial 59 finished with value: 0.8540620744010574 and parameters: {'learning_rate': 0.2750010284884532, 'max_depth': 5, 'l2_leaf_reg': 1.2746287691142977, 'subsample': 0.5071138630239854, 'random_strength': 16.021246309606525, 'min_data_in_leaf': 8.92644330176222}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:31,275] Trial 60 finished with value: 0.8236294846464337 and parameters: {'learning_rate': 0.06580145985259861, 'max_depth': 4, 'l2_leaf_reg': 1.4207580932038155, 'subsample': 0.2559732452953547, 'random_strength': 34.10123336167225, 'min_data_in_leaf': 35.04352061528167}. Best is trial 21 with value: 0.8563910597808904. 
[I 2024-08-14 03:47:31,513] Trial 61 finished with value: 0.8484570857452214 and parameters: {'learning_rate': 0.23125001722207209, 'max_depth': 2, 'l2_leaf_reg': 0.951125702424557, 'subsample': 0.45236883411477885, 'random_strength': 16.37629361091796, 'min_data_in_leaf': 276.32032283099494}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:31,848] Trial 62 finished with value: 0.8491604084824422 and parameters: {'learning_rate': 0.18521171522349617, 'max_depth': 2, 'l2_leaf_reg': 1.0233833614681076, 'subsample': 0.6028311345426446, 'random_strength': 27.84478190364073, 'min_data_in_leaf': 212.63167127467798}. Best is trial 21 with value: 0.8563910597808904. [I 2024-08-14 03:47:32,139] Trial 63 finished with value: 0.8579966715559936 and parameters: {'learning_rate': 0.2275828875803787, 'max_depth': 2, 'l2_leaf_reg': 1.649969257251929, 'subsample': 0.41432801712540723, 'random_strength': 9.871440937029153, 'min_data_in_leaf': 339.4206240957627}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:32,353] Trial 64 finished with value: 0.8410583156345868 and parameters: {'learning_rate': 0.15443576863058872, 'max_depth': 3, 'l2_leaf_reg': 1.6356148561301187, 'subsample': 0.40829684886366496, 'random_strength': 8.920803659606767, 'min_data_in_leaf': 329.25270044657947}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:32,559] Trial 65 finished with value: 0.8475794746981188 and parameters: {'learning_rate': 0.52262191640972, 'max_depth': 2, 'l2_leaf_reg': 1.7570515967211897, 'subsample': 0.35209745591260677, 'random_strength': 79.31033248169595, 'min_data_in_leaf': 298.8152094762548}. Best is trial 63 with value: 0.8579966715559936. 
[I 2024-08-14 03:47:32,817] Trial 66 finished with value: 0.8545409816596259 and parameters: {'learning_rate': 0.3397594287720047, 'max_depth': 3, 'l2_leaf_reg': 1.51439591232287, 'subsample': 0.48690121670146136, 'random_strength': 113.64174712891179, 'min_data_in_leaf': 344.9705847573752}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:33,180] Trial 67 finished with value: 0.8527741917572427 and parameters: {'learning_rate': 0.2872741864314234, 'max_depth': 6, 'l2_leaf_reg': 1.9758241623536021, 'subsample': 0.30706088302917706, 'random_strength': 86.69768770901611, 'min_data_in_leaf': 195.60397604666105}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:33,542] Trial 68 finished with value: 0.8388542625830762 and parameters: {'learning_rate': 0.37734226459157394, 'max_depth': 7, 'l2_leaf_reg': 1.84386701582687, 'subsample': 0.5416787382022203, 'random_strength': 102.93491264871817, 'min_data_in_leaf': 100.73573695821503}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:33,798] Trial 69 finished with value: 0.8491079677520357 and parameters: {'learning_rate': 0.2396158685166121, 'max_depth': 3, 'l2_leaf_reg': 1.3497075412335753, 'subsample': 0.41533858440585675, 'random_strength': 9.011802604306009, 'min_data_in_leaf': 244.2984153117823}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:34,249] Trial 70 finished with value: 0.8551000923882279 and parameters: {'learning_rate': 0.1405771071007593, 'max_depth': 2, 'l2_leaf_reg': 1.6670425816135785, 'subsample': 0.5628243022632802, 'random_strength': 43.63893059165331, 'min_data_in_leaf': 265.0039335083252}. Best is trial 63 with value: 0.8579966715559936. 
[I 2024-08-14 03:47:34,463] Trial 71 finished with value: 0.8369964725896929 and parameters: {'learning_rate': 0.1346261793235442, 'max_depth': 2, 'l2_leaf_reg': 1.69505852020515, 'subsample': 0.5662727341567123, 'random_strength': 18.17730751314747, 'min_data_in_leaf': 261.9534789983721}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:34,721] Trial 72 finished with value: 0.8321156287257981 and parameters: {'learning_rate': 0.07724194746378436, 'max_depth': 2, 'l2_leaf_reg': 1.6529774456888673, 'subsample': 0.5130735404504633, 'random_strength': 43.04440541458727, 'min_data_in_leaf': 271.8748406602916}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:34,936] Trial 73 finished with value: 0.8534474381932009 and parameters: {'learning_rate': 0.2193541378472138, 'max_depth': 2, 'l2_leaf_reg': 1.55127358733355, 'subsample': 0.45844426802461946, 'random_strength': 1.2277571480166767, 'min_data_in_leaf': 323.54337938841485}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:35,129] Trial 74 finished with value: 0.8245140363784432 and parameters: {'learning_rate': 0.17096273713989574, 'max_depth': 3, 'l2_leaf_reg': 1.4299055079718532, 'subsample': 0.6103423491223143, 'random_strength': 35.52778769358011, 'min_data_in_leaf': 309.4874282846581}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:35,471] Trial 75 finished with value: 0.849687900535358 and parameters: {'learning_rate': 0.1424757474479632, 'max_depth': 2, 'l2_leaf_reg': 1.9040100631862928, 'subsample': 0.49178296476749883, 'random_strength': 29.861178645449915, 'min_data_in_leaf': 288.4169959906686}. Best is trial 63 with value: 0.8579966715559936. 
[I 2024-08-14 03:47:35,653] Trial 76 finished with value: 0.8166471556302065 and parameters: {'learning_rate': 0.10900450958876559, 'max_depth': 3, 'l2_leaf_reg': 1.1584526059635818, 'subsample': 0.5436945071994185, 'random_strength': 97.11608635264365, 'min_data_in_leaf': 252.08827881313073}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:36,111] Trial 77 finished with value: 0.8529130054553784 and parameters: {'learning_rate': 0.26344618455418334, 'max_depth': 2, 'l2_leaf_reg': 1.8036914133857063, 'subsample': 0.5759920935180337, 'random_strength': 149.97345964123704, 'min_data_in_leaf': 230.53437749650683}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:36,464] Trial 78 finished with value: 0.8554301605149063 and parameters: {'learning_rate': 0.30386516642561456, 'max_depth': 2, 'l2_leaf_reg': 1.5941173743706698, 'subsample': 0.388487385094279, 'random_strength': 12.04964504414562, 'min_data_in_leaf': 340.15331434948547}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:36,925] Trial 79 finished with value: 0.8556607454912538 and parameters: {'learning_rate': 0.3043185361985488, 'max_depth': 3, 'l2_leaf_reg': 1.6062543081685263, 'subsample': 0.38163010145401816, 'random_strength': 10.801512721309756, 'min_data_in_leaf': 336.9094072537885}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:37,511] Trial 80 finished with value: 0.8507459693900372 and parameters: {'learning_rate': 0.2967690808832415, 'max_depth': 4, 'l2_leaf_reg': 1.5769102540118054, 'subsample': 0.3683396074299298, 'random_strength': 119.50189785234694, 'min_data_in_leaf': 352.7530072728084}. Best is trial 63 with value: 0.8579966715559936. 
[I 2024-08-14 03:47:37,788] Trial 81 finished with value: 0.8495382902162564 and parameters: {'learning_rate': 0.3519982264605916, 'max_depth': 3, 'l2_leaf_reg': 1.491074642500498, 'subsample': 0.39339814284208235, 'random_strength': 9.046452450187513, 'min_data_in_leaf': 338.6181638900527}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:38,153] Trial 82 finished with value: 0.8492305865187221 and parameters: {'learning_rate': 0.442607098792248, 'max_depth': 2, 'l2_leaf_reg': 1.7262120187355963, 'subsample': 0.27234819741871114, 'random_strength': 12.955991756657548, 'min_data_in_leaf': 381.46339177847165}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:38,943] Trial 83 finished with value: 0.8509557323116645 and parameters: {'learning_rate': 0.18048848769077014, 'max_depth': 9, 'l2_leaf_reg': 1.6096921762481307, 'subsample': 0.31232826961375537, 'random_strength': 27.6479977169267, 'min_data_in_leaf': 362.9959768769905}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:39,536] Trial 84 finished with value: 0.8530302259115817 and parameters: {'learning_rate': 0.22708933829872321, 'max_depth': 3, 'l2_leaf_reg': 1.6632316572246608, 'subsample': 0.4206405376875545, 'random_strength': 20.498316514413297, 'min_data_in_leaf': 297.40518856402844}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:40,270] Trial 85 finished with value: 0.844975946670862 and parameters: {'learning_rate': 0.3134824967461136, 'max_depth': 8, 'l2_leaf_reg': 1.458604426862953, 'subsample': 0.46636325980991755, 'random_strength': 172.65958600142233, 'min_data_in_leaf': 308.2224094733233}. Best is trial 63 with value: 0.8579966715559936. 
[I 2024-08-14 03:47:40,626] Trial 86 finished with value: 0.853733548648803 and parameters: {'learning_rate': 0.20516007929387697, 'max_depth': 4, 'l2_leaf_reg': 1.8081881256255257, 'subsample': 0.3669610320292627, 'random_strength': 71.7353570023819, 'min_data_in_leaf': 349.71334841544535}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:40,844] Trial 87 finished with value: 0.847503898351356 and parameters: {'learning_rate': 0.3850725922826087, 'max_depth': 2, 'l2_leaf_reg': 1.3738184718048245, 'subsample': 0.4394316413220218, 'random_strength': 13.073971750314525, 'min_data_in_leaf': 319.76257608276416}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:41,130] Trial 88 finished with value: 0.8509834950512916 and parameters: {'learning_rate': 0.2486495279955683, 'max_depth': 3, 'l2_leaf_reg': 1.3060721511513962, 'subsample': 0.5036611916707044, 'random_strength': 6.582515968741947, 'min_data_in_leaf': 288.24916069068973}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:41,464] Trial 89 finished with value: 0.8411153834882648 and parameters: {'learning_rate': 0.05111799564446984, 'max_depth': 2, 'l2_leaf_reg': 1.537356157859457, 'subsample': 0.3880066956376303, 'random_strength': 4.295324207554229, 'min_data_in_leaf': 266.43818936484587}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:41,615] Trial 90 finished with value: 0.809689504604759 and parameters: {'learning_rate': 0.14849530714748765, 'max_depth': 2, 'l2_leaf_reg': 1.604654407357084, 'subsample': 0.6488514855233791, 'random_strength': 109.07773151115863, 'min_data_in_leaf': 325.72222318586404}. Best is trial 63 with value: 0.8579966715559936. 
[I 2024-08-14 03:47:41,889] Trial 91 finished with value: 0.8535808535808537 and parameters: {'learning_rate': 0.2768348084629538, 'max_depth': 2, 'l2_leaf_reg': 1.7020206569000091, 'subsample': 0.5270113119832779, 'random_strength': 19.279543505451198, 'min_data_in_leaf': 390.8245718531585}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:42,108] Trial 92 finished with value: 0.8445248021519208 and parameters: {'learning_rate': 0.2980320322130359, 'max_depth': 2, 'l2_leaf_reg': 0.9413144021105265, 'subsample': 0.18491093410744158, 'random_strength': 23.462444073545914, 'min_data_in_leaf': 371.68996169817217}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:42,275] Trial 93 finished with value: 0.8438677506474116 and parameters: {'learning_rate': 0.8018049022359041, 'max_depth': 3, 'l2_leaf_reg': 0.7207307279562393, 'subsample': 0.4817377653790264, 'random_strength': 54.11215711265974, 'min_data_in_leaf': 247.83798275985313}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:42,629] Trial 94 finished with value: 0.8526831916662426 and parameters: {'learning_rate': 0.18976503763926902, 'max_depth': 5, 'l2_leaf_reg': 1.0778005788536056, 'subsample': 0.42051153997605584, 'random_strength': 15.934790667198342, 'min_data_in_leaf': 418.39615211072214}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:42,894] Trial 95 finished with value: 0.8465136939713211 and parameters: {'learning_rate': 0.3330384372789757, 'max_depth': 2, 'l2_leaf_reg': 1.9197824427277657, 'subsample': 0.5366007428124753, 'random_strength': 44.67016583545405, 'min_data_in_leaf': 200.40708572900584}. Best is trial 63 with value: 0.8579966715559936. 
[I 2024-08-14 03:47:43,205] Trial 96 finished with value: 0.8505886471988167 and parameters: {'learning_rate': 0.26241859653774113, 'max_depth': 3, 'l2_leaf_reg': 0.6304145123151321, 'subsample': 0.4542962823423417, 'random_strength': 37.17578672829132, 'min_data_in_leaf': 399.3636999400437}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:43,443] Trial 97 finished with value: 0.8340196899518935 and parameters: {'learning_rate': 0.09242477857682523, 'max_depth': 2, 'l2_leaf_reg': 0.0645769669423184, 'subsample': 0.5820960690872515, 'random_strength': 11.65162235424116, 'min_data_in_leaf': 358.5938499475734}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:43,772] Trial 98 finished with value: 0.8560116356726526 and parameters: {'learning_rate': 0.2238295228087367, 'max_depth': 2, 'l2_leaf_reg': 1.4538002550468356, 'subsample': 0.6177874774515071, 'random_strength': 26.070466696158935, 'min_data_in_leaf': 336.943504306358}. Best is trial 63 with value: 0.8579966715559936. [I 2024-08-14 03:47:44,246] Trial 99 finished with value: 0.8533078533078534 and parameters: {'learning_rate': 0.12342253771037937, 'max_depth': 3, 'l2_leaf_reg': 1.446914784821623, 'subsample': 0.6193773627486576, 'random_strength': 32.4812721291412, 'min_data_in_leaf': 336.8229861132391}. Best is trial 63 with value: 0.8579966715559936.
Number of finished trials: 100
Best trial:
  Value: 0.8579966715559936
  Params:
    learning_rate: 0.2275828875803787
    max_depth: 2
    l2_leaf_reg: 1.649969257251929
    subsample: 0.41432801712540723
    random_strength: 9.871440937029153
    min_data_in_leaf: 339.4206240957627
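# Refit CatBoost with the best hyperparameters found by Optuna
best_cat = CatBoostClassifier(**study.best_params, random_state=random_state)
best_cat.fit(
    X_train, y_train,
    eval_set=(X_valid, y_valid),  # validation set monitored for early stopping
    verbose=False,
    early_stopping_rounds=10,  # stop after 10 rounds without improvement on eval_set
)
# Precision, recall, F1 and ROC-AUC on the train / valid / test splits
res_cat = print_metrics(best_cat, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_cat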
split | precision | recall | f1 | roc_auc |
---|---|---|---|---|
train | 0.786 | 0.468 | 0.587 | 0.877 |
valid | 0.825 | 0.464 | 0.594 | 0.855 |
test | 0.754 | 0.442 | 0.557 | 0.856 |
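As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall, so the reported `f1` column can be recomputed from the other two. The sketch below uses only the rounded values from the table; the helper name `f1_from` is made up for this illustration, and a difference of up to 0.001 would be expected from rounding alone.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported f1) copied from the metrics table
reported = {
    "train": (0.786, 0.468, 0.587),
    "valid": (0.825, 0.464, 0.594),
    "test": (0.754, 0.442, 0.557),
}

recomputed = {}
for split, (p, r, f1) in reported.items():
    recomputed[split] = round(f1_from(p, r), 3)
    print(f"{split}: reported f1={f1}, recomputed f1={recomputed[split]}")
```

The modest F1 comes from recall: at the default decision threshold the model finds fewer than half of the churned customers, even though ROC-AUC stays around 0.86.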
Interpreting Feature Importance
For gradient boosting models, SHAP values can be used to interpret feature importance. In the SHAP beeswarm plot:
- The higher a feature appears on the plot, the more important it is
- The redder a point, the higher the value of that feature
SHAP value
- SHAP values represent the contribution of each feature to the model's prediction.
- Positive SHAP values indicate that the feature pushes the prediction up (toward the positive class, i.e. customer churn); negative values push it down
Distribution of Points
- The distribution of points for each feature shows how much that feature's effect on predictions varies across individual samples.
- A wider distribution indicates that the feature's effect differs depending on other factors.
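import shap

# TreeExplainer computes SHAP values for tree ensembles such as CatBoost
explainer = shap.TreeExplainer(best_cat)
shap_values = explainer(X_train_valid)
# Beeswarm plot: one point per sample and feature, colored by the feature's value
shap.plots.beeswarm(shap_values)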
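To make the idea of a "contribution" concrete, here is a minimal sketch on a toy linear model, not on the CatBoost model from this notebook. It assumes independent features, in which case the SHAP value of feature j for a row x is coef[j] * (x[j] - mean[j]). The data, the coefficients, and the column names `age` and `is_active` are all hypothetical. The sketch checks two properties: the base value plus the contributions reproduces the prediction, and the mean absolute SHAP value gives a global importance score (to my understanding, this is the quantity the beeswarm plot sorts features by when no other order is given).

```python
# Hypothetical data: columns are (age, is_active)
rows = [
    [30.0, 1.0],
    [40.0, 0.0],
    [50.0, 1.0],
    [60.0, 0.0],
]
coef = [0.02, -0.5]   # made-up linear model weights
intercept = -1.0

def predict(x):
    """Raw score of the toy linear model."""
    return intercept + sum(c * v for c, v in zip(coef, x))

n = len(rows)
means = [sum(r[j] for r in rows) / n for j in range(len(coef))]
# For a linear model the average prediction equals the prediction at the mean
base_value = predict(means)

def linear_shap(x):
    """Per-feature contributions under the independence assumption."""
    return [c * (v - m) for c, v, m in zip(coef, x, means)]

shap_rows = [linear_shap(r) for r in rows]

# Additivity: base value + sum of contributions == model output
for r, s in zip(rows, shap_rows):
    assert abs(base_value + sum(s) - predict(r)) < 1e-9

# Global importance: mean absolute contribution per feature
importance = [sum(abs(s[j]) for s in shap_rows) / n for j in range(len(coef))]
print(dict(zip(["age", "is_active"], importance)))
```

In this toy data `is_active` ends up with the larger mean absolute contribution, so it would appear above `age` in an importance ranking, even though `age` has a much larger numeric range.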