The Look-alike Task¶

1 | Introduction¶

The look-alike task¶

The goal of a look-alike task is to find users who resemble a given target group.

This makes it possible to segment users and take follow-up actions based on that information.

Churn prediction¶

We will build a model that identifies customers who are likely to leave.

We solve the look-alike task using the example of finding the segment of customers prone to churning from a bank.
The dataset contains historical data on customers who left the bank (the target segment), as well as data on those who stayed.
For any other customer in the test set, we need to estimate the probability of (propensity to) churn.

Objective: build a classification model with the highest possible ROC-AUC.
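
ROC-AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, which is why it suits a task where customers are ranked by propensity. The sketch below computes that pairwise fraction by hand; the four labels and scores are made up for illustration and are not taken from the dataset. For binary labels, `roc_auc_score` from scikit-learn reports the same quantity.

```python
from itertools import product

# Illustrative toy data (not from the churn dataset)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
pairs = list(product(pos, neg))
pairwise_auc = sum(
    1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs
) / len(pairs)

print(pairwise_auc)
```

Three of the four pairs are ordered correctly here, so the value is 0.75.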

In [ ]:

import pandas as pd
import numpy as np
import seaborn as sns
import random
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.utils import shuffle

import warnings
warnings.filterwarnings("ignore")

# Display all columns and rows when printing DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Fix random_state once as a constant
random_state = 47

sns.set(style="whitegrid")

2 | Loading the data¶

In [ ]:

churn = pd.read_csv('/content/Churn_Modelling.csv')
print(churn.shape)
churn.head()

(10000, 14)

Out[ ]:

	RowNumber	CustomerId	Surname	CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
0	1	15634602	Hargrave	619	France	Female	42	2	0.00	1	1	1	101348.88	1
1	2	15647311	Hill	608	Spain	Female	41	1	83807.86	1	0	1	112542.58	0
2	3	15619304	Onio	502	France	Female	42	8	159660.80	3	1	0	113931.57	1
3	4	15701354	Boni	699	France	Female	39	1	0.00	2	0	0	93826.63	0
4	5	15737888	Mitchell	850	Spain	Female	43	2	125510.82	1	1	1	79084.10	0

Features¶

RowNumber: row index in the data
CustomerId: unique customer identifier
Surname: surname
CreditScore: credit score
Geography: country of residence
Gender: gender
Age: age
Tenure: number of years the person has been a customer of the bank
Balance: account balance
NumOfProducts: number of bank products the customer uses
HasCrCard: whether the customer has a credit card
IsActiveMember: whether the customer is active
EstimatedSalary: estimated salary

Target¶

Exited: whether the customer left (1 = churned)

Drop the identifier columns right away so that the models do not overfit to individual customers:

In [ ]:

churn = churn.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
churn.head()

Out[ ]:

	CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
0	619	France	Female	42	2	0.00	1	1	1	101348.88	1
1	608	Spain	Female	41	1	83807.86	1	0	1	112542.58	0
2	502	France	Female	42	8	159660.80	3	1	0	113931.57	1
3	699	France	Female	39	1	0.00	2	0	0	93826.63	0
4	850	Spain	Female	43	2	125510.82	1	1	1	79084.10	0

3 | Data visualization¶

In [ ]:

# Create a 4x2 grid of subplots
fig, axes = plt.subplots(4, 2, figsize=(10, 13))

# Histogram of credit scores
sns.histplot(data=churn, x='CreditScore', kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Credit Scores')

# Histogram of account balance
sns.histplot(data=churn, x='Balance', kde=True, ax=axes[0, 1])
axes[0, 1].set_title('Distribution of Balance')

# Histogram of estimated salary
sns.histplot(data=churn, x='EstimatedSalary', kde=True, ax=axes[1, 0])
axes[1, 0].set_title('Distribution of EstimatedSalary')

# Histogram of age
sns.histplot(data=churn, x='Age', kde=True, ax=axes[1, 1])
axes[1, 1].set_title('Distribution of Age')

# Box plot of balance by country
sns.boxplot(x='Geography', y='Balance', data=churn, ax=axes[2, 0])
axes[2, 0].set_title('Balance by Geography')

# Box plot of salary by gender
sns.boxplot(x='Gender', y='EstimatedSalary', data=churn, ax=axes[2, 1])
axes[2, 1].set_title('Estimated Salary by Gender')

# Pie chart of the Exited column (share of churned customers)
exited_counts = churn['Exited'].value_counts()
axes[3, 0].pie(exited_counts, labels=exited_counts.index, autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff'])
axes[3, 0].set_title('Exited Distribution')

# Active and inactive members by country
sns.countplot(x='Geography', hue='IsActiveMember', data=churn, ax=axes[3, 1])
axes[3, 1].set_title('Active Members by Geography')

# Adjust the layout and show the figure
plt.tight_layout()
plt.show()


4 | Preprocessing¶

Encoding categorical features¶

The data contain both categorical and numerical features.
We apply one-hot encoding to the whole dataset and use the encoded data for all models.

We will then train the following models:

Logistic regression
SVM
Decision tree
Random forest
Gradient boosting

In [ ]:

# One-hot encoding for logistic regression (also works for tree-based models)
churn = pd.get_dummies(churn, drop_first=True)
print(churn.shape)
churn.head()

(10000, 12)

Out[ ]:

	CreditScore	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited	Geography_Germany	Geography_Spain	Gender_Male
0	619	42	2	0.00	1	1	1	101348.88	1	False	False	False
1	608	41	1	83807.86	1	0	1	112542.58	0	False	True	False
2	502	42	8	159660.80	3	1	0	113931.57	1	False	False	False
3	699	39	1	0.00	2	0	0	93826.63	0	False	False	False
4	850	43	2	125510.82	1	1	1	79084.10	0	False	True	False
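
To see what `drop_first=True` does, here is a minimal sketch on a hypothetical four-row frame that reuses the dataset's column names. Dropping the first category of each encoded column avoids perfectly collinear dummy columns, which matters for linear models; the dropped category becomes the implicit baseline.

```python
import pandas as pd

# Illustrative frame (hypothetical rows, same column names as the dataset)
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Age': [42, 41, 39, 43],
})

# drop_first=True drops the alphabetically first category of each
# encoded column ('France' and 'Female'), which become the baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

A row with both `Geography_Germany` and `Geography_Spain` set to False therefore corresponds to France.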

5 | Train/Test splits¶

Split the data into subsets:

In [ ]:

features = churn.drop(['Exited'], axis=1)
target = churn['Exited']

In [ ]:

# class sizes
target.value_counts()

Out[ ]:

	count
Exited
0	7963
1	2037

dtype: int64

In [ ]:

target.mean()

Out[ ]:

0.2037

In [ ]:

# hold out 20% (one fifth of the data) as the test set
X_train_valid, X_test, y_train_valid, y_test = train_test_split(
    features, target, test_size=0.2, random_state=random_state)
# hold out 25% of train+valid (one quarter) as the validation set
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_valid, y_train_valid, test_size=0.25, random_state=random_state)

s1 = y_train.size
s2 = y_valid.size
s3 = y_test.size
print('Split into train:valid:test in the ratio '
      + str(round(s1/s3)) + ':' + str(round(s2/s3)) + ':' + str(round(s3/s3)))

Split into train:valid:test in the ratio 3:1:1

In [ ]:

splits = [y_train, y_train_valid, y_valid, y_test]
names = ['train:', 'train+valid:', 'valid:', 'test:']
print('Class balance across the splits:\n')
# iterate with fresh names so the `target` Series defined earlier is not overwritten
for name, y in zip(names, splits):
    print(name, y.mean().round(3))

Class balance across the splits:

train: 0.201
train+valid: 0.202
valid: 0.204
test: 0.212
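
The share of churners drifts a little between splits (0.201 on train, 0.212 on test) because the split is purely random. If a tighter match is wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic labels with a 20% positive rate, not the churn data, and is only an illustration of the option, not what this notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 20% positive rate (illustrative, not the churn data)
y = np.array([1] * 200 + [0] * 800)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions (nearly) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=47)

print(y_tr.mean(), y_te.mean())
```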

6 | Scaling¶

Some of the models chosen for this study are linear, and the quality of linear algorithms depends on the scale of the data, so the features should be normalized.
If one feature's scale greatly exceeds that of the others, quality can drop sharply.
We normalize by standardizing: for each feature, take its values over all objects and compute their mean and standard deviation.
Then subtract the mean from every value and divide the difference by the standard deviation. StandardScaler() does exactly this.
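
As a quick check of that description, the sketch below standardizes a made-up single-feature array by hand and compares the result with `StandardScaler`. The four values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature data
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual standardization: subtract the mean, divide by the standard deviation
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also what np.std computes by default
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
```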

In [ ]:

# Numerical features to standardize
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary']

# Fit the standardization parameters on the training split only
scaler = StandardScaler()
scaler.fit(X_train[numeric])

# Transform every split with the parameters obtained above
X_train[numeric] = scaler.transform(X_train[numeric])
X_valid[numeric] = scaler.transform(X_valid[numeric])
X_test[numeric] = scaler.transform(X_test[numeric])

# Note: fit_transform refits the scaler here, so train+valid is standardized
# with its own mean and std rather than the train-only parameters used above
X_train_valid[numeric] = scaler.fit_transform(X_train_valid[numeric])

In [ ]:

X_train[numeric].describe().round(3)

Out[ ]:

	CreditScore	Age	Tenure	Balance	EstimatedSalary
count	6000.000	6000.000	6000.000	6000.000	6000.000
mean	0.000	0.000	0.000	-0.000	-0.000
std	1.000	1.000	1.000	1.000	1.000
min	-3.136	-1.994	-1.730	-1.236	-1.753
25%	-0.690	-0.663	-0.695	-1.236	-0.860
50%	0.007	-0.187	-0.004	0.335	0.009
75%	0.693	0.478	1.031	0.821	0.855
max	2.067	5.042	1.721	2.807	1.729

The scaled values of the selected numerical features are no longer in their natural units,
but each feature now has zero mean and a standard deviation of 1.

In [ ]:

# Compute precision, recall, F1 and ROC-AUC for a fitted model on each split
def calc_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test):

    # Train split
    y_train_pred = model.predict(X_train)
    y_train_proba = model.predict_proba(X_train)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_train)

    # Validation split
    y_valid_pred = model.predict(X_valid)
    y_valid_proba = model.predict_proba(X_valid)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_valid)

    # Test split
    y_test_pred = model.predict(X_test)
    y_test_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_test)

    train_metrics = {
        'precision': precision_score(y_train, y_train_pred),
        'recall': recall_score(y_train, y_train_pred),
        'f1': f1_score(y_train, y_train_pred),
        'roc_auc': roc_auc_score(y_train, y_train_proba)
    }

    valid_metrics = {
        'precision': precision_score(y_valid, y_valid_pred),
        'recall': recall_score(y_valid, y_valid_pred),
        'f1': f1_score(y_valid, y_valid_pred),
        'roc_auc': roc_auc_score(y_valid, y_valid_proba)
    }

    test_metrics = {
        'precision': precision_score(y_test, y_test_pred),
        'recall': recall_score(y_test, y_test_pred),
        'f1': f1_score(y_test, y_test_pred),
        'roc_auc': roc_auc_score(y_test, y_test_proba)
    }

    return train_metrics, valid_metrics, test_metrics

def print_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test):
    res = calc_metrics(model, X_train, y_train, X_valid, y_valid, X_test, y_test)
    metrics = pd.DataFrame(res, index=['train', 'valid', 'test']).round(3)
    return metrics

7 | Model training¶

Logistic regression¶

Hyperparameter optimization

We use an exhaustive search over a parameter grid with GridSearchCV.

In [ ]:

%%time

# Exhaustive search over every combination in the grid;
# the best model is selected by ROC-AUC.

param_grid = {
    # Regularization type. The default 'lbfgs' solver supports only 'l2';
    # 'l1' and 'elasticnet' need solver='saga' (plus l1_ratio for 'elasticnet'),
    # so with the default solver those fits fail and are never selected.
    'penalty': ['l2'],
    # 'solver': ['lbfgs', 'liblinear', 'saga'],
    'C': np.linspace(0.001, 2, 50)  # inverse regularization strength: smaller C means stronger regularization
}

grid_search = GridSearchCV(LogisticRegression(random_state=42),
                           param_grid,
                           cv=5,
                           scoring='roc_auc',
                           n_jobs=-1)

grid_search.fit(X_train_valid, y_train_valid)

best_logreg = grid_search.best_estimator_
print(best_logreg.get_params())

{'C': 0.04179591836734694, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
CPU times: user 632 ms, sys: 143 ms, total: 774 ms
Wall time: 7.9 s

In [ ]:

# best_logreg was refit on all of train+valid, so only the test row is out-of-sample
res_logreg = print_metrics(best_logreg, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_logreg

Out[ ]:

	precision	recall	f1	roc_auc
train	0.615	0.201	0.303	0.774
valid	0.617	0.174	0.272	0.765
test	0.644	0.201	0.306	0.754

In [ ]:

# Model coefficients
coef = best_logreg.coef_[0]

# DataFrame pairing each feature with its coefficient
coef_df = pd.DataFrame({
    'Feature': X_train_valid.columns,
    'Coefficient_logreg': coef.round(3)
})

# Add a column with the sign of each coefficient
coef_df['Interpretation_logreg'] = coef_df['Coefficient_logreg'].apply(lambda x: 'Positive' if x > 0 else 'Negative')

Interpreting feature importance

For a linear model, feature importance can be judged from the magnitude of the weights, provided the features are on comparable scales.

Positive: a positive coefficient means that increasing the feature's value increases the probability of the positive outcome (here, churn).
Negative: a negative coefficient means that increasing the feature's value decreases the probability of the positive outcome.

In [ ]:

coef_df

Out[ ]:

	Feature	Coefficient_logreg	Interpretation_logreg
0	CreditScore	-0.064	Negative
1	Age	0.738	Positive
2	Tenure	-0.032	Negative
3	Balance	0.169	Positive
4	NumOfProducts	-0.128	Negative
5	HasCrCard	-0.019	Negative
6	IsActiveMember	-0.975	Negative
7	EstimatedSalary	0.047	Positive
8	Geography_Germany	0.707	Positive
9	Geography_Spain	0.021	Positive
10	Gender_Male	-0.505	Negative
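
A logistic regression coefficient acts on the log-odds, so exponentiating it gives the factor by which the odds of churn change when the feature increases by one unit with the others held fixed. The sketch below does this arithmetic for three coefficients copied from the table above. For Age, which was standardized, "one unit" means one standard deviation.

```python
import math

# Coefficients copied from the table above
coefficients = {
    'Age': 0.738,
    'IsActiveMember': -0.975,
    'Geography_Germany': 0.707,
}

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit increase in the feature, other features held fixed
odds_ratios = {name: math.exp(c) for name, c in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: {ratio:.2f}')
```

Under this reading, being an active member multiplies the odds of churn by roughly 0.38, while one standard deviation of extra age roughly doubles them.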

Support vector machine¶

In [ ]:

%%time

# - We assume the data may contain nonlinearity, so we compare a linear kernel with a polynomial one.
# - probability=True enables predict_proba, so the model outputs class probabilities.
#
# The search selects 'kernel': 'poly', i.e. the polynomial kernel scores better
# in cross-validation, which suggests the data contain nonlinear structure.

param_grid = {
    # 'C': np.linspace(0.001, 2, 50),
    # 'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'kernel': ['linear', 'poly']
}

grid_search = GridSearchCV(SVC(random_state=42, probability=True, gamma='scale'),
                           param_grid,
                           cv=5,
                           scoring='roc_auc',
                           n_jobs=-1)

grid_search.fit(X_train_valid, y_train_valid)

best_SVC = grid_search.best_estimator_
print(best_SVC.get_params())

{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'poly', 'max_iter': -1, 'probability': True, 'random_state': 42, 'shrinking': True, 'tol': 0.001, 'verbose': False}
CPU times: user 9.52 s, sys: 305 ms, total: 9.82 s
Wall time: 52.1 s

In [ ]:

res_SVC = print_metrics(best_SVC, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_SVC

Out[ ]:

	precision	recall	f1	roc_auc
train	0.856	0.247	0.383	0.825
valid	0.885	0.246	0.385	0.841
test	0.850	0.241	0.376	0.812

Interpreting feature importance

Permutation Importance

The weights of a polynomial-kernel model cannot be interpreted the way linear weights can,
so we use a different approach: permutation importance.
It measures how much the fitted model relies on each feature:
- Take a feature h
- Shuffle all the values in that column, which breaks its relationship with the target
- Score the already fitted model on the data with the corrupted feature (the model is not retrained)
- Record the resulting score
- Repeat for every other column
- The more the score drops when a column is shuffled, the larger that feature's contribution to the model
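
The steps above can be sketched by hand. This is a minimal illustration on synthetic data, where only the first column determines the label; the data, the logistic regression model and the use of accuracy as the score are assumptions made for the example, not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: only column 0 determines the label, columns 1 and 2 are noise
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)
baseline = model.score(X, y)  # accuracy of the fitted model

drops = []
for col in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, col] = rng.permutation(X_perm[:, col])  # shuffle one column
    drops.append(baseline - model.score(X_perm, y))   # re-score, no retraining

print(np.round(drops, 3))
```

Shuffling column 0 should cost far more accuracy than shuffling either noise column. `sklearn.inspection.permutation_importance` automates the same loop and repeats each shuffle several times to average out randomness.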

In [ ]:

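from sklearn.inspection import permutation_importance

# Score with ROC-AUC, the metric used for model selection; fix the seed so the shuffles are reproducible
perm_importance = permutation_importance(best_SVC, X_test, y_test,
                                         scoring='roc_auc', random_state=random_state)

# Separate name so the `features` DataFrame defined earlier is not overwritten
feature_names = np.array(X_test.columns)

sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance");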

Decision tree¶

Parameters

criterion : the function used to measure the quality of a split (the impurity measure)
max_depth : maximum depth of the tree
min_samples_leaf : minimum number of samples required in a leaf node
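
To make `criterion` concrete, here is a small worked computation of the two impurity measures for the class proportions of this dataset (79.63% stayed, 20.37% churned). A tree chooses the split that reduces the chosen impurity the most; this sketch only evaluates the impurity of the unsplit root node.

```python
import math

# Class proportions in the dataset: 79.63% stayed, 20.37% churned
p = [0.7963, 0.2037]

# Gini impurity: 1 minus the sum of squared class proportions
gini = 1 - sum(q ** 2 for q in p)

# Shannon entropy in bits: minus the sum of q * log2(q)
entropy = -sum(q * math.log2(q) for q in p)

print(round(gini, 3), round(entropy, 3))
```

Both measures are zero for a pure node and largest for a 50/50 node (0.5 for Gini, 1 bit for entropy), so the root here is moderately impure under either criterion.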

In [ ]:

%%time

param_grid = {
    'criterion': ['gini', 'entropy', 'log_loss'],  # in scikit-learn 'entropy' and 'log_loss' are the same criterion
    # 'splitter': ['best', 'random'],
    'max_depth': range(1, 11),
    # 'min_samples_split': range(2, 10),
    'min_samples_leaf': range(2, 10)
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           param_grid,
                           cv=5,
                           scoring='roc_auc',
                           n_jobs=-1)

grid_search.fit(X_train_valid, y_train_valid)

best_tree = grid_search.best_estimator_
print(best_tree.get_params())

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 6, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 9, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': 42, 'splitter': 'best'}
CPU times: user 1.17 s, sys: 127 ms, total: 1.29 s
Wall time: 31.1 s

In [ ]:

res_tree = print_metrics(best_tree, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_tree

Out[ ]:

	precision	recall	f1	roc_auc
train	0.767	0.485	0.594	0.868
valid	0.813	0.459	0.587	0.869
test	0.744	0.440	0.553	0.841

In [ ]:

def plot_importance(model, X):

    # Impurity-based feature importances of the fitted model
    feature_importances = model.feature_importances_

    # DataFrame pairing each feature with its importance
    importance_df = pd.DataFrame({
        'Feature': X.columns,
        'Importance': feature_importances
    })

    # Sort by importance, most important first
    importance_df = importance_df.sort_values(by='Importance', ascending=False)

    # Plot; the title names the model, since the function is reused for the forest
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Importance', y='Feature', data=importance_df)
    plt.title(f'Feature Importances in {type(model).__name__}')
    plt.show()

In [ ]:

plot_importance(best_tree, X_train_valid)

Random forest¶

Hyperparameter optimization

We tune the parameters with Optuna, a hyperparameter optimization framework.
Its default sampler (TPE) performs a form of Bayesian optimization.

In [ ]:

!pip install optuna -qqq

In [ ]:

import optuna

# Objective function for Optuna to maximize
def objective(trial):

    # random forest hyperparameters
    param = {
        'criterion': trial.suggest_categorical("criterion", ['gini', 'entropy', 'log_loss']),
        'n_estimators': trial.suggest_int("n_estimators", 10, 100),
        "max_depth": trial.suggest_int("max_depth", 2, 10), # limit the depth of each tree
        'min_samples_split': trial.suggest_int("min_samples_split", 2, 10),
        'min_samples_leaf': trial.suggest_int("min_samples_leaf", 2, 10),

    }

    model = RandomForestClassifier(**param, random_state=random_state)
    model.fit(X_train, y_train)

    preds = model.predict_proba(X_valid)[:,1]
    auc = roc_auc_score(y_valid, preds)

    return auc

In [ ]:

# maximize ROC-AUC
study = optuna.create_study(direction="maximize",
                            study_name='RandomForestClassifier')
study.optimize(objective, n_trials=10) # consider increasing n_trials

print("Number of finished trials: {}".format(len(study.trials)))

print("Best trial:")
trial = study.best_trial

print("Value: {}".format(trial.value))

print("Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

[I 2024-08-14 03:46:14,901] A new study created in memory with name: RandomForestClassifier
[I 2024-08-14 03:46:15,008] Trial 0 finished with value: 0.8334212486754861 and parameters: {'criterion': 'log_loss', 'n_estimators': 15, 'max_depth': 5, 'min_samples_split': 8, 'min_samples_leaf': 8}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:15,213] Trial 1 finished with value: 0.8172085799204443 and parameters: {'criterion': 'gini', 'n_estimators': 38, 'max_depth': 2, 'min_samples_split': 8, 'min_samples_leaf': 10}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:15,408] Trial 2 finished with value: 0.8193139210088363 and parameters: {'criterion': 'log_loss', 'n_estimators': 17, 'max_depth': 3, 'min_samples_split': 5, 'min_samples_leaf': 9}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:16,070] Trial 3 finished with value: 0.8264767078326399 and parameters: {'criterion': 'log_loss', 'n_estimators': 61, 'max_depth': 2, 'min_samples_split': 3, 'min_samples_leaf': 6}. Best is trial 0 with value: 0.8334212486754861.
[I 2024-08-14 03:46:16,818] Trial 4 finished with value: 0.8437150555794624 and parameters: {'criterion': 'log_loss', 'n_estimators': 53, 'max_depth': 5, 'min_samples_split': 2, 'min_samples_leaf': 7}. Best is trial 4 with value: 0.8437150555794624.
[I 2024-08-14 03:46:18,630] Trial 5 finished with value: 0.8498282566079176 and parameters: {'criterion': 'gini', 'n_estimators': 68, 'max_depth': 9, 'min_samples_split': 6, 'min_samples_leaf': 9}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:18,932] Trial 6 finished with value: 0.8170481729803764 and parameters: {'criterion': 'log_loss', 'n_estimators': 13, 'max_depth': 3, 'min_samples_split': 9, 'min_samples_leaf': 7}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:19,959] Trial 7 finished with value: 0.8418210197871214 and parameters: {'criterion': 'log_loss', 'n_estimators': 65, 'max_depth': 4, 'min_samples_split': 9, 'min_samples_leaf': 5}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:20,197] Trial 8 finished with value: 0.8422798761781811 and parameters: {'criterion': 'log_loss', 'n_estimators': 15, 'max_depth': 6, 'min_samples_split': 8, 'min_samples_leaf': 4}. Best is trial 5 with value: 0.8498282566079176.
[I 2024-08-14 03:46:21,125] Trial 9 finished with value: 0.8426600714736308 and parameters: {'criterion': 'log_loss', 'n_estimators': 96, 'max_depth': 4, 'min_samples_split': 5, 'min_samples_leaf': 10}. Best is trial 5 with value: 0.8498282566079176.

Number of finished trials: 10
Best trial:
Value: 0.8498282566079176
Params: 
    criterion: gini
    n_estimators: 68
    max_depth: 9
    min_samples_split: 6
    min_samples_leaf: 9

In [ ]:

# Train the model with the best hyperparameters found by the study
best_forest = RandomForestClassifier(**study.best_params, random_state=random_state)
best_forest.fit(X_train_valid, y_train_valid)

# Assuming X_train_valid combines train and valid, only the test row is a held-out estimate
res_forest = print_metrics(best_forest, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_forest

Out[ ]:

	precision	recall	f1	roc_auc
train	0.863	0.463	0.603	0.914
valid	0.894	0.477	0.622	0.908
test	0.795	0.411	0.542	0.857
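The table shows high precision but much lower recall, and the F1 column summarizes that trade-off as the harmonic mean of the two. As a quick sanity check, the sketch below recomputes F1 from the precision and recall values in the table. The helper name `f1_from` and the `rows` dictionary are illustrative and not part of the notebook; the inputs are already rounded, so agreement is only expected to three decimals.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the metrics table
rows = {
    "train": (0.863, 0.463, 0.603),
    "valid": (0.894, 0.477, 0.622),
    "test": (0.795, 0.411, 0.542),
}

for name, (p, r, reported) in rows.items():
    # Print the recomputed value next to the reported one
    print(name, round(f1_from(p, r), 3), reported)
```

Each recomputed value matches the reported column to three decimals. Because the harmonic mean is pulled toward the smaller of its two inputs, F1 stays close to the low recall rather than the high precision.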

Feature Importance Interpretation

In [ ]:

plot_importance(best_forest, X_train_valid)

Gradient Boosting¶

Choosing hyperparameters

learning_rate : Learning rate; controls how strongly the model is updated at each boosting step
max_depth : Maximum depth of the decision trees
l2_leaf_reg : L2 regularization coefficient for the leaf values
subsample : Fraction of the training sample used to fit each tree
random_strength : Amount of randomness applied when building the trees; it controls how strongly random perturbations affect the choice of split at each node
min_data_in_leaf : Minimum number of training samples required to form a leaf

Defining the objective function

We need to define the function that we are going to optimize

In [ ]:

from catboost import CatBoostClassifier

def objective(trial):
    # Search space: one value per hyperparameter is sampled for every trial
    param = {
        "learning_rate": trial.suggest_float('learning_rate', 0.01, 0.9),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "l2_leaf_reg": trial.suggest_float('l2_leaf_reg', 0.01, 2),
        "subsample": trial.suggest_float('subsample', 0.01, 1),
        "random_strength": trial.suggest_float('random_strength', 1, 200),
        # min_data_in_leaf is a sample count; it is sampled as a float here,
        # although trial.suggest_int would be the more natural choice
        "min_data_in_leaf": trial.suggest_float('min_data_in_leaf', 1, 500)
    }

    cat = CatBoostClassifier(
        logging_level="Silent",
        eval_metric="AUC",
        grow_policy="Lossguide",
        random_seed=42,
        **param)
    # Stop training once the validation metric has not improved for 10 iterations
    cat.fit(X_train, y_train,
            eval_set=(X_valid, y_valid),
            verbose=False,
            early_stopping_rounds=10
           )

    # Score the validation set with the predicted probability of the positive class
    preds = cat.predict_proba(X_valid)[:, 1]
    auc = roc_auc_score(y_valid, preds)

    return auc
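The objective returns ROC-AUC on the validation set, so it helps to recall what that number measures. The sketch below is a minimal pure-Python illustration of the pairwise-ranking view of ROC-AUC: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The function name `roc_auc` and the toy data are assumptions made for illustration; the notebook itself relies on `roc_auc_score` from scikit-learn.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y = [0, 0, 1, 1]
print(roc_auc(y, [0.1, 0.4, 0.35, 0.8]))  # 0.75: three of four pairs are ordered correctly
print(roc_auc(y, [0.5, 0.5, 0.5, 0.5]))   # 0.5: constant scores carry no ranking information
```

This quadratic loop is only meant to make the definition concrete; for real data the library implementation is the sensible choice. A value of exactly 0.5 means the scores do not separate the classes at all, for example when a model outputs the same probability for every row.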

Create the study object and run the optimization

In [ ]:

study = optuna.create_study(direction="maximize", study_name='CatBoostClassifier')
study.optimize(objective, n_trials=100) # try increasing this

print("Number of finished trials: {}".format(len(study.trials)))

print("Best trial:")
trial = study.best_trial

print("Value: {}".format(trial.value))

print("Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

[I 2024-08-14 03:46:44,732] A new study created in memory with name: CatBoostClassifier
[I 2024-08-14 03:46:44,948] Trial 0 finished with value: 0.5 and parameters: {'learning_rate': 0.2165534023121395, 'max_depth': 2, 'l2_leaf_reg': 1.5477322039068058, 'subsample': 0.029345419918240072, 'random_strength': 6.5809964961892495, 'min_data_in_leaf': 477.52416517082804}. Best is trial 0 with value: 0.5.
[I 2024-08-14 03:46:45,339] Trial 1 finished with value: 0.8372625321777863 and parameters: {'learning_rate': 0.789728112113311, 'max_depth': 4, 'l2_leaf_reg': 0.6004269201561995, 'subsample': 0.55359191980997, 'random_strength': 46.57861106355321, 'min_data_in_leaf': 425.7303912933811}. Best is trial 1 with value: 0.8372625321777863.
[I 2024-08-14 03:46:45,992] Trial 2 finished with value: 0.8485835604479672 and parameters: {'learning_rate': 0.4634724182719374, 'max_depth': 6, 'l2_leaf_reg': 0.575436612277608, 'subsample': 0.42186545862300934, 'random_strength': 123.97404853733563, 'min_data_in_leaf': 98.4208597017056}. Best is trial 2 with value: 0.8485835604479672.
[I 2024-08-14 03:46:48,201] Trial 3 finished with value: 0.8515032752320888 and parameters: {'learning_rate': 0.09543005782339105, 'max_depth': 9, 'l2_leaf_reg': 1.2167151901623765, 'subsample': 0.06245537706270228, 'random_strength': 61.228218388468974, 'min_data_in_leaf': 38.39023875452006}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:48,708] Trial 4 finished with value: 0.8406711796542305 and parameters: {'learning_rate': 0.7331013466248556, 'max_depth': 6, 'l2_leaf_reg': 0.012765794151882687, 'subsample': 0.14335624784813372, 'random_strength': 73.50485090264708, 'min_data_in_leaf': 380.00905654802966}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:50,233] Trial 5 finished with value: 0.8435615893243013 and parameters: {'learning_rate': 0.23370405202769556, 'max_depth': 9, 'l2_leaf_reg': 0.16497446248632272, 'subsample': 0.8752850397650734, 'random_strength': 111.64581392121151, 'min_data_in_leaf': 388.6064442254268}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:51,271] Trial 6 finished with value: 0.8226369667047633 and parameters: {'learning_rate': 0.04122993099237723, 'max_depth': 2, 'l2_leaf_reg': 0.20284874078763926, 'subsample': 0.6367717313128627, 'random_strength': 161.4305343264782, 'min_data_in_leaf': 18.271195536437006}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:51,778] Trial 7 finished with value: 0.8371075235482015 and parameters: {'learning_rate': 0.8582564268296371, 'max_depth': 7, 'l2_leaf_reg': 0.6159108475992469, 'subsample': 0.6430359063610354, 'random_strength': 194.77206434441388, 'min_data_in_leaf': 395.81326570058985}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:52,106] Trial 8 finished with value: 0.8514030208945464 and parameters: {'learning_rate': 0.6220500822136842, 'max_depth': 5, 'l2_leaf_reg': 0.6614202111677746, 'subsample': 0.3763622594650606, 'random_strength': 1.5372606636099502, 'min_data_in_leaf': 210.81684346685665}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:53,023] Trial 9 finished with value: 0.8379396345498039 and parameters: {'learning_rate': 0.7670675853970256, 'max_depth': 8, 'l2_leaf_reg': 1.8403233110251145, 'subsample': 0.33221049473597575, 'random_strength': 76.59908186125371, 'min_data_in_leaf': 220.27667189607422}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:53,795] Trial 10 finished with value: 0.8352975471619539 and parameters: {'learning_rate': 0.03939269976022153, 'max_depth': 10, 'l2_leaf_reg': 1.2239191289705484, 'subsample': 0.1971957977790358, 'random_strength': 45.97984119612045, 'min_data_in_leaf': 117.68547624545266}. Best is trial 3 with value: 0.8515032752320888.
[I 2024-08-14 03:46:54,182] Trial 11 finished with value: 0.8534204466407855 and parameters: {'learning_rate': 0.5292947899032613, 'max_depth': 4, 'l2_leaf_reg': 1.0562715822887339, 'subsample': 0.29238467023057874, 'random_strength': 1.0080155926863474, 'min_data_in_leaf': 217.05912068009718}. Best is trial 11 with value: 0.8534204466407855.
[I 2024-08-14 03:46:54,925] Trial 12 finished with value: 0.854156159240905 and parameters: {'learning_rate': 0.46667965752333035, 'max_depth': 4, 'l2_leaf_reg': 1.1034539045748595, 'subsample': 0.2177545543283828, 'random_strength': 33.69191424519235, 'min_data_in_leaf': 287.6354438748095}. Best is trial 12 with value: 0.854156159240905.
[I 2024-08-14 03:46:55,261] Trial 13 finished with value: 0.8481239328696957 and parameters: {'learning_rate': 0.47388030842524104, 'max_depth': 4, 'l2_leaf_reg': 0.9841376098092054, 'subsample': 0.23796767505443062, 'random_strength': 30.559857725987595, 'min_data_in_leaf': 304.8772947881302}. Best is trial 12 with value: 0.854156159240905.
[I 2024-08-14 03:46:55,774] Trial 14 finished with value: 0.8512302749590884 and parameters: {'learning_rate': 0.5677435867325714, 'max_depth': 4, 'l2_leaf_reg': 0.9858935858858828, 'subsample': 0.2880787724849706, 'random_strength': 19.641914653188316, 'min_data_in_leaf': 296.50997464128113}. Best is trial 12 with value: 0.854156159240905.
[I 2024-08-14 03:46:56,213] Trial 15 finished with value: 0.8494503748741036 and parameters: {'learning_rate': 0.37166572152717614, 'max_depth': 3, 'l2_leaf_reg': 1.4583249930757494, 'subsample': 0.9955315242420071, 'random_strength': 88.02174705220291, 'min_data_in_leaf': 179.3738772355934}. Best is trial 12 with value: 0.854156159240905.
[I 2024-08-14 03:46:57,055] Trial 16 finished with value: 0.8559885000562966 and parameters: {'learning_rate': 0.3210310795870268, 'max_depth': 5, 'l2_leaf_reg': 1.9094658324753917, 'subsample': 0.4804851220395184, 'random_strength': 33.49522734292147, 'min_data_in_leaf': 287.9109110478402}. Best is trial 16 with value: 0.8559885000562966.
[I 2024-08-14 03:46:57,763] Trial 17 finished with value: 0.847795407117441 and parameters: {'learning_rate': 0.35378097903239786, 'max_depth': 5, 'l2_leaf_reg': 1.9401989755988598, 'subsample': 0.48416569735771703, 'random_strength': 40.07318375520298, 'min_data_in_leaf': 303.24367032588754}. Best is trial 16 with value: 0.8559885000562966.
[I 2024-08-14 03:46:58,649] Trial 18 finished with value: 0.8543381594229051 and parameters: {'learning_rate': 0.3270310046948177, 'max_depth': 5, 'l2_leaf_reg': 1.6932697259348661, 'subsample': 0.79447929672279, 'random_strength': 136.12105973617452, 'min_data_in_leaf': 329.93271902824506}. Best is trial 16 with value: 0.8559885000562966.
[I 2024-08-14 03:46:59,576] Trial 19 finished with value: 0.8500765788901381 and parameters: {'learning_rate': 0.2708829352692093, 'max_depth': 7, 'l2_leaf_reg': 1.7038356958216352, 'subsample': 0.7632729569486497, 'random_strength': 138.4597489310068, 'min_data_in_leaf': 343.8985733074137}. Best is trial 16 with value: 0.8559885000562966.
[I 2024-08-14 03:47:00,116] Trial 20 finished with value: 0.8531150565048871 and parameters: {'learning_rate': 0.3626267613161781, 'max_depth': 5, 'l2_leaf_reg': 1.9795846165177058, 'subsample': 0.7827032985899299, 'random_strength': 142.73839861496384, 'min_data_in_leaf': 493.0493076589869}. Best is trial 16 with value: 0.8559885000562966.
[I 2024-08-14 03:47:00,940] Trial 21 finished with value: 0.8563910597808904 and parameters: {'learning_rate': 0.16961344589594285, 'max_depth': 3, 'l2_leaf_reg': 1.649834982836516, 'subsample': 0.5749517198100278, 'random_strength': 99.89102990362149, 'min_data_in_leaf': 271.7012421453372}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:02,167] Trial 22 finished with value: 0.8344091394938851 and parameters: {'learning_rate': 0.14552950442407062, 'max_depth': 3, 'l2_leaf_reg': 1.6649232359966202, 'subsample': 0.5985514802507333, 'random_strength': 169.99871764933434, 'min_data_in_leaf': 261.3080549135861}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:04,114] Trial 23 finished with value: 0.8545186172304817 and parameters: {'learning_rate': 0.15793176071004453, 'max_depth': 3, 'l2_leaf_reg': 1.3526112935873842, 'subsample': 0.7175761381145475, 'random_strength': 113.99285178004443, 'min_data_in_leaf': 352.6566495237683}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:06,560] Trial 24 finished with value: 0.8553638384146858 and parameters: {'learning_rate': 0.1556921373541092, 'max_depth': 3, 'l2_leaf_reg': 1.4115008286364423, 'subsample': 0.480391200689659, 'random_strength': 100.39425793454086, 'min_data_in_leaf': 350.01711545220127}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:08,806] Trial 25 finished with value: 0.8480560683950514 and parameters: {'learning_rate': 0.16409141193428203, 'max_depth': 2, 'l2_leaf_reg': 1.496137374339839, 'subsample': 0.46620575811523923, 'random_strength': 91.45570082665543, 'min_data_in_leaf': 163.57983997102036}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:10,275] Trial 26 finished with value: 0.8544893121164308 and parameters: {'learning_rate': 0.2837819547192468, 'max_depth': 3, 'l2_leaf_reg': 1.8020953679362315, 'subsample': 0.5189916467893968, 'random_strength': 101.08298523619007, 'min_data_in_leaf': 253.22233056776207}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:10,680] Trial 27 finished with value: 0.8211246685822957 and parameters: {'learning_rate': 0.1044556356752651, 'max_depth': 2, 'l2_leaf_reg': 1.3459091803909486, 'subsample': 0.431619565523395, 'random_strength': 59.294142417791434, 'min_data_in_leaf': 449.8387981624195}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:10,968] Trial 28 finished with value: 0.8035146085993543 and parameters: {'learning_rate': 0.21613547632724206, 'max_depth': 3, 'l2_leaf_reg': 1.6092822334057308, 'subsample': 0.5676045993284018, 'random_strength': 72.80957980792742, 'min_data_in_leaf': 263.0893593280764}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:11,494] Trial 29 finished with value: 0.8547792785080921 and parameters: {'learning_rate': 0.2086592133839792, 'max_depth': 2, 'l2_leaf_reg': 0.819697698152623, 'subsample': 0.6834571224386156, 'random_strength': 19.984681853813314, 'min_data_in_leaf': 453.6832846244588}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:12,284] Trial 30 finished with value: 0.8367967351018197 and parameters: {'learning_rate': 0.3944960136703665, 'max_depth': 7, 'l2_leaf_reg': 1.8676872612497761, 'subsample': 0.38308326997649067, 'random_strength': 154.4491135123389, 'min_data_in_leaf': 332.6615450443436}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:12,895] Trial 31 finished with value: 0.8546836512938208 and parameters: {'learning_rate': 0.21596851649554094, 'max_depth': 2, 'l2_leaf_reg': 0.8664350791512772, 'subsample': 0.6869620770295259, 'random_strength': 24.469562538665006, 'min_data_in_leaf': 444.9760164258388}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:13,411] Trial 32 finished with value: 0.8548571684164904 and parameters: {'learning_rate': 0.28016951231760356, 'max_depth': 2, 'l2_leaf_reg': 0.806914526399829, 'subsample': 0.5400321582582661, 'random_strength': 16.595337724789662, 'min_data_in_leaf': 425.5676845159874}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:14,332] Trial 33 finished with value: 0.8465121515968972 and parameters: {'learning_rate': 0.2943832841669326, 'max_depth': 3, 'l2_leaf_reg': 1.4002865433211302, 'subsample': 0.5244965679173971, 'random_strength': 55.18009770759243, 'min_data_in_leaf': 411.0478307080825}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:15,095] Trial 34 finished with value: 0.8532785481938023 and parameters: {'learning_rate': 0.41204275970364945, 'max_depth': 4, 'l2_leaf_reg': 1.5580897723724074, 'subsample': 0.588510970323773, 'random_strength': 119.79281757796326, 'min_data_in_leaf': 388.1140383915803}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:15,717] Trial 35 finished with value: 0.8286869303818456 and parameters: {'learning_rate': 0.08975166153572325, 'max_depth': 2, 'l2_leaf_reg': 1.2419086491253597, 'subsample': 0.44346103753859356, 'random_strength': 12.29947978876556, 'min_data_in_leaf': 364.58824440904306}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:16,582] Trial 36 finished with value: 0.852669310296429 and parameters: {'learning_rate': 0.250033922012678, 'max_depth': 6, 'l2_leaf_reg': 0.42864672108139, 'subsample': 0.5268262007033628, 'random_strength': 95.42973443090546, 'min_data_in_leaf': 429.3365607018824}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:17,024] Trial 37 finished with value: 0.8127665415801009 and parameters: {'learning_rate': 0.1691762432733271, 'max_depth': 3, 'l2_leaf_reg': 1.7829743976227164, 'subsample': 0.3765836616592204, 'random_strength': 105.88486237007761, 'min_data_in_leaf': 475.9380950106968}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:17,801] Trial 38 finished with value: 0.852903751208836 and parameters: {'learning_rate': 0.3144852057758997, 'max_depth': 5, 'l2_leaf_reg': 0.7468205513730214, 'subsample': 0.6288168983269199, 'random_strength': 81.03884687055434, 'min_data_in_leaf': 278.02630322886773}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:18,104] Trial 39 finished with value: 0.7429039208700227 and parameters: {'learning_rate': 0.0916001703115347, 'max_depth': 2, 'l2_leaf_reg': 0.3209407231862438, 'subsample': 0.48947279248425846, 'random_strength': 63.425336876240344, 'min_data_in_leaf': 312.50209443952326}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:18,299] Trial 40 finished with value: 0.5 and parameters: {'learning_rate': 0.6819438096729318, 'max_depth': 6, 'l2_leaf_reg': 1.1696390615306496, 'subsample': 0.02690876704491979, 'random_strength': 186.05663001448391, 'min_data_in_leaf': 237.24363431981527}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:18,628] Trial 41 finished with value: 0.8275224376919292 and parameters: {'learning_rate': 0.20539423781999722, 'max_depth': 2, 'l2_leaf_reg': 0.7875365906927447, 'subsample': 0.7197475219510165, 'random_strength': 14.4847764719263, 'min_data_in_leaf': 469.9772033903035}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:19,489] Trial 42 finished with value: 0.8467774399977791 and parameters: {'learning_rate': 0.11258151909406194, 'max_depth': 2, 'l2_leaf_reg': 0.8721010222525916, 'subsample': 0.6594836798807467, 'random_strength': 48.89095313702967, 'min_data_in_leaf': 402.73822377455843}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:20,194] Trial 43 finished with value: 0.8518981230845636 and parameters: {'learning_rate': 0.19221299265782907, 'max_depth': 3, 'l2_leaf_reg': 0.4867586865668939, 'subsample': 0.5571688603077394, 'random_strength': 31.93422189548409, 'min_data_in_leaf': 363.31928598127024}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:22,996] Trial 44 finished with value: 0.8487424250136115 and parameters: {'learning_rate': 0.039144856136337974, 'max_depth': 4, 'l2_leaf_reg': 0.9009428480096298, 'subsample': 0.8375833862722262, 'random_strength': 6.814579833825066, 'min_data_in_leaf': 432.94909658576285}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:23,939] Trial 45 finished with value: 0.8486205774341369 and parameters: {'learning_rate': 0.2510132838894006, 'max_depth': 2, 'l2_leaf_reg': 0.6907636418579387, 'subsample': 0.6280995182686069, 'random_strength': 24.31951969212401, 'min_data_in_leaf': 374.4772844339267}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:25,057] Trial 46 finished with value: 0.8486591367947299 and parameters: {'learning_rate': 0.4177823887952824, 'max_depth': 3, 'l2_leaf_reg': 0.5837206207668852, 'subsample': 0.10938321642670351, 'random_strength': 41.931702968776605, 'min_data_in_leaf': 178.84495282770186}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:25,696] Trial 47 finished with value: 0.823857756061146 and parameters: {'learning_rate': 0.019782576463977575, 'max_depth': 4, 'l2_leaf_reg': 1.9024552651741735, 'subsample': 0.40297843167975866, 'random_strength': 66.82247683667727, 'min_data_in_leaf': 316.8543796002272}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:26,796] Trial 48 finished with value: 0.8527418018943443 and parameters: {'learning_rate': 0.12284365746995335, 'max_depth': 8, 'l2_leaf_reg': 1.5884344189532063, 'subsample': 0.3276266788784489, 'random_strength': 51.29989880997462, 'min_data_in_leaf': 459.0054658582284}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:27,241] Trial 49 finished with value: 0.8540805828941422 and parameters: {'learning_rate': 0.3299775712117529, 'max_depth': 3, 'l2_leaf_reg': 1.0878543895513753, 'subsample': 0.6942586059012404, 'random_strength': 11.713670742817778, 'min_data_in_leaf': 495.2165248684829}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:27,758] Trial 50 finished with value: 0.8472201014573897 and parameters: {'learning_rate': 0.25324978223894085, 'max_depth': 4, 'l2_leaf_reg': 1.7363552106408617, 'subsample': 0.5877810779665728, 'random_strength': 37.972565443197155, 'min_data_in_leaf': 412.6914975963677}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:28,249] Trial 51 finished with value: 0.8474005592649662 and parameters: {'learning_rate': 0.19612243767630727, 'max_depth': 2, 'l2_leaf_reg': 0.8917975193826132, 'subsample': 0.6848753748976462, 'random_strength': 24.324814512268816, 'min_data_in_leaf': 436.96437349015264}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:28,631] Trial 52 finished with value: 0.8247153162407398 and parameters: {'learning_rate': 0.06857413116836153, 'max_depth': 2, 'l2_leaf_reg': 0.8093899327079714, 'subsample': 0.8856644877809534, 'random_strength': 22.15656724840708, 'min_data_in_leaf': 454.16888064861433}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:28,958] Trial 53 finished with value: 0.8552512450817537 and parameters: {'learning_rate': 0.2277359380751674, 'max_depth': 2, 'l2_leaf_reg': 1.0044730574652196, 'subsample': 0.5426082432903984, 'random_strength': 2.092346231014936, 'min_data_in_leaf': 283.86020680504697}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:29,458] Trial 54 finished with value: 0.8545217019793291 and parameters: {'learning_rate': 0.3128193984950219, 'max_depth': 3, 'l2_leaf_reg': 1.008665858224686, 'subsample': 0.47370734259106645, 'random_strength': 6.662500258650947, 'min_data_in_leaf': 283.74741280019464}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:29,754] Trial 55 finished with value: 0.8421464607905287 and parameters: {'learning_rate': 0.8773525297785807, 'max_depth': 2, 'l2_leaf_reg': 1.149059206077042, 'subsample': 0.5453409140242778, 'random_strength': 128.55684672230558, 'min_data_in_leaf': 233.31803747702202}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:30,145] Trial 56 finished with value: 0.8550931517033213 and parameters: {'learning_rate': 0.13656066626789937, 'max_depth': 2, 'l2_leaf_reg': 1.2919909470689186, 'subsample': 0.43933745123796614, 'random_strength': 1.7054069646826093, 'min_data_in_leaf': 62.17289685684315}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:30,493] Trial 57 finished with value: 0.8527880731270562 and parameters: {'learning_rate': 0.13041630646518754, 'max_depth': 10, 'l2_leaf_reg': 1.274378826982479, 'subsample': 0.34334490211186264, 'random_strength': 4.2565201685199305, 'min_data_in_leaf': 71.55126556167352}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:30,682] Trial 58 finished with value: 0.8514184446387836 and parameters: {'learning_rate': 0.16594689210860503, 'max_depth': 3, 'l2_leaf_reg': 1.4888320186002333, 'subsample': 0.4376508602553443, 'random_strength': 1.3885552941461277, 'min_data_in_leaf': 152.70173531249966}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:31,007] Trial 59 finished with value: 0.8540620744010574 and parameters: {'learning_rate': 0.2750010284884532, 'max_depth': 5, 'l2_leaf_reg': 1.2746287691142977, 'subsample': 0.5071138630239854, 'random_strength': 16.021246309606525, 'min_data_in_leaf': 8.92644330176222}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:31,275] Trial 60 finished with value: 0.8236294846464337 and parameters: {'learning_rate': 0.06580145985259861, 'max_depth': 4, 'l2_leaf_reg': 1.4207580932038155, 'subsample': 0.2559732452953547, 'random_strength': 34.10123336167225, 'min_data_in_leaf': 35.04352061528167}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:31,513] Trial 61 finished with value: 0.8484570857452214 and parameters: {'learning_rate': 0.23125001722207209, 'max_depth': 2, 'l2_leaf_reg': 0.951125702424557, 'subsample': 0.45236883411477885, 'random_strength': 16.37629361091796, 'min_data_in_leaf': 276.32032283099494}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:31,848] Trial 62 finished with value: 0.8491604084824422 and parameters: {'learning_rate': 0.18521171522349617, 'max_depth': 2, 'l2_leaf_reg': 1.0233833614681076, 'subsample': 0.6028311345426446, 'random_strength': 27.84478190364073, 'min_data_in_leaf': 212.63167127467798}. Best is trial 21 with value: 0.8563910597808904.
[I 2024-08-14 03:47:32,139] Trial 63 finished with value: 0.8579966715559936 and parameters: {'learning_rate': 0.2275828875803787, 'max_depth': 2, 'l2_leaf_reg': 1.649969257251929, 'subsample': 0.41432801712540723, 'random_strength': 9.871440937029153, 'min_data_in_leaf': 339.4206240957627}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:32,353] Trial 64 finished with value: 0.8410583156345868 and parameters: {'learning_rate': 0.15443576863058872, 'max_depth': 3, 'l2_leaf_reg': 1.6356148561301187, 'subsample': 0.40829684886366496, 'random_strength': 8.920803659606767, 'min_data_in_leaf': 329.25270044657947}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:32,559] Trial 65 finished with value: 0.8475794746981188 and parameters: {'learning_rate': 0.52262191640972, 'max_depth': 2, 'l2_leaf_reg': 1.7570515967211897, 'subsample': 0.35209745591260677, 'random_strength': 79.31033248169595, 'min_data_in_leaf': 298.8152094762548}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:32,817] Trial 66 finished with value: 0.8545409816596259 and parameters: {'learning_rate': 0.3397594287720047, 'max_depth': 3, 'l2_leaf_reg': 1.51439591232287, 'subsample': 0.48690121670146136, 'random_strength': 113.64174712891179, 'min_data_in_leaf': 344.9705847573752}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:33,180] Trial 67 finished with value: 0.8527741917572427 and parameters: {'learning_rate': 0.2872741864314234, 'max_depth': 6, 'l2_leaf_reg': 1.9758241623536021, 'subsample': 0.30706088302917706, 'random_strength': 86.69768770901611, 'min_data_in_leaf': 195.60397604666105}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:33,542] Trial 68 finished with value: 0.8388542625830762 and parameters: {'learning_rate': 0.37734226459157394, 'max_depth': 7, 'l2_leaf_reg': 1.84386701582687, 'subsample': 0.5416787382022203, 'random_strength': 102.93491264871817, 'min_data_in_leaf': 100.73573695821503}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:33,798] Trial 69 finished with value: 0.8491079677520357 and parameters: {'learning_rate': 0.2396158685166121, 'max_depth': 3, 'l2_leaf_reg': 1.3497075412335753, 'subsample': 0.41533858440585675, 'random_strength': 9.011802604306009, 'min_data_in_leaf': 244.2984153117823}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:34,249] Trial 70 finished with value: 0.8551000923882279 and parameters: {'learning_rate': 0.1405771071007593, 'max_depth': 2, 'l2_leaf_reg': 1.6670425816135785, 'subsample': 0.5628243022632802, 'random_strength': 43.63893059165331, 'min_data_in_leaf': 265.0039335083252}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:34,463] Trial 71 finished with value: 0.8369964725896929 and parameters: {'learning_rate': 0.1346261793235442, 'max_depth': 2, 'l2_leaf_reg': 1.69505852020515, 'subsample': 0.5662727341567123, 'random_strength': 18.17730751314747, 'min_data_in_leaf': 261.9534789983721}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:34,721] Trial 72 finished with value: 0.8321156287257981 and parameters: {'learning_rate': 0.07724194746378436, 'max_depth': 2, 'l2_leaf_reg': 1.6529774456888673, 'subsample': 0.5130735404504633, 'random_strength': 43.04440541458727, 'min_data_in_leaf': 271.8748406602916}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:34,936] Trial 73 finished with value: 0.8534474381932009 and parameters: {'learning_rate': 0.2193541378472138, 'max_depth': 2, 'l2_leaf_reg': 1.55127358733355, 'subsample': 0.45844426802461946, 'random_strength': 1.2277571480166767, 'min_data_in_leaf': 323.54337938841485}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:35,129] Trial 74 finished with value: 0.8245140363784432 and parameters: {'learning_rate': 0.17096273713989574, 'max_depth': 3, 'l2_leaf_reg': 1.4299055079718532, 'subsample': 0.6103423491223143, 'random_strength': 35.52778769358011, 'min_data_in_leaf': 309.4874282846581}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:35,471] Trial 75 finished with value: 0.849687900535358 and parameters: {'learning_rate': 0.1424757474479632, 'max_depth': 2, 'l2_leaf_reg': 1.9040100631862928, 'subsample': 0.49178296476749883, 'random_strength': 29.861178645449915, 'min_data_in_leaf': 288.4169959906686}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:35,653] Trial 76 finished with value: 0.8166471556302065 and parameters: {'learning_rate': 0.10900450958876559, 'max_depth': 3, 'l2_leaf_reg': 1.1584526059635818, 'subsample': 0.5436945071994185, 'random_strength': 97.11608635264365, 'min_data_in_leaf': 252.08827881313073}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:36,111] Trial 77 finished with value: 0.8529130054553784 and parameters: {'learning_rate': 0.26344618455418334, 'max_depth': 2, 'l2_leaf_reg': 1.8036914133857063, 'subsample': 0.5759920935180337, 'random_strength': 149.97345964123704, 'min_data_in_leaf': 230.53437749650683}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:36,464] Trial 78 finished with value: 0.8554301605149063 and parameters: {'learning_rate': 0.30386516642561456, 'max_depth': 2, 'l2_leaf_reg': 1.5941173743706698, 'subsample': 0.388487385094279, 'random_strength': 12.04964504414562, 'min_data_in_leaf': 340.15331434948547}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:36,925] Trial 79 finished with value: 0.8556607454912538 and parameters: {'learning_rate': 0.3043185361985488, 'max_depth': 3, 'l2_leaf_reg': 1.6062543081685263, 'subsample': 0.38163010145401816, 'random_strength': 10.801512721309756, 'min_data_in_leaf': 336.9094072537885}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:37,511] Trial 80 finished with value: 0.8507459693900372 and parameters: {'learning_rate': 0.2967690808832415, 'max_depth': 4, 'l2_leaf_reg': 1.5769102540118054, 'subsample': 0.3683396074299298, 'random_strength': 119.50189785234694, 'min_data_in_leaf': 352.7530072728084}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:37,788] Trial 81 finished with value: 0.8495382902162564 and parameters: {'learning_rate': 0.3519982264605916, 'max_depth': 3, 'l2_leaf_reg': 1.491074642500498, 'subsample': 0.39339814284208235, 'random_strength': 9.046452450187513, 'min_data_in_leaf': 338.6181638900527}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:38,153] Trial 82 finished with value: 0.8492305865187221 and parameters: {'learning_rate': 0.442607098792248, 'max_depth': 2, 'l2_leaf_reg': 1.7262120187355963, 'subsample': 0.27234819741871114, 'random_strength': 12.955991756657548, 'min_data_in_leaf': 381.46339177847165}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:38,943] Trial 83 finished with value: 0.8509557323116645 and parameters: {'learning_rate': 0.18048848769077014, 'max_depth': 9, 'l2_leaf_reg': 1.6096921762481307, 'subsample': 0.31232826961375537, 'random_strength': 27.6479977169267, 'min_data_in_leaf': 362.9959768769905}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:39,536] Trial 84 finished with value: 0.8530302259115817 and parameters: {'learning_rate': 0.22708933829872321, 'max_depth': 3, 'l2_leaf_reg': 1.6632316572246608, 'subsample': 0.4206405376875545, 'random_strength': 20.498316514413297, 'min_data_in_leaf': 297.40518856402844}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:40,270] Trial 85 finished with value: 0.844975946670862 and parameters: {'learning_rate': 0.3134824967461136, 'max_depth': 8, 'l2_leaf_reg': 1.458604426862953, 'subsample': 0.46636325980991755, 'random_strength': 172.65958600142233, 'min_data_in_leaf': 308.2224094733233}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:40,626] Trial 86 finished with value: 0.853733548648803 and parameters: {'learning_rate': 0.20516007929387697, 'max_depth': 4, 'l2_leaf_reg': 1.8081881256255257, 'subsample': 0.3669610320292627, 'random_strength': 71.7353570023819, 'min_data_in_leaf': 349.71334841544535}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:40,844] Trial 87 finished with value: 0.847503898351356 and parameters: {'learning_rate': 0.3850725922826087, 'max_depth': 2, 'l2_leaf_reg': 1.3738184718048245, 'subsample': 0.4394316413220218, 'random_strength': 13.073971750314525, 'min_data_in_leaf': 319.76257608276416}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:41,130] Trial 88 finished with value: 0.8509834950512916 and parameters: {'learning_rate': 0.2486495279955683, 'max_depth': 3, 'l2_leaf_reg': 1.3060721511513962, 'subsample': 0.5036611916707044, 'random_strength': 6.582515968741947, 'min_data_in_leaf': 288.24916069068973}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:41,464] Trial 89 finished with value: 0.8411153834882648 and parameters: {'learning_rate': 0.05111799564446984, 'max_depth': 2, 'l2_leaf_reg': 1.537356157859457, 'subsample': 0.3880066956376303, 'random_strength': 4.295324207554229, 'min_data_in_leaf': 266.43818936484587}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:41,615] Trial 90 finished with value: 0.809689504604759 and parameters: {'learning_rate': 0.14849530714748765, 'max_depth': 2, 'l2_leaf_reg': 1.604654407357084, 'subsample': 0.6488514855233791, 'random_strength': 109.07773151115863, 'min_data_in_leaf': 325.72222318586404}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:41,889] Trial 91 finished with value: 0.8535808535808537 and parameters: {'learning_rate': 0.2768348084629538, 'max_depth': 2, 'l2_leaf_reg': 1.7020206569000091, 'subsample': 0.5270113119832779, 'random_strength': 19.279543505451198, 'min_data_in_leaf': 390.8245718531585}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:42,108] Trial 92 finished with value: 0.8445248021519208 and parameters: {'learning_rate': 0.2980320322130359, 'max_depth': 2, 'l2_leaf_reg': 0.9413144021105265, 'subsample': 0.18491093410744158, 'random_strength': 23.462444073545914, 'min_data_in_leaf': 371.68996169817217}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:42,275] Trial 93 finished with value: 0.8438677506474116 and parameters: {'learning_rate': 0.8018049022359041, 'max_depth': 3, 'l2_leaf_reg': 0.7207307279562393, 'subsample': 0.4817377653790264, 'random_strength': 54.11215711265974, 'min_data_in_leaf': 247.83798275985313}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:42,629] Trial 94 finished with value: 0.8526831916662426 and parameters: {'learning_rate': 0.18976503763926902, 'max_depth': 5, 'l2_leaf_reg': 1.0778005788536056, 'subsample': 0.42051153997605584, 'random_strength': 15.934790667198342, 'min_data_in_leaf': 418.39615211072214}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:42,894] Trial 95 finished with value: 0.8465136939713211 and parameters: {'learning_rate': 0.3330384372789757, 'max_depth': 2, 'l2_leaf_reg': 1.9197824427277657, 'subsample': 0.5366007428124753, 'random_strength': 44.67016583545405, 'min_data_in_leaf': 200.40708572900584}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:43,205] Trial 96 finished with value: 0.8505886471988167 and parameters: {'learning_rate': 0.26241859653774113, 'max_depth': 3, 'l2_leaf_reg': 0.6304145123151321, 'subsample': 0.4542962823423417, 'random_strength': 37.17578672829132, 'min_data_in_leaf': 399.3636999400437}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:43,443] Trial 97 finished with value: 0.8340196899518935 and parameters: {'learning_rate': 0.09242477857682523, 'max_depth': 2, 'l2_leaf_reg': 0.0645769669423184, 'subsample': 0.5820960690872515, 'random_strength': 11.65162235424116, 'min_data_in_leaf': 358.5938499475734}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:43,772] Trial 98 finished with value: 0.8560116356726526 and parameters: {'learning_rate': 0.2238295228087367, 'max_depth': 2, 'l2_leaf_reg': 1.4538002550468356, 'subsample': 0.6177874774515071, 'random_strength': 26.070466696158935, 'min_data_in_leaf': 336.943504306358}. Best is trial 63 with value: 0.8579966715559936.
[I 2024-08-14 03:47:44,246] Trial 99 finished with value: 0.8533078533078534 and parameters: {'learning_rate': 0.12342253771037937, 'max_depth': 3, 'l2_leaf_reg': 1.446914784821623, 'subsample': 0.6193773627486576, 'random_strength': 32.4812721291412, 'min_data_in_leaf': 336.8229861132391}. Best is trial 63 with value: 0.8579966715559936.

Number of finished trials: 100
Best trial:
Value: 0.8579966715559936
Params: 
    learning_rate: 0.2275828875803787
    max_depth: 2
    l2_leaf_reg: 1.649969257251929
    subsample: 0.41432801712540723
    random_strength: 9.871440937029153
    min_data_in_leaf: 339.4206240957627

In [ ]:

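# Refit CatBoost with the best hyperparameters found by Optuna
best_cat = CatBoostClassifier(**study.best_params, random_state=random_state)
best_cat.fit(
    X_train, y_train,
    eval_set=(X_valid, y_valid),
    verbose=False,
    early_stopping_rounds=10,  # stop after 10 rounds without improvement on the validation set
)

# Precision, recall, F1 and ROC-AUC on the train, validation and test splits
res_cat = print_metrics(best_cat, X_train, y_train, X_valid, y_valid, X_test, y_test)
res_cat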

Out[ ]:

	precision	recall	f1	roc_auc
train	0.786	0.468	0.587	0.877
valid	0.825	0.464	0.594	0.855
test	0.754	0.442	0.557	0.856

Interpreting Feature Importance

For gradient boosting models such as CatBoost, SHAP values can be used to interpret feature importance.

The higher a feature is placed in the plot, the more important it is.
The redder a point, the higher the value of that feature for that observation.

SHAP value

SHAP values represent the contribution of each feature to the model's prediction for a given observation.
Positive SHAP values indicate that the feature pushes the prediction up (toward the positive class, i.e. customer churn); negative values push it down.
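To make "contribution of each feature" concrete, here is a minimal sketch that computes exact Shapley values by brute force for a tiny hand-written scoring function. The function `toy_score`, its coefficients, the feature order (age, balance, is_active) and the baseline are all hypothetical and are not taken from the churn dataset. `shap.TreeExplainer` uses a much faster tree-specific algorithm, so this is only meant to build intuition: the contributions add up to the difference between the prediction and the baseline prediction.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Features outside a coalition are replaced by their baseline value.
    The cost is exponential in the number of features: toy use only.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_score(features):
    """Hypothetical churn score (log-odds scale) with one interaction term."""
    age, balance, is_active = features
    return 0.05 * age + 0.3 * balance - 1.5 * is_active + 0.02 * age * balance


x = [50.0, 2.0, 0.0]          # the client being explained (inactive)
baseline = [40.0, 1.0, 1.0]   # hypothetical reference client (active)

phi = shapley_values(toy_score, x, baseline)
gap = toy_score(x) - toy_score(baseline)

for name, value in zip(["age", "balance", "is_active"], phi):
    print(f"{name}: {value:+.3f}")
print(f"sum of contributions: {sum(phi):.3f}, prediction gap: {gap:.3f}")
```

All three contributions are positive here, so each feature pushes this hypothetical client toward churn, and their sum matches the gap between the client's score and the baseline score.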

Distribution of Points

The distribution of points for each feature shows how much that feature's influence on the prediction varies across observations.
A wider distribution indicates that the feature's effect differs from one observation to another, depending on its value and on other factors.
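The two readings of the plot (vertical order and horizontal spread) can be reproduced numerically. The sketch below uses a small made-up matrix of SHAP values; `toy_shap` and the feature names are hypothetical and do not come from the trained model. It ranks features by mean absolute SHAP value, which, as far as I know, is the default ordering of `shap.plots.beeswarm`, and measures the spread of each column.

```python
from statistics import mean, pstdev

# Hypothetical SHAP values: one row per client, one column per feature
feature_names = ["age", "balance", "is_active"]
toy_shap = [
    [0.8, 0.1, 1.5],
    [-0.6, 0.2, -1.4],
    [0.7, -0.1, 1.6],
    [-0.9, 0.0, -1.5],
]

# Transpose rows into per-feature columns
columns = list(zip(*toy_shap))

# Global importance: mean absolute contribution of each feature
importance = {
    name: mean(abs(v) for v in col)
    for name, col in zip(feature_names, columns)
}

# Spread: how much the contribution varies from client to client
spread = {name: pstdev(col) for name, col in zip(feature_names, columns)}

# Most important feature first, as in the top-to-bottom order of the plot
ranking = sorted(feature_names, key=importance.get, reverse=True)

for name in ranking:
    print(f"{name}: mean|SHAP|={importance[name]:.2f}, spread={spread[name]:.2f}")
```

With real data, the same quantities could be computed from the `values` array of the explanation object instead of the toy list.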

In [ ]:

import shap

# TreeExplainer computes SHAP values for tree ensembles such as CatBoost
explainer = shap.TreeExplainer(best_cat)
shap_values = explainer(X_train_valid)

In [ ]:

shap.plots.beeswarm(shap_values)