Ovarian phase classification in felids
pip install -U kaleido
Successfully installed kaleido-0.2.1
Note: you may need to restart the kernel to use updated packages.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, roc_auc_score
from plotly.subplots import make_subplots
from sklearn.metrics import confusion_matrix
import plotly.graph_objects as go
import plotly.figure_factory as ff
from catboost import CatBoostClassifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as py
py.init_notebook_mode(connected=True)
import os
import warnings
warnings.filterwarnings("ignore")
❯❯ MACHINE LEARNING & VETERINARY SCIENCE¶
- There are definite parallels between veterinary science and human medical science
- Probably due to funding, veterinary science has tended to lag behind the human medical field in its incorporation of ML, DS & AI
- Even on Kaggle, we can note the large number of medicine-related competitions, but how many veterinary-related ones are there?
- Despite its absence on Kaggle, there is definitely no shortage of studies that attempt to use tools like machine learning to solve problems in veterinary science; take the study by Schofield et al as just one good example, & a recent review of ML applications in the veterinary field
- There is also no shortage of experts in the veterinary field (@avma article) who praise and encourage the use of ML/AI in the field, but clinics aren't exactly in a rush to integrate DS & ML engineers, doing so mostly only where there is substantial funding.
- Another obvious problem arises as well: where to get data? There obviously isn't a strong desire to release clinical data publicly, yet there is a lot of commonality between the two fields: take ultrasound imaging for example, veterinary analyses also utilise ultrasound machines. The dataset used in this notebook is a compilation from different journals, from which the data has been nicely assembled and published on @data.mendeley, so that's one place where we can access data.
- Veterinary science & machine learning is definitely an exciting combination, especially if you love animals, & just to fit the theme, here we'll be looking at a classification machine learning problem using the CatBoost classifier
❯❯ APPLICATION IN REPRODUCTOLOGY¶
- This notebook is about feline reproductology; a good friend of mine works as a reproductologist, and hence I wanted to understand the topic a little better for myself
- The job of reproductologists is mainly to guide & ensure the safe delivery of offspring. Felines, just like humans, go through similar reproductive processes, so that's what we'll look at here.
❯❯ PREGNANCY DIAGNOSIS¶
Main methods for pregnancy diagnosis:
- Ultrasound imaging (most common method for domestic cats)
- Longitudinal Endocrine Assessments (progestins, prostaglandins, relaxin) (most common method overall)
- Fecal Protein Assessments (hormone assessments for non-domestic cats)
- Vaginal Cytology
- Laparoscopy
Two of the outlined approaches (endocrine & fecal hormone assessments) are relevant to the dataset used in this notebook, & we'll actually be looking at ways to combine information from both and do some classification
❯❯ PHASES OF AN ESTROUS CYCLE¶
The article Estrous Cycle outlines four main phases of the estrous cycle:
- Proestrus
- Estrus
- Metestrus/Diestrus
- Anestrus/Basal
- The dataset includes reference to three of the four stages: Estrus, Diestrus & Basal phases
- This is quite a general division & indeed in this dataset, diestrus is further divided into two phases:
- NPLP (pseudo-pregnancy)
- PLP (pregnant)
- A description of the differences between the two is added in Section 2.2
Another reference outlines the behavioural phases as well (figure recreated from Amanda Petersen PhD, showing the order of the estrous cycle)
❯❯ ESTROUS CYCLE RELATED HORMONES¶
- The dataset makes reference to two hormones; estradiol & progesterone, which are both reproductive hormones
- Despite the human-oriented descriptions below, their properties shouldn't really differ much for animals, although human & feline pregnancies do have their differences
- Of the two hormones, I am led to believe that measuring progesterone rather than estradiol is the more popular choice in veterinary clinics
ESTRADIOL¶
Estradiol (PubChem NCBI) | Snippet from Endocrine.org about estradiol in humans:
Also called Oestradiol (E2) is the strongest of the three estrogens and an important player in the female reproductive system and the most common type for women of childbearing age. While men and women have estradiol, and it has a role in both of their bodies, women have much higher levels of the hormone than men.
Estradiol has several functions in the female body. Its main function is to mature and then maintain the reproductive system. During the menstrual cycle, increased estradiol levels cause the maturation and release of the egg, as well as the thickening of the uterus lining to allow a fertilized egg to implant. The hormone is made primarily in the ovaries, so levels decline as women age and decrease significantly during menopause. In men, proper estradiol levels help with bone maintenance, nitric oxide production, and brain function. While men need lower levels than women, they still require this important hormone to function well.
PROGESTERONE¶
Progesterone (PubChem NCBI) | Snippet from Endocrine.org about progesterone in humans:
Progesterone is a steroid hormone belonging to a class of hormones called progestogens. It is secreted by the corpus luteum, a temporary endocrine gland that the female body produces after ovulation during the second half of the menstrual cycle.
Progesterone prepares the endometrium for the potential of pregnancy after ovulation. It triggers the lining to thicken to accept a fertilized egg. It also prohibits the muscle contractions in the uterus that would cause the body to reject an egg. While the body is producing high levels of progesterone, the body will not ovulate. If the woman does not become pregnant, the corpus luteum breaks down, lowering the progesterone levels in the body. This change sparks menstruation. If the body does conceive, progesterone continues to stimulate the body to provide the blood vessels in the endometrium that will feed the growing fetus. The hormone also prepares the lining of the uterus further so it can accept the fertilized egg. Once the placenta develops, it also begins to secrete progesterone, supporting the corpus luteum. This causes the levels to remain elevated throughout the pregnancy, so the body does not produce more eggs. It also helps prepare the breasts for milk production.
❯❯ MONITORING HORMONE LEVELS¶
Two methods of testing estradiol & progesterone levels in felines are used in the dataset:
ENDOCRINE MONITORING¶
Endocrine Tests @Topdoctors.co.uk
... In many cases, urine and blood tests are used to check your hormone levels; in some cases, imaging tests are done to pinpoint or locate a tumor or other abnormalities that may be affecting the endocrine glands.
- A more invasive approach than fecal protein assessment, so it is mostly used for domesticated felines
Elevated levels during pregnancy
- Common & quite straightforward method for diagnosing pregnancy in most mammals;
- detection of elevated circulating progesterone & FPM concentrations
- In felids, circulating progesterone & FPM are highly elevated during pregnancy, although peak concentrations vary significantly between different species.
NPLP & PLP Distinguishability Difficulty
- The use of progesterone or FPM concentrations for detecting pregnancy is possible but complicated by the potential for prolonged non-pregnant luteal phases (NPLP)
- Progesterone levels are similar during both NPLP and pregnant luteal phases (PLP)
- NPLPs are approx. half the duration of PLPs for most felids, thus pregnancy can be confirmed by detecting elevated progesterone or FPM concentrations during the latter half of gestation
- Progesterone-based assessments are further complicated by the potential for temporary mid-gestational decreases in progesterone concentrations, which are thought to be associated with a switch from luteal to placental progesterone production, leading to false-negative pregnancy diagnoses
- Neither circulating progesterone nor FPM concentrations can be used for pregnancy detection in Lynx:
- Lynx exhibit an abnormally long NPLP and PLP, with corpora lutea (CL) persisting for at least two years
- Meaning that NPLP & PLP cannot be distinguished in Lynx using progesterone or FPM assessments alone
FECAL PROTEIN ASSESSMENTS¶
- In felids, estradiol & progesterone metabolites are almost exclusively excreted through feces
- FEM, FPM are indirect and noninvasive means of monitoring blood concentrations of these hormones
- Fecal estrogens and progestins tend to fluctuate less than circulating estradiol & progesterone
- However, extraction of progesterone and estradiol metabolites from feces is much slower than endocrine tests; nevertheless, it's still the method commonly used for wild felines
❯❯ MACHINE LEARNING APPLICATION¶
The Complex Nature of Hormonal Changes
- Female felines, just like women, go through reproductive cycles during which the body adapts to changes & hormone levels constantly fluctuate
- The idea here is to use hormone measurement data that have already been classified by humans, and build a unified model that can classify which phase of the cycle the feline of interest is in at the moment of testing, whatever the testing method.
- This in itself is a very complex task, as felines can have very different hormonal backgrounds/profiles, not to mention the subtle variations that can occur as a result of attempting to obtain the samples in the first place (when doing invasive-type testing)
- As mentioned above, there is also difficulty in distinguishing between the NPLP and PLP phases, which both occur during the diestrus phase:
- NPLP, as the name implies, is not a pregnant phase,
- Whereas PLP is, as the name suggests, the pregnant luteal phase
A Classification Problem
- The article, Monitoring ovarian function and detecting pregnancy in felids: a review, from which this dataset was obtained, focused on the need to develop methodologies for monitoring of estrus and pregnancy in felines, mostly due to the need for noninvasive testing methods for wild felines
- This isn't the aim of this notebook, but we can utilise the data for the purpose of applying machine learning methods to an interesting application: classification of the estrous cycle phase based on feline hormonal data (estradiol & progesterone), or simply ovarian phase classification
Notebook Aim
- So the aim is to build a classification model that unifies the different testing approaches (fem, fpm, serum) and can correctly classify, based on either progesterone or estradiol levels in whichever combination (min, max, mean), which phase of the estrous cycle the feline is in at the moment of testing
- The reasons a unified method may be of interest:
- Insufficient data can result in inconsistent models each time one is trained
- There should be a correlation between methods (definitely serum & fecal testing)
- Any new methodology & subsequent feature extraction can complement one another, taking in more information about the entire process
Purpose of such a model
- Such a model can help veterinarians quite rapidly confirm their own diagnosis (through whatever method they were taught to use), or at least be a tool with which to question their own diagnosis.
- This particular application may not be the most practical, as I don't work in the industry, but it serves the purpose outlined below:
- In essence, we'll aim to show that machine learning models can capture tendencies & relations in data that can be hard to notice with the naked eye, which is why their use can help solve many problems in veterinary science; this is one of many possible applications.
❯❯ RELATED LINKS¶
❯❯ LIST OF FEATURES¶
The features are divided into two main groups / four subgroups of results:
0-4 : Plasma or serum concentrations
- 0-1 : Circulating Estradiol (pg/ml)
- 0 : Anoestrus/Interestrus Basal
- 1 : Estrus (Peak)
- 2-4 : Circulating Progesterone (ng/ml)
- 2 : Basal (i.e. not diestrus)
- 3 : Diestrus/luteal phase (Peak) | Non-pregnant luteal phase NPLP
- 4 : Diestrus/luteal phase (Peak) | Pregnant luteal phase PLP
5-9 : Fecal metabolites
- 5-6 : Fecal estradiol metabolites (FEM) (ng/g)
- 5 : Anoestrus or interestrus Basal
- 6 : Estrus (Peak)
- 7-9 : Fecal progesterone metabolites (FPM) (µg/g)
- 7 : Basal (i.e. not diestrus)
- 8 : Diestrus/luteal phase (Peak) | Non-pregnant luteal phase NPLP
- 9 : Diestrus/luteal phase (Peak) | Pregnant luteal phase PLP
For each measurement we have 5 fields:
min, max, mean, number of animals (n) & number of samples (ns)
So from our data we are dealing with sets of summary results of either estradiol or progesterone levels from n animals, from whom the authors of the individual studies took ns samples
❯❯ ADDITIONAL DESCRIPTION¶
Some extracts from references that are relevant to this data:
Luteal Phase
- The luteal phase plays an important role in early pregnancy, as it's the time when the womb prepares for the implantation of a fertilized egg
- The luteal phase lasts from the day after ovulation until the day before your period starts
NPLP & PLP Difference - Jilian M. Fazio PhD
- A non-pregnant luteal phase (NPLP) was defined as a rise in progestogens 2.0 SD above baseline for at least fourteen days starting from the first to the last dates above baseline. The end of the luteal phase was defined as a return to baseline for at least six days
- A pregnant luteal phase (PLP) was an elevation of progestogens 2.0 SD above baseline for greater than or equal to fourteen days that resulted in live or stillbirth
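- To make the threshold rule above concrete, here is a minimal sketch (my own, not from the source) of how it could be applied to a hypothetical daily progestogen series:
# hedged sketch: flag a luteal phase when progestogens stay more than 2 SD above
# a baseline window for at least 14 consecutive days (after the definition above)
def luteal_flag(progestogens, baseline_days=30, min_days=14):
    baseline = progestogens.iloc[:baseline_days]        # assumed baseline window
    threshold = baseline.mean() + 2.0 * baseline.std()  # 2.0 SD above baseline
    elevated = progestogens > threshold
    longest_run = elevated.groupby((~elevated).cumsum()).sum().max()
    return longest_run >= min_days
# synthetic example: 30 baseline days followed by 20 elevated days -> True
# series = pd.Series(np.r_[np.random.normal(1, 0.2, 30), np.random.normal(5, 0.5, 20)])
# luteal_flag(series)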
❯❯ READ DATA¶
- The feline pregnancy dataset contains column data without explicitly stating what the features are, so having defined them in Section 2.1, we can generate short names so that it's clearer what they actually are as we work with them.
- Some minor adjustments are also made to replace certain characters, as shown below; in some cases "<1" is used, which here is assumed to be 0.5.
# read the data
df = pd.read_csv('/kaggle/input/feline-pregnancy/feline_pregnancy.csv',delimiter=';')
# generate names (as csv uses abbreviations)
features = ['e-basal','e-estrus','p-basal','p-nplp','p-plp',
'fem-basal','fem-estrus','fpm-basal','fpm-nplp','fpm-plp']
stat = ['mean','min','max','n','ns']
new_names = ['linage','species']
for i in features:
for j in stat:
new_names.append(i+j)
# change some minor things
df.columns = new_names # set column names to those we generated
df = df.apply(lambda x: x.str.replace(',','.'))
df = df.apply(lambda x: x.str.replace('~','')) # approximation sign
feline_progest = df
# some data uses <1, so let's approximate it
feline_progest = feline_progest.apply(lambda x: x.str.replace('<1','0.5'))
# convert measurement columns to float
# (note: assigning back through .iloc can leave the reported dtypes as object, as seen
#  in .info() below; the melted 'value' column is cast again explicitly before modelling)
feline_progest.iloc[:,2:] = feline_progest.iloc[:,2:].astype('float')
# check if rows are na
# feline_progest['is_na'] = df[df.columns].isnull().apply(lambda x: all(x), axis=1)
# feline_progest['is_na'].value_counts()
- The dataset contains a combination of the features outlined in Section 2.1, totalling 52 columns (2 identifiers + 50 measurement features), which is quite high in comparison to the number of entries:
- Features 2-11 are measurements of estradiol in circulating blood during basal/estrus phases
- Features 12-26 are measurements of progesterone in circulating blood during basal/diestrus phases
- Features 27-36 are measurements of fecal estradiol metabolites during basal/estrus phases
- Features 37-51 are measurements of fecal progesterone metabolites during basal/diestrus phases
- We have quite a few gaps in our dataset, especially for circulating blood data, which indicates that endocrine monitoring was not done as consistently as fecal protein assessments and less data was provided by individual authors
- From the first few data points shown in .head() alone, we can note that we have various combinations of missing data
feline_progest.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 109 entries, 0 to 108 Data columns (total 52 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 linage 109 non-null object 1 species 109 non-null object 2 e-basalmean 18 non-null object 3 e-basalmin 19 non-null object 4 e-basalmax 19 non-null object 5 e-basaln 19 non-null object 6 e-basalns 18 non-null object 7 e-estrusmean 14 non-null object 8 e-estrusmin 13 non-null object 9 e-estrusmax 13 non-null object 10 e-estrusn 16 non-null object 11 e-estrusns 15 non-null object 12 p-basalmean 27 non-null object 13 p-basalmin 20 non-null object 14 p-basalmax 20 non-null object 15 p-basaln 25 non-null object 16 p-basalns 25 non-null object 17 p-nplpmean 16 non-null object 18 p-nplpmin 14 non-null object 19 p-nplpmax 14 non-null object 20 p-nplpn 14 non-null object 21 p-nplpns 16 non-null object 22 p-plpmean 10 non-null object 23 p-plpmin 7 non-null object 24 p-plpmax 7 non-null object 25 p-plpn 10 non-null object 26 p-plpns 10 non-null object 27 fem-basalmean 40 non-null object 28 fem-basalmin 30 non-null object 29 fem-basalmax 30 non-null object 30 fem-basaln 40 non-null object 31 fem-basalns 41 non-null object 32 fem-estrusmean 44 non-null object 33 fem-estrusmin 41 non-null object 34 fem-estrusmax 41 non-null object 35 fem-estrusn 44 non-null object 36 fem-estrusns 46 non-null object 37 fpm-basalmean 37 non-null object 38 fpm-basalmin 28 non-null object 39 fpm-basalmax 27 non-null object 40 fpm-basaln 41 non-null object 41 fpm-basalns 41 non-null object 42 fpm-nplpmean 38 non-null object 43 fpm-nplpmin 39 non-null object 44 fpm-nplpmax 39 non-null object 45 fpm-nplpn 38 non-null object 46 fpm-nplpns 39 non-null object 47 fpm-plpmean 27 non-null object 48 fpm-plpmin 22 non-null object 49 fpm-plpmax 22 non-null object 50 fpm-plpn 26 non-null object 51 fpm-plpns 27 non-null object dtypes: object(52) memory usage: 44.4+ KB
pd.set_option('display.max_columns', None)
feline_progest.head()
linage | species | e-basalmean | e-basalmin | e-basalmax | e-basaln | e-basalns | e-estrusmean | e-estrusmin | e-estrusmax | e-estrusn | e-estrusns | p-basalmean | p-basalmin | p-basalmax | p-basaln | p-basalns | p-nplpmean | p-nplpmin | p-nplpmax | p-nplpn | p-nplpns | p-plpmean | p-plpmin | p-plpmax | p-plpn | p-plpns | fem-basalmean | fem-basalmin | fem-basalmax | fem-basaln | fem-basalns | fem-estrusmean | fem-estrusmin | fem-estrusmax | fem-estrusn | fem-estrusns | fpm-basalmean | fpm-basalmin | fpm-basalmax | fpm-basaln | fpm-basalns | fpm-nplpmean | fpm-nplpmin | fpm-nplpmax | fpm-nplpn | fpm-nplpns | fpm-plpmean | fpm-plpmin | fpm-plpmax | fpm-plpn | fpm-plpns | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 3.0 | 3.0 | 25.8 | NaN | NaN | 5.0 | 12.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Domestic cat | Domestic cat | 8.1 | 4.3 | 11.9 | 4.0 | 12.0 | 59.5 | 46.1 | 72.9 | 4.0 | 13.0 | 0.5 | NaN | NaN | NaN | NaN | 24.6 | 19.0 | 31.0 | 4.0 | 4.0 | 34.9 | 29.0 | 41.0 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Domestic cat | Domestic cat | 11.7 | 6.9 | 16.5 | 39.0 | 106.0 | NaN | 50.0 | 70.0 | 39.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.2 | 0.1 | 0.3 | 7.0 | 32.0 | 17.2 | 9.0 | 25.0 | 7.0 | 12.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.5 | NaN | NaN | NaN | NaN | NaN | 30.9 | 87.8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
❯❯ MISSING DATA¶
- This dataset contains measurement data of estradiol & progesterone for both serum & fecal samples, however the entries are highly inconsistent and often contain only one or two of the (min, max, mean) readings, as mentioned previously.
- As a result, we'll need to melt the data into the specific phase groups (basal, estrus, nplp & plp) in order to have enough data for creating a phase-prediction model, since that is ultimately our goal in this notebook
- As we can see from the data below, serum-based sampling tends to have the highest amount of missing data, as opposed to fecal measurements, for which we have much more consistent data.
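- Besides the bar chart below, a quick way to count how many distinct missing-value patterns exist is sketched here (my own addition, not part of the original notebook):
# count the distinct patterns of missing values across the measurement columns
patterns = (feline_progest.drop(columns=['linage','species'])
                          .isna()
                          .apply(tuple, axis=1)
                          .value_counts())
print(f'{patterns.size} distinct missingness patterns across {len(feline_progest)} rows')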
# Function that plots the number of NaN data in the entire dataset
def plot_na(df):
series1 = df.isna().sum()
series2 = df.notnull().sum()
series3 = series1 + series2
series1.name = 'NaN'
series3.name = 'All Data'
fig = go.Figure(data=[px.bar(series1)['data'][0],
px.bar(series3)['data'][0]])
fig.update_layout(template='plotly_white',height=300,
font=dict(family='sans-serif',size=12))
fig.update_layout(showlegend=False,title='NaN Distribution in Entire Dataset')
fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
marker_line_width=1.5, opacity=0.4)
fig.show('svg',dpi=300)
plot_na(feline_progest)
❯❯ SAMPLE SIZES & SPECIES¶
- The individual studies we are dealing with provide group statistics rather than individual samples (unless of course ns=1, which does occur in the dataset), so our features essentially indicate bounds for each linage & species combination, & we can extract individual samples only to some extent (the min & max cases for certain)
- Our mean, min & max will have been influenced by the number of samples and animals in each study, however as we don't have the individual samples, there seems to be little use for this information
- Visually, we can see/confirm below that there is quite a substantial variation among the different species & linage combinations
import seaborn as sns; sns.set(style='whitegrid')
''' Plot side-by-side heatmaps using seaborn '''
def plotlyoff_heatmap(hm,size=None):
fig,ax = plt.subplots(ncols=2,figsize=(15,7),)
sns.heatmap(hm[0],ax=ax[0],annot=False)
sns.heatmap(hm[1],ax=ax[1],annot=False)
ax[0].set_title("Number of Felines")
ax[1].set_title("Number of Samples")
plt.tight_layout()
plt.show()
sample_sizes = feline_progest[['linage','species','e-basaln','e-basalns','e-estrusn',
'e-estrusns','p-basaln','p-basalns','p-plpn','p-plpns',
'p-nplpn','p-nplpns','fem-basaln','fem-basalns',
'fem-estrusn','fem-estrusns','fpm-basaln','fpm-basalns',
'fpm-nplpn','fpm-nplpns','fpm-plpn','fpm-plpns']]
# sample_sizes.groupby('species').max()
pt = pd.pivot_table(sample_sizes,index=['species','linage'])
pt.index=pt.index.get_level_values(0)+"("+pt.index.get_level_values(1)+")" # merge multindex
n_pt = pt.iloc[:,::2] # number of animals
ns_pt = pt[['e-basalns','e-estrusns','p-basalns','p-plpns','p-nplpns',
'fem-basalns','fem-estrusns','fpm-basalns','fpm-nplpns',
'fpm-plpn','fpm-plpns']]
plotlyoff_heatmap([n_pt,ns_pt])
- We'll also drop the features relating to sample size & number of animals in each study (ns & n)
- Although there may be some benefit in including them, it seems they aren't really necessary for this problem
# Drop Sample number data
feline_progest.drop(['e-basaln','e-basalns','e-estrusn','e-estrusns',
'p-basaln','p-basalns','p-plpn','p-plpns','p-nplpn','p-nplpns',
'fem-basaln','fem-basalns','fem-estrusn','fem-estrusns',
'fpm-basaln','fpm-basalns','fpm-nplpn','fpm-nplpns',
'fpm-plpn','fpm-plpns'],axis=1,inplace=True)
# Our premodel dataset
feline_progest.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 109 entries, 0 to 108 Data columns (total 32 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 linage 109 non-null object 1 species 109 non-null object 2 e-basalmean 18 non-null object 3 e-basalmin 19 non-null object 4 e-basalmax 19 non-null object 5 e-estrusmean 14 non-null object 6 e-estrusmin 13 non-null object 7 e-estrusmax 13 non-null object 8 p-basalmean 27 non-null object 9 p-basalmin 20 non-null object 10 p-basalmax 20 non-null object 11 p-nplpmean 16 non-null object 12 p-nplpmin 14 non-null object 13 p-nplpmax 14 non-null object 14 p-plpmean 10 non-null object 15 p-plpmin 7 non-null object 16 p-plpmax 7 non-null object 17 fem-basalmean 40 non-null object 18 fem-basalmin 30 non-null object 19 fem-basalmax 30 non-null object 20 fem-estrusmean 44 non-null object 21 fem-estrusmin 41 non-null object 22 fem-estrusmax 41 non-null object 23 fpm-basalmean 37 non-null object 24 fpm-basalmin 28 non-null object 25 fpm-basalmax 27 non-null object 26 fpm-nplpmean 38 non-null object 27 fpm-nplpmin 39 non-null object 28 fpm-nplpmax 39 non-null object 29 fpm-plpmean 27 non-null object 30 fpm-plpmin 22 non-null object 31 fpm-plpmax 22 non-null object dtypes: object(32) memory usage: 27.4+ KB
❯❯ FELINE SPECIES DISTRIBUTION¶
- The dataset contains estradiol & progesterone data not only for domesticated felines, but also for non-domesticated cats such as Pantheras, Pumas, Ocelots, Leopards, Lynx, Caracals & Bay Cats
- So despite the large portion of domestic cats, most of the dataset contains data for wild felines, as shown in the graph below
# function to show value_counts using plotly
def plot_count(df,feature,orie='v',h=400):
series = df[feature].value_counts()
fig = px.bar(series,orientation=orie,color='value')
fig.update_layout(template='plotly_white',height=h,
font=dict(family='sans-serif',size=12))
fig.update_layout(showlegend=False)
fig.update_traces(marker_color='rgb(158,202,225)',
marker_line_color='rgb(8,48,107)',
marker_line_width=1.5, opacity=0.6)
fig.update_traces(width=0.75)
fig.show('svg',dpi=300)
plot_count(feline_progest,'linage',orie='h',h=300)
plot_count(feline_progest,'species',orie='v',h=400)
❯❯ ESTRADIOL LEVEL DISTRIBUTION¶
- Inspecting the estradiol levels for different linage we can note some quite clear tendencies for the mean values
- All felines have increased estradiol levels during estrus, which was expected; the only variation is that the levels were quite different between species
- The same tendencies were observed for both serum and fecal data subsets
- The levels for both maximum & minimum levels of estradiol:
- Followed similar trends, however in very few cases, there have been dips in values during estrus as seen in the leopard data
- Such abnormalities of course can catch the model off guard, as the entire dataset follows an increasing trend
- Another observation can be made about outliers:
There are very few cases that fall outside the lower & upper fence levels; as can be seen, some levels for domesticated felines are abnormally high
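- As an aside, the box-plot fences referred to here are the usual Tukey fences (1.5×IQR beyond the quartiles); a small sketch of how they could be computed for any one measurement column (my own illustration, the column name is just an example):
# flag values outside the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for one column
col = pd.to_numeric(feline_progest['e-estrusmean'], errors='coerce').dropna()
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = col[(col < q1 - 1.5*iqr) | (col > q3 + 1.5*iqr)]
print(outliers)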
# function to show data distribution using plotly boxplot
def plot_strip(ldf,features,plot_id='box',
color=None,title=None):
tdf = ldf[features]
del tdf['species']
if(plot_id == 'box'):
fig = px.box(tdf,orientation='v',color=color,facet_col=color,
color_discrete_sequence= px.colors.sequential.Plasma_r,
facet_col_wrap =4)
elif(plot_id == 'strip'):
fig = px.strip(tdf,orientation='v',color=color)
fig.update_layout(template='plotly_white',height=700,
title=f'{title}',
font=dict(family='sans-serif',size=12))
fig.update_traces(width=0.25)
fig.update_layout(showlegend=False)
fig.update_traces(marker_color='#056293', marker_line_color='rgb(8,48,107)',
marker_line_width=1.5, opacity=0.7)
fig.show('svg',dpi=300)
main_lst = feline_progest.columns.tolist()
lst = map(lambda x:main_lst[x],[0,1,
2,5,
3,6,
4,7])
plot_strip(feline_progest,lst,'box',
'linage','Serum Concentrations - Estradiol (pg/ml)')
main_lst = feline_progest.columns.tolist()
lst = map(lambda x:main_lst[x],[0,1,
17,20,
18,21,
19,22])
plot_strip(feline_progest,lst,'box',
'linage','Fecal Metabolites (FEM) - Estradiol (ng/g)')
❯❯ PROGESTERONE LEVEL DISTRIBUTION¶
- Next we'll look at the progesterone levels, for this hormone subset we have three phases that were recorded; basal & diestrus (nplp/plp), so estrus was not included
- Like elevated estradiol levels during estrus, progesterone levels are also elevated in post basal phases (nplp,plp)
- Unlike estradiol, here we have quite a bit more variety when it comes to plp & nplp levels, for different linage variations, looking at the mean values:
- Domesticated felines tend to have slightly higher levels of progesterone during nplp than plp in serum
- Lynx on the other hand tend to have higher progesterone levels during plp in serum (so the other way round)
- Panthera tend to follow the same trend as domesticated felines
Fecal data provided some insight into some other linages:
- Leopard cats, similar to Lynx, tend to have larger plp values compared to nplp; Puma also follow this trend
- It can be noted that these trend variations are very subtle; plp & nplp values are not too different
- This seems to suggest that there can be some difficulty distinguishing between these two phases if we were to look at progesterone concentrations alone
- We can also note that we have clear gaps in the serum data for leopard, caracal & bay cat; only fecal data was collected for them
main_lst = feline_progest.columns.tolist()
lst = map(lambda x:main_lst[x],[0,1,
8,11,14,
9,12,15,
10,13,16])
plot_strip(feline_progest,lst,'box',
'linage','Serum Concentrations - Progesterone (ng/ml)')
main_lst = feline_progest.columns.tolist()
lst = map(lambda x:main_lst[x],[0,1,
23,26,29,
24,27,30,
25,28,31])
plot_strip(feline_progest,lst,'box',
'linage','Fecal Metabolites (FPM) - Progesterone (µg/g)')
❯❯ REARRANGING DATA¶
Compiling data from different data sources:
- We are left with a lot of variation in the recorded data, and thus lots of NaNs, since the available data is not consistent across the different sources
- We can have any combination of (min, max, mean), but what we are measuring is the same content: estradiol or progesterone, from serum or feces.
- Thus we can simply melt our data & divide it based on the phase it was allocated to (basal, estrus, ...); the actual phase is in the column name.
- The rearranged data is then simply one-hot encoded, so it's not relevant which measurement (min, max, mean) was recorded; a generic sketch of this melt & encode step is shown below, and the model sections that follow repeat it inline for each subset
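- As a rough illustration (my own helper, not the notebook's code), the melt, phase-labelling & one-hot encoding step could be wrapped up as follows:
# hedged sketch: melt a subset of measurement columns, label each row with its phase
# (inferred from the column name), then one-hot encode the categorical columns
def melt_and_encode(df, columns, phase_of):
    # phase_of maps a column-name fragment to a phase label,
    # e.g. {'-basal': 'basal', '-estrus': 'estrus'} (illustrative)
    molten = pd.melt(df[['linage','species'] + columns], id_vars=['linage','species'])
    for fragment, phase in phase_of.items():
        molten.loc[molten['variable'].str.contains(fragment), 'id'] = phase
    molten = molten.dropna()
    return pd.get_dummies(molten, columns=['linage','species','variable'])
# e.g. the serum-estradiol subset used for the first model below:
# cols = ['e-basalmean','e-basalmin','e-basalmax','e-estrusmean','e-estrusmin','e-estrusmax']
# ohe = melt_and_encode(feline_progest, cols, {'-basal': 'basal', '-estrus': 'estrus'})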
❯❯ LIST OF MODELS¶
- As mentioned in Section 2.1, we have two main groups (serum,fecal) & 4 subgroups (serum-e,serum-p,fecal-e,fecal-p) in total
- We'll first look at making models for each subgroup of data & then make combined models of subgroups
Subgroup Models
- First model should be straightforward; binary classification between basal & estrus phases using serum estradiol data
- Second model should be more challenging; multiclass classification between basal & diestrus phases (nplp, plp) using serum progesterone data
- Third model should be straightforward; binary classification between basal & estrus phases, using fecal estradiol data this time
- Fourth model; multiclass classification between basal & diestrus phases (nplp, plp) again, but this time using fecal progesterone data
Grouped Models
- The fifth model; multiclass classification between all available basal & diestrus (nplp, plp) phase progesterone data
❯❯ EVALUATION FUNCTION¶
model_eval¶
- As with most of my other notebooks, I like to create a unified evaluation utility that can be reused across different problems, but I've opted for a simple function for the time being
- model_eval can be used to evaluate the model with a train/test split approach, a little simpler than e.g. k-fold CV (CV is of course important for any ML problem), which I used in another notebook
- We'll add k-fold cross validation for the grouped models, as there it is a little more viable to split the data into more than two groups & it is definitely desirable to cross-validate the models (a rough sketch is shown right after the function below)
model_eval function parameters:
- data : Feature matrix & target variable DataFrame
- ts : Train/test split ratio
- target : Target feature in the DataFrame
- clf : Classifier
- clf_name : Evaluation case name (for a unique model save)
- classif_id : Classification type (could just have automated it)
- show_id : Evaluation options; (conf, roc)
Evaluation Metric
- We'll be using very standard metrics for both binary and multiclass classification; The confusion matrix and ROC curves
- Plotly is of course my preferred way of plotting things & they recently updated their plot library to include ROC & PR Curves, which is quite a helpful reference for ML beginners
# Standard Train/Test Split Validation
def model_eval(data, # data input
ts=0.3, # train/test split ratio
target = 'id', # target feature
clf=None, # model list
clf_name='model',
classif_id='binary', # type of classification
show_id=['roc']): # output options
# train/test split
y = data[target]
X = data.drop(target,axis=1)
if(clf is not None):
clf = clf[1]
# default model if no model is selected
else:
clf = RandomForestClassifier(max_depth=10,
random_state=0)
# Train/Test split our dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,
test_size=ts)
# Show the split distribution
print(f'Training Samples: {X_train.shape[0]}')
print(f'Test Samples: {X_test.shape[0]}')
# train model
clf.fit(X_train,y_train)
clf.save_model(f"{clf_name}") # note: save_model assumes a CatBoost-style classifier
# predict on training data & test data
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
# Evaluate Metrics
# print("Accuracy:",metrics.accuracy_score(y_train, y_pred_train))
# print("Accuracy:",metrics.accuracy_score(y_test, y_pred_test))
# Plot Confusion Matrix for Training / Test Data
if('conf' in show_id):
data1 = confusion_matrix(y_train,y_pred_train)
data2 = confusion_matrix(y_test,y_pred_test)
''' Plot side-by-side confusion-matrix heatmaps using seaborn '''
def plotlyoff_heatmap(hm,size=None):
fig,ax = plt.subplots(ncols=2,figsize=(8,4),)
sns.heatmap(hm[0],ax=ax[0],annot=True)
sns.heatmap(hm[1],ax=ax[1],annot=True)
ax[0].set_title("Training Confusion Matrix")
ax[1].set_title("Test Confusion Matrix")
plt.tight_layout()
plt.show()
data1 = pd.DataFrame(data1)
data1.index = clf.classes_
data1.columns = clf.classes_
data2 = pd.DataFrame(data2)
data2.index = clf.classes_
data2.columns = clf.classes_
plotlyoff_heatmap([data1,data2])
# Plot ROC Curves for Training / Test Data
if('roc' in show_id):
fig = make_subplots(rows=1,cols=2,subplot_titles=['Train','Test'])
if(classif_id == 'binary'):
y_score_train = clf.predict_proba(X_train)[:, 1]
y_score_test = clf.predict_proba(X_test)[:, 1]
ii=-1; iii=0
lst_in_X = [X_train,X_test]
lst_in_y = [y_train,y_test]
lst_subgroup = [y_score_train,y_score_test]
lst_name = ['train','test']
for group in lst_subgroup:
ii+=1;iii+=1
y_true = lst_in_y[ii]
y_score = lst_subgroup[ii]
y_true = y_true.map({'basal': 0, 'estrus':1}) # binary case assumes basal/estrus labels
fpr, tpr, _ = roc_curve(y_true, y_score)
auc_score = roc_auc_score(y_true, y_score)
name = f"({lst_name[ii]} AUC={auc_score:.2f})"
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines',name=name),
col=iii,row=1)
fig.add_shape(type='line', line=dict(dash='dash'),
x0=0, x1=1, y0=0, y1=1,col=iii,row=1)
else:
y_score_train = clf.predict_proba(X_train)
y_score_test = clf.predict_proba(X_test)
ii=-1; iii=0
lst_in_X = [X_train,X_test]
lst_in_y = [y_train,y_test]
lst_subgroup = [y_score_train,y_score_test]
for group in lst_subgroup:
ii+=1; iii+=1
y_onehot = pd.get_dummies(lst_in_y[ii], columns=clf.classes_)
clf.predict_proba(lst_in_X[ii])
# Multiclass ROC Curves
iiii=-1
lst_colour = ['#1C76A5','#69BCE7','#BBE1F5']
for i in range(group.shape[1]):
iiii+=1
y_true = y_onehot.iloc[:, i]
y_score = group[:, i]
fpr, tpr, _ = roc_curve(y_true, y_score)
auc_score = roc_auc_score(y_true, y_score)
name = f"{y_onehot.columns[i]} (AUC={auc_score:.2f})"
fig.add_trace(go.Scatter(x=fpr, y=tpr,
line=dict(color=f"{lst_colour[iiii]}"),
name=name, mode='lines'),col=iii,row=1)
fig.add_shape(type='line', line=dict(dash='dash'),x0=0,
x1=1, y0=0, y1=1,col=iii,row=1)
# Plot Aesthetics
fig.update_xaxes(title_text=f'False Positive Rate',
row=1, col=1, scaleanchor="x", scaleratio=1)
fig.update_xaxes(title_text=f'False Positive Rate',
row=1, col=2, scaleanchor="x", scaleratio=1)
fig.update_yaxes(title_text=f'True Positive Rate',
row=1, col=1, constrain='domain')
fig.update_yaxes(title_text=f'True Positive Rate',
row=1, col=2, constrain='domain')
fig.update_layout(template='plotly_white',height=400)
fig.update_layout(title=f"CatBoost Classifier | ROC Curve")
fig.show('svg',dpi=300)
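- For the grouped models mentioned above, a k-fold variant of this evaluation could look roughly like the following sketch (my own, using StratifiedKFold; it only reports mean accuracy and is not the notebook's final implementation):
from sklearn.model_selection import StratifiedKFold
def model_eval_cv(data, target='id', n_splits=5):
    # hedged sketch of a stratified k-fold evaluation with CatBoost
    y = data[target]
    X = data.drop(target, axis=1)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = CatBoostClassifier(silent=True, n_estimators=25)
        clf.fit(X.iloc[train_idx], y.iloc[train_idx])
        preds = np.ravel(clf.predict(X.iloc[test_idx]))  # flatten in case of a column vector
        scores.append(metrics.accuracy_score(y.iloc[test_idx], preds))
    print(f'mean accuracy over {n_splits} folds: {np.mean(scores):.3f}')
# example usage on the melted & encoded frames built below, e.g.:
# model_eval_cv(molten_ohe5)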
❯❯ SUBSET MODELS¶
❯❯❯ SERUM MODEL 1 : BASAL | ESTRUS CLASSIFICATION¶
For the first model:
- Using the serum subset to build a binary classifier
- We'll be using the estradiol data in our first model; data from circulating blood
- We want to create a model that can classify, from the limited features available, whether the feline is in the basal or estrus phase of the estrous cycle
# select the relevant data from the main dataframe
upd = feline_progest.iloc[:,0:8]
display(upd.head())
linage | species | e-basalmean | e-basalmin | e-basalmax | e-estrusmean | e-estrusmin | e-estrusmax | |
---|---|---|---|---|---|---|---|---|
0 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Domestic cat | Domestic cat | 8.1 | 4.3 | 11.9 | 59.5 | 46.1 | 72.9 |
2 | Domestic cat | Domestic cat | 11.7 | 6.9 | 16.5 | NaN | 50.0 | 70.0 |
3 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN |
# melt but keep some column values
molten = pd.melt(upd,
id_vars=['linage','species'])
molten.loc[(molten['variable'] == 'e-basalmean') |
(molten['variable'] == 'e-basalmin') |
(molten['variable'] == 'e-basalmax')
, 'id'] = 'basal'
molten.loc[(molten['variable'] == 'e-estrusmean') |
(molten['variable'] == 'e-estrusmin') |
(molten['variable'] == 'e-estrusmax')
, 'id'] = 'estrus'
molten.dropna(inplace=True)
molten_ohe = pd.get_dummies(molten,columns=['linage','species','variable'])
models = []
models.append(('CAT',CatBoostClassifier(silent=True,
n_estimators=25)))
model_eval(molten_ohe,
clf=models[0],
clf_name='model1',
show_id=['roc','conf'])
Training Samples: 67 Test Samples: 29
SUMMARY : MODEL 1¶
- Well, it's quite conclusive that the model can very easily distinguish between the two phases (basal, estrus) for different kinds of cat species from the serum subset
- Given enough data, a human can quite easily distinguish between the two phases as well, as the elevated estradiol levels were clearly visible during the estrus phase (Section 3)
- So creating a model to distinguish between basal & estrus phases is not really necessary; more importantly, there are other phases of the estrous cycle that need to be taken into account, as estradiol levels are also elevated during those phases, and if an expert hadn't labelled these elevated levels as occurring during the estrus phase, we could easily have mistaken them for diestrus and so on.
❯❯❯ SERUM MODEL 2: BASAL | NPLP | PLP CLASSIFICATION¶
For the second model:
- Using serum data again, but this time we turn our attention to progesterone concentrations in circulating blood.
- For the second model, we'll be using the progesterone data obtained from serum & create a model that can classify between three states this time (basal, nplp & plp)
# Select the relevant data
upd2 = feline_progest.iloc[:,np.r_[0:2, 8:17]]
upd2.head()
linage | species | p-basalmean | p-basalmin | p-basalmax | p-nplpmean | p-nplpmin | p-nplpmax | p-plpmean | p-plpmin | p-plpmax | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Domestic cat | Domestic cat | 1.0 | NaN | NaN | 25.8 | NaN | NaN | NaN | NaN | NaN |
1 | Domestic cat | Domestic cat | 0.5 | NaN | NaN | 24.6 | 19.0 | 31.0 | 34.9 | 29.0 | 41.0 |
2 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Domestic cat | Domestic cat | 0.2 | 0.1 | 0.3 | 17.2 | 9.0 | 25.0 | NaN | NaN | NaN |
4 | Domestic cat | Domestic cat | 0.5 | NaN | NaN | NaN | 30.9 | 87.8 | NaN | NaN | NaN |
# melt but keep some column values
molten = pd.melt(upd2,
id_vars=['linage','species'])
molten.loc[(molten['variable'] == 'p-basalmean') |
(molten['variable'] == 'p-basalmin') |
(molten['variable'] == 'p-basalmax')
, 'id'] = 'basal'
molten.loc[(molten['variable'] == 'p-nplpmean') |
(molten['variable'] == 'p-nplpmin') |
(molten['variable'] == 'p-nplpmax')
, 'id'] = 'nplp'
molten.loc[(molten['variable'] == 'p-plpmean') |
(molten['variable'] == 'p-plpmin') |
(molten['variable'] == 'p-plpmax')
, 'id'] = 'plp'
molten.dropna(inplace=True)
molten_ohe = pd.get_dummies(molten,columns=['linage','species','variable'])
molten_ohe
value | id | linage_Domestic cat | linage_Lynx | linage_Ocelot | linage_Panthera | linage_Puma | species_Bobcat | species_Cheetah | species_Clouded leopard | species_Domestic cat | species_Eurasian lynx | species_Iberian lynx | species_Jagaurondi | species_Jaguar | species_Leopard | species_Lion | species_Ocelot | species_Puma | species_Snow leopard | species_Tigers | variable_p-basalmax | variable_p-basalmean | variable_p-basalmin | variable_p-nplpmax | variable_p-nplpmean | variable_p-nplpmin | variable_p-plpmax | variable_p-plpmean | variable_p-plpmin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | basal | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0.5 | basal | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0.2 | basal | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0.5 | basal | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0.5 | basal | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
878 | 30.0 | plp | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
892 | 39.9 | plp | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
930 | 27.1 | plp | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
932 | 74.4 | plp | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
934 | 168.0 | plp | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
135 rows × 30 columns
models = []
models.append(('CAT',CatBoostClassifier(silent=True,
n_estimators=25)))
model_eval(molten_ohe,
clf=models[0],
clf_name='model2',
classif_id='multi',
show_id=['roc','conf'])
Training Samples: 94 Test Samples: 41
SUMMARY : MODEL 2¶
- From the results we can see that the CatBoost model can quite easily distinguish between the three phases (basal, nplp & plp); as seen in the confusion matrices for both the training & test sets, there are no false positives, which is very encouraging for the multiclass classifier.
- Whilst it may seem far-fetched to one-hot encode each (min, mean, max) combination, which boosts the model's performance, it's worth mentioning that it's not really understood why the data entries are so inconsistent across the sources. Nevertheless, it's probably worth keeping these options (min, mean, max), as we might obtain new data in a similar format; a sketch of how a new measurement could be encoded for prediction is shown below.
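- For illustration only (hypothetical values, not part of the original notebook): a single new serum progesterone reading could be encoded to match the training columns and passed to the model saved above like this:
# hedged sketch: encode one made-up observation with the training columns, then predict
new_obs = pd.DataFrame([{'linage': 'Domestic cat', 'species': 'Domestic cat',
                         'variable': 'p-nplpmean', 'value': 22.0}])  # hypothetical reading
new_ohe = pd.get_dummies(new_obs, columns=['linage','species','variable'])
X_cols = molten_ohe.drop('id', axis=1).columns          # feature columns used in training
new_X = new_ohe.reindex(columns=X_cols, fill_value=0)   # align & zero-fill missing dummies
clf2 = CatBoostClassifier()
clf2.load_model('model2')                                # saved inside model_eval above
print(clf2.predict(new_X))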
❯❯❯ FECAL MODEL 3 : BASAL | ESTRUS CLASSIFICATION¶
For the third model:
- We'll be using only the fecal data this time for estradiol measurements (fem)
- As with model 1, we'll be making a binary classifier to distinguish between basal & estrus phases
upd3 = feline_progest.iloc[:,np.r_[0:2, 17:23]]
upd3.head()
linage | species | fem-basalmean | fem-basalmin | fem-basalmax | fem-estrusmean | fem-estrusmin | fem-estrusmax | |
---|---|---|---|---|---|---|---|---|
0 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN |
# melt but keep some column values
molten = pd.melt(upd3,
id_vars=['linage','species'])
molten.loc[(molten['variable'] == 'fem-basalmean') |
(molten['variable'] == 'fem-basalmin') |
(molten['variable'] == 'fem-basalmax')
, 'id'] = 'basal'
molten.loc[(molten['variable'] == 'fem-estrusmean') |
(molten['variable'] == 'fem-estrusmin') |
(molten['variable'] == 'fem-estrusmax')
, 'id'] = 'estrus'
molten.dropna(inplace=True)
molten_ohe = pd.get_dummies(molten,columns=['linage','species','variable'])
molten_ohe
value | id | linage_Bay cat | linage_Caracal | linage_Domestic cat | linage_Leopard cat | linage_Lynx | linage_Ocelot | linage_Panthera | linage_Puma | species_Asiatic golden cat | species_Black footed cat | species_Canadian lynx | species_Caracal | species_Cheetah | species_Clouded leopard | species_Domestic cat | species_Fishing cat | species_Jaguar | species_Leopard | species_Leopard cat | species_Lion | species_Margay | species_Pallas cat | species_Snow leopard | species_Tigers | species_tigrina | variable_fem-basalmax | variable_fem-basalmean | variable_fem-basalmin | variable_fem-estrusmax | variable_fem-estrusmean | variable_fem-estrusmin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16 | 134.0 | basal | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
17 | 127.1 | basal | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
18 | 34.3 | basal | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
22 | 38.2 | basal | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
23 | 40.0 | basal | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
646 | 2031.0 | estrus | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
649 | 15980.0 | estrus | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
650 | 2293.0 | estrus | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
651 | 250.0 | estrus | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
652 | 354.0 | estrus | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
226 rows × 33 columns
models = []
models.append(('CAT',CatBoostClassifier(silent=True,
n_estimators=25)))
model_eval(molten_ohe,
clf=models[0],
clf_name='model3',
classif_id='multi',
show_id=['roc','conf'])
Training Samples: 158 Test Samples: 68
SUMMARY : MODEL 3¶
- As with model 1, it's quite straightforward for the model to distinguish between the two classes, which was expected, as a human can do this as well
❯❯❯ FECAL MODEL 4: BASAL | NPLP | PLP CLASSIFICATION¶
For the fourth model:
- Using fecal data this time, we turn our attention to progesterone metabolite (fpm) concentrations extracted from feces
- As with the second model, we want to create a model that can classify between three states (basal, nplp & plp)
- As was seen in the serum model, the subtle differences in progesterone levels can cause some issues for the model
upd4 = feline_progest.iloc[:,np.r_[0:2, 23:32]]
upd4.head()
linage | species | fpm-basalmean | fpm-basalmin | fpm-basalmax | fpm-nplpmean | fpm-nplpmin | fpm-nplpmax | fpm-plpmean | fpm-plpmin | fpm-plpmax | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Domestic cat | Domestic cat | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# melt but keep some column values
molten = pd.melt(upd4,
id_vars=['linage','species'])
molten.loc[(molten['variable'] == 'fpm-basalmean') |
(molten['variable'] == 'fpm-basalmin') |
(molten['variable'] == 'fpm-basalmax')
, 'id'] = 'basal'
molten.loc[(molten['variable'] == 'fpm-nplpmean') |
(molten['variable'] == 'fpm-nplpmin') |
(molten['variable'] == 'fpm-nplpmax')
, 'id'] = 'nplp'
molten.loc[(molten['variable'] == 'fpm-plpmean') |
(molten['variable'] == 'fpm-plpmin') |
(molten['variable'] == 'fpm-plpmax')
, 'id'] = 'plp'
molten.dropna(inplace=True)
molten_ohe = pd.get_dummies(molten,columns=['linage','species','variable'])
molten_ohe
value | id | linage_Caracal | linage_Domestic cat | linage_Leopard cat | linage_Lynx | linage_Ocelot | linage_Panthera | linage_Puma | species_Black footed cat | species_Canadian lynx | species_Caracal | species_Cheetah | species_Clouded leopard | species_Domestic cat | species_Fishing cat | species_Jaguar | species_Leopard | species_Leopard cat | species_Lion | species_Margay | species_Pallas cat | species_Puma | species_Snow leopard | species_Tigers | species_tigrina | variable_fpm-basalmax | variable_fpm-basalmean | variable_fpm-basalmin | variable_fpm-nplpmax | variable_fpm-nplpmean | variable_fpm-nplpmin | variable_fpm-plpmax | variable_fpm-plpmean | variable_fpm-plpmin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16 | 10.1 | basal | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 9.1 | basal | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 20.3 | basal | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
22 | 2.8 | basal | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
23 | 2.6 | basal | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
970 | 28.7 | plp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
976 | 40.6 | plp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
977 | 13.8 | plp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
978 | 345.0 | plp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
979 | 166.5 | plp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
279 rows × 35 columns
models = []
models.append(('CAT',CatBoostClassifier(silent=True,
n_estimators=25)))
model_eval(molten_ohe,
clf=models[0],
clf_name='model4',
classif_id='multi',
show_id=['roc','conf'])
Training Samples: 195 Test Samples: 84
SUMMARY : MODEL 4¶
- Similar to model 2, the fecal model performs quite well and is able to distinguish between the nplp and plp phases
❯❯ UNIFIED MODELS¶
❯❯❯ UNIFIED MODEL 5: BASAL | NPLP | PLP CLASSIFICATION¶
MAIN MODEL ¶
For the fifth model:
- We'll combine both serum and fecal data; this gives us 414 rows, which is quite a bit more data
- As noted in Section 2.1, and as we saw from the plots in Sections 3.5 & 3.6, the units are of course different and the feature values differ quite substantially
- So it would be more ideal to standardise our data, something you can try yourself (one possible approach is sketched below)
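- As one possible approach (my own sketch, not applied in this notebook), the melted value column could be z-scored within each measurement type before one-hot encoding, so that serum (ng/ml) and fecal (µg/g) concentrations end up on a comparable scale:
# hedged sketch: standardise the melted 'value' column per measurement type (variable)
def standardise_per_variable(molten_df):
    out = molten_df.copy()
    out['value'] = pd.to_numeric(out['value'], errors='coerce')
    out['value'] = (out.groupby('variable')['value']
                       .transform(lambda v: (v - v.mean()) / v.std()))
    return out
# example usage once molten5 (built below) exists:
# molten5_std = standardise_per_variable(molten5)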
# Select the relevant data
upd5 = feline_progest.iloc[:,np.r_[0:2, 8:17,23:32]]
# melt but keep some column values
molten5 = pd.melt(upd5,
id_vars=['linage','species'])
molten5.loc[(molten5['variable'] == 'p-basalmean') |
(molten5['variable'] == 'p-basalmin') |
(molten5['variable'] == 'p-basalmax')
, 'id'] = 'basal'
molten5.loc[(molten5['variable'] == 'p-nplpmean') |
(molten5['variable'] == 'p-nplpmin') |
(molten5['variable'] == 'p-nplpmax')
, 'id'] = 'nplp'
molten5.loc[(molten5['variable'] == 'p-plpmean') |
(molten5['variable'] == 'p-plpmin') |
(molten5['variable'] == 'p-plpmax')
, 'id'] = 'plp'
molten5.loc[(molten5['variable'] == 'fpm-basalmean') |
(molten5['variable'] == 'fpm-basalmin') |
(molten5['variable'] == 'fpm-basalmax')
, 'id'] = 'basal'
molten5.loc[(molten5['variable'] == 'fpm-nplpmean') |
(molten5['variable'] == 'fpm-nplpmin') |
(molten5['variable'] == 'fpm-nplpmax')
, 'id'] = 'nplp'
molten5.loc[(molten5['variable'] == 'fpm-plpmean') |
(molten5['variable'] == 'fpm-plpmin') |
(molten5['variable'] == 'fpm-plpmax')
, 'id'] = 'plp'
molten5.dropna(inplace=True)
molten_ohe5 = pd.get_dummies(molten5,columns=['linage','species','variable'])
molten_ohe5.value = molten_ohe5.value.astype('float')
molten_ohe5
value | id | linage_Caracal | linage_Domestic cat | linage_Leopard cat | linage_Lynx | linage_Ocelot | linage_Panthera | linage_Puma | species_Black footed cat | species_Bobcat | species_Canadian lynx | species_Caracal | species_Cheetah | species_Clouded leopard | species_Domestic cat | species_Eurasian lynx | species_Fishing cat | species_Iberian lynx | species_Jagaurondi | species_Jaguar | species_Leopard | species_Leopard cat | species_Lion | species_Margay | species_Ocelot | species_Pallas cat | species_Puma | species_Snow leopard | species_Tigers | species_tigrina | variable_fpm-basalmax | variable_fpm-basalmean | variable_fpm-basalmin | variable_fpm-nplpmax | variable_fpm-nplpmean | variable_fpm-nplpmin | variable_fpm-plpmax | variable_fpm-plpmean | variable_fpm-plpmin | variable_p-basalmax | variable_p-basalmean | variable_p-basalmin | variable_p-nplpmax | variable_p-nplpmean | variable_p-nplpmin | variable_p-plpmax | variable_p-plpmean | variable_p-plpmin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | basal | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0.5 | basal | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0.2 | basal | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0.5 | basal | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0.5 | basal | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1951 | 28.7 | plp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1957 | 40.6 | plp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1958 | 13.8 | plp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1959 | 345.0 | plp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1960 | 166.5 | plp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
414 rows × 49 columns
models = []
models.append(('CAT',CatBoostClassifier(silent=True,
n_estimators=25)))
model_eval(molten_ohe5,
clf=models[0],
clf_name='model5',
classif_id='multi',
show_id=['roc','conf'])
Training Samples: 289 Test Samples: 125
from sklearn.feature_selection import SelectKBest,f_regression
from xgboost import plot_importance,XGBRegressor
from catboost import CatBoostClassifier,CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing
import shap
import seaborn as sns
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Correlation of each feature to the target variable only
def corrMat(df,target='id',figsize=(9,0.5),ret_id=False):
    corr_mat = df.corr().round(2)
    corr_mat = corr_mat.transpose()
    corr = corr_mat.loc[:, df.columns == target].transpose().copy()
    if(ret_id):
        return corr

''' Feature Importance '''

# Various approaches for a quick FI evaluation
def fi(ldf,target='id',n_est=25,drop_id=None,target_cat=True):
    ldf = ldf.copy()
    # If the target is a categorical string variable, map it to integer codes
    if(target_cat):
        cats = ldf[target].unique()
        cats_id = [i for i in range(0,len(cats))]
        maps = dict(zip(cats,cats_id))
        ldf[target] = ldf[target].map(maps)
    # Drop any features that are not wanted
    if(drop_id is not None):
        ldf = ldf.drop(drop_id,axis=1)
    # Split the input dataframe into features & target variable
    y = ldf[target]
    X = ldf.drop(target,axis=1)

    # CORRELATION (absolute linear correlation with the target)
    imp = corrMat(ldf,target,figsize=(15,0.5),ret_id=True)
    del imp[target]
    s1 = abs(imp.squeeze(axis=0))
    s1.name = 'CORR'

    # SHAP values of a CatBoost regression model
    model = CatBoostRegressor(silent=True,n_estimators=n_est).fit(X,y)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    shap_sum = np.abs(shap_values).mean(axis=0)
    s2 = pd.Series(shap_sum,index=X.columns,name='CAT_SHAP')

    # CATBOOST built-in importances (reuse the model fitted above)
    cat_fi = pd.DataFrame(model.feature_importances_,index=X.columns,
                          columns=['CAT'])
    s3 = cat_fi.T.squeeze(axis=0)

    # RANDOMFOREST
    model = RandomForestRegressor(n_est,random_state=0,n_jobs=-1).fit(X,y)
    rf_fi = pd.DataFrame(model.feature_importances_,index=X.columns,
                         columns=['RF'])
    s4 = rf_fi.T.squeeze(axis=0)

    # XGB
    model = XGBRegressor(n_estimators=n_est,learning_rate=0.5,verbosity=0)
    model.fit(X,y)
    s5 = pd.Series(model.feature_importances_,index=X.columns,name='XGB')

    # KBEST (univariate F-test scores)
    fit = SelectKBest(k=5,score_func=f_regression).fit(X,y)
    s6 = pd.Series(fit.scores_,index=X.columns,name='KBEST')

    # Combine the scores of all approaches
    df0 = pd.concat([s1,s2,s3,s4,s5,s6],axis=1)

    # MinMax-scale each approach to [0,1] so the scores are comparable
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(df0.values)
    df = pd.DataFrame(x_scaled,index=df0.index,columns=df0.columns)
    df = df.rename_axis('<b>FI APPROACH</b>', axis=1)
    df = df.rename_axis('Feature', axis=0)

    # Plot the scaled (stacked) feature importances
    pd.options.plotting.backend = "plotly"
    fig = df.plot(kind='bar',title='<b>SCALED FEATURE IMPORTANCE</b>',
                  color_discrete_sequence=px.colors.qualitative.T10)
    fig.update_layout(template='plotly_white',height=400,
                      font=dict(family='sans-serif',size=12),
                      margin=dict(l=60, r=40, t=50, b=10))
    fig.update_traces(width=0.85)
    fig.show('svg')
INSTANT RELATIVE FEATURE IMPORTANCE ¶
- We can look at the Feature Importance (FI) of trained models to understand which features influence the target & to what extent.
- Such a minimalistic function lets us quickly evaluate feature importance by relying on a variety of approaches & optimised libraries.
- The function fi obtains relative feature importance from the following approaches:
- Linear correlation with the target (absolute value)
- SHAP values of a CatBoost regression model (n_est)
- RandomForest Regressor (n_est)
- XGBoost Regressor (n_est)
- CatBoost Regressor built-in importance (n_est)
- SelectKBest (k)
POST MODEL ADJUSTMENT ¶
- The individual scores are combined, scaled using MinMaxScaler() & plotted (a tiny illustration follows below).
- The y-axis represents the total stacked score (higher is better; the maximum equals the number of approaches).
- The x-axis represents the corresponding features of the input dataframe.
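To make the scaling step concrete, here is a tiny stand-alone illustration with made-up scores (the feature names and numbers are hypothetical, not from the dataset): each approach's column is rescaled to [0, 1] independently, so correlation coefficients and KBest F-scores end up on the same footing before being stacked.
# Tiny illustration of the MinMaxScaler step with hypothetical scores
import pandas as pd
from sklearn import preprocessing

demo = pd.DataFrame({'CORR':  [0.10, 0.80, 0.40],
                     'KBEST': [3.0, 250.0, 90.0]},
                    index=['feat_a', 'feat_b', 'feat_c'])   # hypothetical
scaled = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(demo),
                      index=demo.index, columns=demo.columns)
print(scaled)   # each column now spans 0..1; stacked bars sum to at most 2 here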
fi(molten_ohe5,target='id')
SPECIES-LINEAGE MODEL ¶
- If we rely only on the lineage, species & the progesterone values themselves, the accuracy of the model drops quite significantly
- The model has difficulty separating the nplp & plp phases, as can be seen in the multiclass ROC curves and the confusion matrix
- Adding the features that indicate fecal vs serum measurements, and whether a value is a mean, minimum or maximum, improves accuracy substantially, because they give the model a much clearer picture of where each value sits in the multidimensional distribution, structure that CatBoost readily exploits
# Keep only the value, id, lineage & species columns (drop the variable_ features)
tmolten_ohe5 = molten_ohe5.iloc[:,0:31]
print(tmolten_ohe5.columns)
models = []
models.append(('CAT',CatBoostClassifier(silent=True,
n_estimators=25)))
model_eval(tmolten_ohe5,
clf=models[0],
clf_name='tmodel5',
classif_id='multi',
show_id=['roc','conf'])
Index(['value', 'id', 'linage_Caracal', 'linage_Domestic cat', 'linage_Leopard cat', 'linage_Lynx', 'linage_Ocelot', 'linage_Panthera', 'linage_Puma', 'species_Black footed cat', 'species_Bobcat ', 'species_Canadian lynx', 'species_Caracal', 'species_Cheetah', 'species_Clouded leopard', 'species_Domestic cat', 'species_Eurasian lynx', 'species_Fishing cat ', 'species_Iberian lynx', 'species_Jagaurondi', 'species_Jaguar', 'species_Leopard', 'species_Leopard cat', 'species_Lion', 'species_Margay', 'species_Ocelot ', 'species_Pallas cat ', 'species_Puma', 'species_Snow leopard', 'species_Tigers', 'species_tigrina'], dtype='object') Training Samples: 289 Test Samples: 125
SUMMARY : MODEL 5¶
- Having combined the two approaches that measure progesterone levels in serum and in feces, we can see that the model still classifies quite well: there are very few misclassified cases & the nplp and plp phases are distinguished clearly
- On the other hand, the model doesn't seem to put much emphasis on the variation across species & lineage, which is a shame, because in essence these should be key factors in the variation of progesterone levels
- Instead, the model clearly learns the mean, max & min patterns, as well as the difference between serum & fecal data, to construct an accurate classifier; if we leave these features out, accuracy drops significantly, as the species-lineage model demonstrated (a quick importance breakdown is sketched below)
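As a rough way to verify this, one could sum the CatBoost feature importances over each one-hot group. This sketch refits a small classifier on all of molten_ohe5 rather than reusing model 5's exact train/test split, so treat the numbers as indicative only.
# Rough check (assumption: refit on the full molten_ohe5, not model_eval's split)
X = molten_ohe5.drop('id', axis=1)
y = molten_ohe5['id']
clf = CatBoostClassifier(silent=True, n_estimators=25).fit(X, y)

imp = pd.Series(clf.feature_importances_, index=X.columns)
grouped = pd.Series({'value':    imp.get('value', 0.0),
                     'variable': imp.filter(like='variable_').sum(),
                     'species':  imp.filter(like='species_').sum(),
                     'linage':   imp.filter(like='linage_').sum()})
print(grouped.round(1))   # most importance should sit with value & the variable_ group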
❮ CONCLUSION START ❯
- Monitoring ovarian function and detecting pregnancy is important for the care and management of wild felids
- In this notebook, we focused on creating machine learning models that classify the ovarian phase based on estradiol & progesterone levels
- We tried two different class splits (basal/estrus and basal/nplp/plp) on serum data, and then on fecal-only data; these models performed very well, and the classifier had no problem differentiating between the classes
- Fecal and serum data were then combined to test the basal/nplp/plp split again, where the unified model once more performed very well
- However, when relying only on lineage & species features, model accuracy dipped noticeably, with the plp and nplp phases often misclassified
❮ CONCLUSION END ❯