mpd_talktodata

Module Group¶

src/pd¹

Project Stage ID¶

4[^2]

Purpose¶

The purpose of this library is to let the user explore the data stored in a dataframe using natural language requests.

Location¶

Here are the locations of the files associated with the module:

module information:

/src/pd/mpd_talktodata.json

module activation functions:

/src/pd/mpd_talktodata.py

Requirements¶

Required module import information

import numpy as np
import pandas as pd
from collections import OrderedDict
from mllibs.nlpi import nlpi
from mllibs.nlpm import parse_json
import pkg_resources
import json

Selection¶

Each activation function is assigned a unique label. The `sel` method reads the predicted label from `args['pred_task']` and calls the activation function that matches it:

def sel(self,args:dict):

    self.select = args['pred_task']
    self.args = args

    if(self.select == 'dfcolumninfo'):
        self.dfcolumninfo(self.args)
    if(self.select == 'dfsize'):
        self.dfsize(self.args)
    if(self.select == 'dfcolumn_distr'):
        self.dfcolumn_distr(self.args)
    if(self.select == 'dfcolumn_na'):
        self.dfcolumn_na(self.args)
    if(self.select == 'dfall_na'):
        self.dfall_na(self.args)
    if(self.select == 'show_stats'):
        self.show_statistics(args)
    if(self.select == 'show_info'):
        self.show_info(args)
    if(self.select == 'show_dtypes'):
        self.show_dtypes(args)
    if(self.select == 'show_feats'):
        self.show_features(args)   
    if(self.select == 'show_corr'):
        self.show_correlation(args)
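
The label-to-function selection can also be sketched as a dictionary lookup. The example below is a minimal, self-contained illustration and not the module's own implementation: `MiniTalkToData` is a hypothetical stand-in class, and the return values and the fallback note for unknown labels are additions made only so the result can be inspected.

```python
import pandas as pd


class MiniTalkToData:
    """Hypothetical stand-in for mpd_talktodata, reduced to two labels."""

    def sel(self, args: dict):
        # map each unique label to its activation function
        dispatch = {
            'dfcolumninfo': self.dfcolumninfo,
            'dfsize': self.dfsize,
        }
        func = dispatch.get(args['pred_task'])
        if func is None:
            # assumption: unknown labels are reported rather than ignored
            print('[note] unknown task:', args['pred_task'])
            return None
        return func(args)

    def dfcolumninfo(self, args: dict):
        print(args['data'].columns)
        return list(args['data'].columns)

    def dfsize(self, args: dict):
        print(args['data'].shape)
        return args['data'].shape


df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
module = MiniTalkToData()
size = module.sel({'pred_task': 'dfsize', 'data': df})
cols = module.sel({'pred_task': 'dfcolumninfo', 'data': df})
```

A lookup table keeps each label next to its function, which makes a mismatch between the two easier to spot than in a long chain of `if` statements.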

Activation Functions¶

Here you will find the relevant activation functions available in class mpd_talktodata

dfcolumninfo¶

data: `pd.DataFrame` targ:`None`

The method is used to print the dataframe column names.

code:

def dfcolumninfo(self,args:dict):
    print(args['data'].columns)

dfsize¶

data: `pd.DataFrame` targ:`None`

The method is used to print the dataframe size as a (rows, columns) tuple.

code:

def dfsize(self,args:dict):
    print(args['data'].shape)
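
As a small illustration of what `dfcolumninfo` and `dfsize` report, the sketch below builds a made-up dataframe and reads its `columns` and `shape` attributes directly. The data and variable names are assumptions chosen for the example.

```python
import pandas as pd

# illustrative data, not taken from the module
df = pd.DataFrame({'age': [22, 35, 58], 'city': ['Oslo', 'Lima', 'Pune']})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape

# columns is an Index object; convert to a list for a plain view
column_names = list(df.columns)

print(df.shape)
print(column_names)
```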

dfcolumn_distr¶

data: `pd.DataFrame` targ:`col`|`column`

The method is used to count the unique values in a dataframe column using `value_counts` and display the result.

code:

def dfcolumn_distr(self,args:dict):
    # accept the column name under either the column or the col argument
    if(args['column'] is not None):
        display(args['data'][args['column']].value_counts())
    elif(args['col'] is not None):
        display(args['data'][args['col']].value_counts())
    else:
        print('[note] please specify the column name')
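
To show what `value_counts` returns, here is a minimal sketch on made-up data. Note that missing values are left out of the counts unless `dropna=False` is passed; the dataframe and variable names are assumptions for illustration.

```python
import pandas as pd

# illustrative column with one missing entry
df = pd.DataFrame({'city': ['Oslo', 'Lima', 'Oslo', None, 'Oslo']})

# counts per unique value, most frequent first; missing values are excluded
counts = df['city'].value_counts()

# include the missing entry as its own category
counts_with_na = df['city'].value_counts(dropna=False)

print(counts)
print(counts_with_na)
```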

dfcolumn_na¶

data: `pd.DataFrame` targ:`col`|`column`

The method is used to store the rows of the selected dataframe column that contain missing data in `nlpi.memory_output`.

code:

def dfcolumn_na(self,args:dict):

    # select the requested column from either the column or the col argument
    if(args['column'] is not None):
        ls = args['data'][args['column']]
    elif(args['col'] is not None):
        ls = args['data'][args['col']]
    else:
        print('[note] please specify the column name')
        ls = None

    # identity check: comparing a Series to None with != is elementwise
    # and cannot be used as an if condition
    if(ls is not None):

        # convert series to dataframe
        if(not isinstance(ls,pd.DataFrame)):
            ls = ls.to_frame()

        print("[note] I've stored the missing rows")
        nlpi.memory_output.append({'data':ls[ls.isna().any(axis=1)]})
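
The row selection used here can be tried in isolation. The sketch below uses made-up data, converts a column to a one-column dataframe and keeps only the rows with a missing value. It also shows why a pandas Series should be tested with `is not None`: a `!= None` comparison is elementwise, and using its result as an `if` condition raises a `ValueError`.

```python
import numpy as np
import pandas as pd

# illustrative data with one missing value per column
df = pd.DataFrame({'age': [22.0, np.nan, 58.0],
                   'city': ['Oslo', 'Lima', None]})

# convert the selected column to a one-column dataframe
col = df['age'].to_frame()

# keep only the rows where the column has a missing value
missing = col[col.isna().any(axis=1)]

# a Series compared with != None yields a Series, which has no single truth value
try:
    if df['age'] != None:
        pass
    ambiguous = False
except ValueError:
    ambiguous = True

print(missing)
print('elementwise comparison is ambiguous:', ambiguous)
```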

dfall_na¶

data: `pd.DataFrame` targ:`None`

The method is used to print the total number of missing values and the number of missing values in each column, and to store the rows that contain missing data in `nlpi.memory_output`.

code:

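def dfall_na(self,args:dict):

    # isna().sum().sum() counts missing cells across the whole dataframe
    print(args['data'].isna().sum().sum(),'values in total are missing')
    # number of missing values per column
    print(args['data'].isna().sum())

    print("[note] I've stored the missing rows")
    ls = args['data']
    # keep every row that has at least one missing value
    nlpi.memory_output.append({'data':ls[ls.isna().any(axis=1)]})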
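
The difference between counting missing values and counting rows with missing data is easy to overlook, so here is a minimal sketch on made-up data. One row can hold several missing values, which means the two numbers need not agree.

```python
import numpy as np
import pandas as pd

# illustrative data: row 1 has two missing values, row 2 has one
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': ['x', None, 'z']})

# total number of missing cells
missing_cells = int(df.isna().sum().sum())

# missing cells per column
per_column = df.isna().sum()

# rows with at least one missing value
missing_rows = df[df.isna().any(axis=1)]

print(missing_cells, 'missing values in', len(missing_rows), 'rows')
```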

show_info¶

data: `pd.DataFrame` targ:`None`

Method is used to print a concise summary of a pandas DataFrame. It provides information such as the number of rows and columns, the data types of each column, the memory usage, and the number of non-null values in each column. This method is useful for quickly understanding the structure and content of a DataFrame, especially when working with large datasets. Additionally, it can help identify missing or null values that may need to be addressed in data cleaning or preprocessing.

code:

@staticmethod
def show_info(args:dict):
    # info() prints the summary itself and returns None, so no print() is needed
    args['data'].info()
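
`DataFrame.info()` writes its summary to standard output and returns `None`. If the summary is needed as a string, it can be redirected through the `buf` parameter, as in this minimal sketch on made-up data.

```python
import io

import pandas as pd

# illustrative data
df = pd.DataFrame({'age': [22, 35, 58], 'city': ['Oslo', 'Lima', 'Pune']})

# redirect the summary into an in-memory text buffer
buffer = io.StringIO()
result = df.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
```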

show_missing¶

data: `pd.DataFrame` targ:`None`

Method is used to print the number of missing values in each column of a pandas DataFrame. This is useful for spotting columns that may need to be addressed in data cleaning or preprocessing.

code:

@staticmethod
def show_missing(args:dict):
    print(args['data'].isna().sum(axis=0))
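
Alongside the per-column counts, the share of missing values per column is often handy. The sketch below is an illustration on made-up data; the fraction computed with `mean` is an addition and not part of `show_missing`.

```python
import numpy as np
import pandas as pd

# illustrative data with missing values in two of three columns
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [None, None, 'z'],
                   'c': [1, 2, 3]})

# number of missing values per column (what show_missing prints)
missing_per_column = df.isna().sum(axis=0)

# fraction of missing values per column
missing_fraction = df.isna().mean(axis=0)

print(missing_per_column)
print(missing_fraction.round(2))
```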

show_stats¶

data: `pd.DataFrame` targ:`None`

Method displays the output of pandas.DataFrame.describe(), which provides a summary of the statistical properties of each column in a DataFrame. By default, it calculates the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum for each numeric column.

code:

@staticmethod
def show_statistics(args:dict):
    display(args['data'].describe())
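
A minimal sketch of `describe()` on made-up data shows which rows the summary contains and that non-numeric columns are skipped by default. Passing `include='all'` is an option of pandas itself, not something `show_statistics` does.

```python
import pandas as pd

# illustrative data with one numeric and one text column
df = pd.DataFrame({'age': [20, 30, 40, 50],
                   'city': ['Oslo', 'Lima', 'Pune', 'Oslo']})

# default: numeric columns only
stats = df.describe()

# include='all' also summarises non-numeric columns
stats_all = df.describe(include='all')

print(stats)
```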

show_dtypes¶

data: `pd.DataFrame` targ:`None`

Method prints the `dtypes` attribute of a pandas DataFrame, which holds the data type of each column. This is useful for understanding the data types of each column and for deciding whether columns should be converted to different data types.

code:

@staticmethod
def show_dtypes(args:dict):
    print(args['data'].dtypes)
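
As an illustration of how `dtypes` can guide a conversion, the sketch below inspects a made-up dataframe in which numbers were read as text, then converts that column with `astype`. The data and names are assumptions for the example.

```python
import pandas as pd

# illustrative data: age was read as text
df = pd.DataFrame({'age': ['22', '35'], 'score': [1.5, 2.0]})

# data type of each column before conversion
print(df.dtypes)

# convert the text column to integers
df['age'] = df['age'].astype(int)

# data type of each column after conversion
print(df.dtypes)
total_age = int(df['age'].sum())
```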

show_corr¶

data: `pd.DataFrame` targ:`None`

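Method that calculates the pairwise correlation between the numeric columns in a DataFrame (Pearson correlation by default), rounds it to two decimals, and drops rows and columns that contain no values. Correlation is a statistical measure that indicates the degree to which two variables are related.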

code:

@staticmethod
def show_correlation(args:dict):
    # numeric_only=True (available from pandas 1.5) skips non-numeric columns,
    # which would otherwise raise an error in pandas 2.0 and later
    corr_mat = args['data'].corr(numeric_only=True).round(2)
    # drop rows and columns that hold no correlation values
    corr_mat = corr_mat.dropna(how='all',axis=0)
    corr_mat = corr_mat.dropna(how='all',axis=1)
    display(corr_mat)
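
To make the output easier to picture, here is a minimal sketch on made-up data with one perfectly increasing pair, one perfectly decreasing pair, and a text column. It assumes pandas 1.5 or later, where `corr` accepts `numeric_only`.

```python
import pandas as pd

# illustrative data: y rises with x, z falls with x, label is text
df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [2, 4, 6, 8],
                   'z': [4, 3, 2, 1],
                   'label': ['a', 'b', 'c', 'd']})

# Pearson correlation of the numeric columns, rounded to two decimals
corr_mat = df.corr(numeric_only=True).round(2)

print(corr_mat)
```

Values close to 1 indicate that two columns rise together, values close to -1 that one falls as the other rises, and values near 0 that there is little linear relationship.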

¹ Reference to the subfolder in src
