Data Storage

To utilise automation with mllibs, data needs to be stored in an nlpi instance. Input data is allocated a name tag (or key), allowing the user to reference the data source in a query so it can be used as an input to the selected activation function.
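For instance, assuming an nlpi instance i has already been created (instance setup is shown below), the storage flow looks roughly as follows; the key 'my_data' is purely illustrative:

import pandas as pd

df = pd.DataFrame({'price': [1.0, 1.2, 0.9]})     # some input data
i.store_data({'my_data': df})                     # allocate the name tag 'my_data'
i['show the dataframe information for my_data']   # reference the tag in a query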

Preset Datasets

Preset datasets offer a quick way to get started; they can be loaded using the load_sample_data() method:

def load_sample_data(self):
    self.store_data(sns.load_dataset('flights'),'flights')
    self.store_data(sns.load_dataset('penguins'),'penguins')
    self.store_data(sns.load_dataset('taxis'),'taxis')
    self.store_data(sns.load_dataset('titanic'),'titanic')
    self.store_data(sns.load_dataset('mpg'),'mpg')

Once the nlpi instance has been created, you can store all of the above data in i.data and reference each dataset by its allocated name, shown above:

c = nlpm()
c.load([
         eda_splot(),     # [eda] standard seaborn plots
         eda_scplot(),    # [eda] seaborn column plots
         stats_tests(),   # [stats] statistical tests for list data
         stats_plot(),    # [stats] plot and compare statistical distributions
         libop_general(), # [library] mllibs related functionality
         pd_talktodata(), # [eda] pandas data exploration 
         fourier_all()    # [signal] fast fourier transformation related
        ])

c.setup()
i = nlpi(c)               # create an instance of nlpi
i.load_sample_data()      # load preset datasets
i.data.keys()
dict_keys(['flights', 'penguins', 'taxis', 'titanic', 'mpg'])

Having loaded the data, you will have access to these datasets when making text requests!
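For example, a request that references one of the preset datasets by its name tag:

i['show the dataframe information for penguins']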

Loading Your Own Data

To load and reference your own data, you need to use the i.store_data method. At present, only two formats are supported as storage types: Python lists and pandas DataFrames. They can be stored directly:

i.store_data(data,'name')   # data : list or pd.DataFrame
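For instance (a minimal sketch; sample is an illustrative list):

sample = [0.3, 1.2, 0.7]
i.store_data(sample,'sample')   # stored under the key 'sample'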

Or as part of a dictionary input:

i.store_data({'name1': list_data, 'name2': df})   # dict of name : data pairs

Example

For example, load dataframe data from the desired source and give it a relevant name:

df = pd.read_csv('https://raw.githubusercontent.com/shtrausslearning/Data-Science-Portfolio/main/sources/stocks.csv',delimiter=',')
i.store_data({'stocks':df})

When you want to use the data, simply use its reference name:

i['show the dataframe information for stocks']

Or store some Python list data, giving each list a relevant name which will be used to reference it:

# store two sample distributions
import numpy as np

sample1 = list(np.random.normal(scale=1, size=1000))
sample2 = list(np.random.normal(scale=1, size=1000))
i.store_data({'distribution_A':sample1,
              'distribution_B':sample2})

An example request comparing both datasets:

i['compare histograms of samples distribution_B distribution_A']

Active Columns

When using natural language for automation, specifying a subset of dataframe columns in a single query can make queries quite long. As a result, mllibs utilises active columns, simply put, predefined lists of subset column names, which can be set as shown below.

An important distinction to note is that active columns are not data sources; they are stored in the existing data dictionary under the key ac (see Data Extraction).

i.store_ac('data_name','active_column_name',['column A','column B'])

Its usage is quite standard:

  • First, specify the data for which you want to store a subset of column names ('data_name')
  • Give the active column a name, which will allow you to reference the particular columns
  • Specify a Python list of strings with the names of the columns of the dataframe

Example

For example, we have the dataset penguins, for which we want to reference two columns, bill_length_mm and bill_depth_mm, as "selected_columns".

We can do this by calling the store_ac method:

i.store_ac('penguins',                        # data name
           'selected_columns',                # active column reference name
           ['bill_length_mm','bill_depth_mm'] # column names that make up active column name
           ) 

Confirm that we have stored selected_columns in penguins:

i.data['penguins']['ac']
{'selected_columns': ['bill_length_mm', 'bill_depth_mm']}

Sample Requests

An example referencing the active column name in a request:

i['using data penguins create a relplot using columns selected_columns set hue as island']

Data Extraction

If you need to extract data-related content, you can access i.data.
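For instance, the DataFrame stored under the key stocks can be retrieved directly; a minimal sketch based on the storage structure shown below:

df = i.data['stocks']['data']   # the stored DataFrame itself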

DataFrame Storage

nlpi stores a variety of data related to DataFrames; the stored content changes depending on the implemented activation functions. Here's an example:

i.data['stocks']

{'data':            date      GOOG      AAPL      AMZN        FB      NFLX      MSFT
 0    2018-01-01  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
 1    2018-01-08  1.018172  1.011943  1.061881  0.959968  1.053526  1.015988
 2    2018-01-15  1.032008  1.019771  1.053240  0.970243  1.049860  1.020524
 3    2018-01-22  1.066783  0.980057  1.140676  1.016858  1.307681  1.066561
 4    2018-01-29  1.008773  0.917143  1.163374  1.018357  1.273537  1.040708
 ..          ...       ...       ...       ...       ...       ...       ...
 100  2019-12-02  1.216280  1.546914  1.425061  1.075997  1.463641  1.720717
 101  2019-12-09  1.222821  1.572286  1.432660  1.038855  1.421496  1.752239
 102  2019-12-16  1.224418  1.596800  1.453455  1.104094  1.604362  1.784896
 103  2019-12-23  1.226504  1.656000  1.521226  1.113728  1.567170  1.802472
 104  2019-12-30  1.213014  1.678000  1.503360  1.098475  1.540883  1.788185

 [105 rows x 7 columns],
 'subset': None,
 'splits': {},
 'splits_col': {},
 'features': ['date', 'GOOG', 'AAPL', 'AMZN', 'FB', 'NFLX', 'MSFT'],
 'target': None,
 'cat': ['date'],
 'num': ['GOOG', 'AAPL', 'AMZN', 'FB', 'NFLX', 'MSFT'],
 'miss': False,
 'size': 105,
 'dim': 7,
 'model_prediction': {},
 'model_correct': {},
 'model_error': {},
 'ac': {},
 'ft': None,
 'outliers': {},
 'dimred': {}}
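
Individual entries can be accessed by their key; for example, based on the dictionary shown above:

i.data['stocks']['features']   # all column names
i.data['stocks']['num']        # numeric column names only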