Data Storage¶
To use automation with mllibs, data needs to be stored in an nlpi instance. Input data is allocated a name tag (or key), allowing the user to reference the data source in a query so it can be used as an input to the selected activation function.
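At its core this is a key→value mapping; a minimal illustration with a plain Python dictionary (not mllibs code, just the storage idea):

```python
# A name tag (key) maps to a data object (value); a query can then
# refer to the data purely by its tag
store = {}
store['flights'] = [100, 120, 130]   # list data registered under the tag 'flights'
print(store['flights'])              # the tag retrieves the stored data
```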
Preset Datasets¶
Preset datasets are a quick way to get started; calling load_sample_data() stores a number of standard seaborn datasets:
def load_sample_data(self):
    self.store_data(sns.load_dataset('flights'),'flights')
    self.store_data(sns.load_dataset('penguins'),'penguins')
    self.store_data(sns.load_dataset('taxis'),'taxis')
    self.store_data(sns.load_dataset('titanic'),'titanic')
    self.store_data(sns.load_dataset('mpg'),'mpg')
Once the nlpi instance has been created, you can store all of the above data in i.data and reference each dataset by its allocated name, shown above:
c = nlpm()
c.load([
    eda_splot(),       # [eda] standard seaborn plots
    eda_scplot(),      # [eda] seaborn column plots
    stats_tests(),     # [stats] statistical tests for list data
    stats_plot(),      # [stats] plot and compare statistical distributions
    libop_general(),   # [library] mllibs related functionality
    pd_talktodata(),   # [eda] pandas data exploration
    fourier_all()      # [signal] fast Fourier transform related
])
c.setup()

i = nlpi(c)           # create an instance of nlpi
i.load_sample_data()  # load preset datasets
i.data.keys()
Having loaded the data, you will have access to it when making text requests!
Loading Own Data¶
To load and reference your own data, use the i.store_data method. At present only two storage formats are supported: Python lists and pandas DataFrames. They can be stored directly:
Or as part of a dictionary input:
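Both call styles appear in this guide; the sketch below mimics them with a hypothetical SimpleStore class (an illustrative stand-in, not the real nlpi implementation):

```python
import pandas as pd

class SimpleStore:
    """Illustrative stand-in for nlpi's data storage (not mllibs code)."""
    def __init__(self):
        self.data = {}

    def store_data(self, data, name=None):
        if name is not None:          # direct form: store_data(obj, 'name')
            self.data[name] = data
        elif isinstance(data, dict):  # dictionary form: store_data({'name': obj})
            self.data.update(data)

s = SimpleStore()
s.store_data([1, 2, 3], 'sample_list')             # direct form
s.store_data({'df': pd.DataFrame({'a': [1, 2]})})  # dictionary form
print(sorted(s.data.keys()))  # ['df', 'sample_list']
```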
Example¶
For example, load DataFrame data from the desired source and give it a relevant name:
df = pd.read_csv('https://raw.githubusercontent.com/shtrausslearning/Data-Science-Portfolio/main/sources/stocks.csv',delimiter=',')
i.store_data({'stocks':df})
When you want to use the data, simply refer to it by its reference name.
Alternatively, store some Python list data and give each list a relevant name, which will be used to reference this data:
# store data
sample1 = list(np.random.normal(scale=1, size=1000))
sample2 = list(np.random.normal(scale=1, size=1000))
i.store_data({'distribution_A': sample1,
              'distribution_B': sample2})
These names can then be used in requests, for example when you want to compare both datasets:
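Inside mllibs such a request would be routed to one of the statistics modules; the plain-Python equivalent below shows one such comparison, a two-sample t-test with scipy (an assumption about which test a request would map to, not mllibs code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample1 = list(rng.normal(scale=1, size=1000))
sample2 = list(rng.normal(scale=1, size=1000))

# Two-sample t-test: do the two samples share the same mean?
result = stats.ttest_ind(sample1, sample2)
print(f"t-statistic: {result.statistic:.3f}, p-value: {result.pvalue:.3f}")
```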
Active Columns¶
When using natural language for automation, specifying a subset of DataFrame columns in a single query can make queries quite long. As a result, mllibs uses active columns (simply put, predefined subset column lists), which can be defined with the store_ac method.
An important distinction to note is that active columns are not data sources; they are stored in the existing data dictionary under the key ac (see Data Extraction).
Its usage is quite standard:
- First, specify the data source ("data_name") for which you want to store a subset of column names as active columns
- Give the active column list a name, which will allow you to reference those particular columns
- Specify a Python list of strings containing the names of the DataFrame columns
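Under the hood, an active column name simply stands in for a list of column labels; the equivalent subsetting in plain pandas (not mllibs code) looks like this:

```python
import pandas as pd

df = pd.DataFrame({'bill_length_mm': [39.1, 39.5],
                   'bill_depth_mm': [18.7, 17.4],
                   'species': ['Adelie', 'Adelie']})

# 'selected_columns' acts as a reusable alias for this list of labels
selected_columns = ['bill_length_mm', 'bill_depth_mm']
subset = df[selected_columns]
print(list(subset.columns))  # ['bill_length_mm', 'bill_depth_mm']
```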
Example¶
For example, we have the dataset penguins, for which we want to reference two columns, bill_length_mm and bill_depth_mm, as "selected_columns".
We can do this by calling the store_ac method:
i.store_ac('penguins',           # data name
           'selected_columns',   # active column reference name
           ['bill_length_mm','bill_depth_mm']  # column names that make up the active column list
           )
Confirm that we have stored selected_columns for penguins:
Sample Requests¶
An example of referencing the active column name in a request:
Data Extraction¶
If you need to extract data-related content, you can call i.data:
DataFrame Storage¶
nlpi stores a variety of data related to DataFrames; the stored content changes depending on the implemented activation functions. Here's an example:
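As a rough illustration, a per-DataFrame entry might look like the dictionary below; only the ac key is documented above, so the remaining keys and their layout are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical shape of a stored DataFrame entry; only 'ac' is
# documented, the other keys are illustrative assumptions
stored_entry = {
    'data': pd.DataFrame({'bill_length_mm': [39.1],
                          'bill_depth_mm': [18.7]}),
    'ac': {'selected_columns': ['bill_length_mm', 'bill_depth_mm']},
}
print(sorted(stored_entry.keys()))  # ['ac', 'data']
```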