mllaunchpad package¶
Top-level package for ML Launchpad.
- class mllaunchpad.ModelInterface(contents=None)[source]¶
Bases:
abc.ABC
Abstract model interface for Data-Scientist-created models. Please inherit from this class when creating your model to make it usable for ModelApi.
You don’t need to create this object yourself when training. It is created automatically and the model/info returned from create_trained_model is made accessible to you through the self.contents attribute.
- abstract predict(model_conf, data_sources, data_sinks, model, args_dict)[source]¶
Implement this method, including data prep/feature creation based on args_dict. args_dict can also contain an id which the model can use to fetch data from any of the data_sources. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources
data_sinks: dict containing the data sinks, as configured in the config file
model: your model object (whatever you returned in create_trained_model)
args_dict: parameters the API was called with, as a dict of strings (any type conversion needs to be done by you)
- Return:
Prediction result as a dictionary/list structure which will be automatically turned into JSON.
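For illustration, a minimal sketch of a ModelInterface implementation (the class name, the petal_length argument and the scikit-learn-style model object are assumptions for this example, not part of ML Launchpad):

import mllaunchpad

class MyExampleModel(mllaunchpad.ModelInterface):
    def predict(self, model_conf, data_sources, data_sinks, model, args_dict):
        # args_dict values arrive as strings; any type conversion is up to you
        petal_length = float(args_dict["petal_length"])
        # `model` is whatever create_trained_model returned, here assumed to be
        # a fitted scikit-learn-style estimator
        prediction = model.predict([[petal_length]])
        return {"prediction": prediction.tolist()}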
- class mllaunchpad.ModelMakerInterface[source]¶
Bases:
abc.ABC
Abstract model factory interface for Data-Scientist-created models. Please inherit from this class and put your training code into the method “create_trained_model”. This method will be called by the framework when your model needs to be (re)trained.
- Why not simply use static methods?
We want to make it possible for create_trained_model to pass extra info to test_trained_model without extending the latter with optional keyword arguments that might be confusing for the 90% of cases where they are not needed. So we rely on the smarts of the person inheriting from this class to find a solution or shortcut if they want to do more involved things, e.g. perform the train/test split themselves.
- abstract create_trained_model(model_conf, data_sources, data_sinks, old_model=None)[source]¶
Implement this method, including data prep/feature creation. No need to test your model here. Put testing code in test_trained_model, which will be called automatically after training. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file. Usually unused when training.
old_model: contains an old model, if it exists, which can be used for incremental training. Default: None
- Return:
The trained model/data/anything which you want to use in the predict() function. This is usually simply a fitted model object, but it can be anything, e.g. a dict of several models, or a model with some extra info. Whatever you return here gets automatically stuffed into your ModelInterface-inherited object and is accessible there via predict’s model parameter (or the self.contents attribute).
- abstract test_trained_model(model_conf, data_sources, data_sinks, model)[source]¶
Implement this method, including data prep/feature creation. This method will be called to re-test a model, e.g. to check whether it has to be re-trained. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file. Usually unused when testing.
model: your model object (whatever you returned in create_trained_model)
- Return:
Return a dict of metrics (like ‘accuracy’, ‘f1’, ‘confusion_matrix’, etc.)
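For illustration, a minimal sketch of a ModelMakerInterface implementation (the class name, the datasource names, the column names and the use of scikit-learn are assumptions for this example):

import mllaunchpad
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

class MyExampleModelMaker(mllaunchpad.ModelMakerInterface):
    def create_trained_model(self, model_conf, data_sources, data_sinks, old_model=None):
        df = data_sources["my_train_datasource"].get_dataframe()  # hypothetical datasource name
        clf = DecisionTreeClassifier().fit(df[["petal_length"]], df["species"])
        return clf  # becomes the `model` argument of predict/test_trained_model

    def test_trained_model(self, model_conf, data_sources, data_sinks, model):
        df = data_sources["my_test_datasource"].get_dataframe()  # hypothetical datasource name
        predictions = model.predict(df[["petal_length"]])
        return {"accuracy": accuracy_score(df["species"], predictions)}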
- mllaunchpad.get_validated_config(filename: str = './LAUNCHPAD_CFG.yml') dict [source]¶
Read the configuration from file and return it as a dict object.
- Parameters
filename (optional str, default: environment variable LAUNCHPAD_CFG or file ./LAUNCHPAD_CFG.yml) – Path to configuration file
- Returns
dict with configuration
- Return type
dict
- mllaunchpad.get_validated_config_str(io: Union[AnyStr, TextIO]) dict [source]¶
Read the configuration from a string or open file and return it as a dict object. This function exists mainly for making debugging and unit testing your model’s code easier.
- Parameters
io (str or open text file handle) – Configuration as a unicode string, a byte string, or an open text file to read from
- Returns
configuration
- Return type
dict
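For illustration, a sketch of how this could be used in a unit test (the file name is a placeholder; a plain YAML string works as well):

import mllaunchpad as mllp

with open("tests/test_config.yml", encoding="utf-8") as f:
    cfg = mllp.get_validated_config_str(f)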
- mllaunchpad.list_models(model_store_location_or_config_dict: Union[Dict, str])[source]¶
Get information on all available versions of trained models.
- Parameters
model_store_location_or_config_dict (Union[Dict, str]) – Location of the model store. If you have a config dict available, use that instead.
Side note: The return value includes backups of models that have been re-trained without changing the version number (they reside in the subdirectory previous). Please note that these backed-up models are listed for information only and are not available for loading (one would have to restore them by moving them up one directory level out of previous).
Example:
import mllaunchpad as mllp

my_cfg = mllp.get_validated_config("./my_config_file.yml")
all_models = mllp.list_models(my_cfg)  # also accepts model store location string
# An example of what a ``list_models()``'s result would look like:
# {
#     iris: {
#         1.0.0: { ... complete metadata of this version number ... },
#         1.1.0: { ... },
#         latest: { ... duplicate of metadata of highest available version number, here 1.1.0 ... },
#         backups: [ {...}, {...}, ... ]
#     },
#     my_other_model: {
#         1.0.1: { ... },
#         2.0.0: { ... },
#         latest: { ... },
#         backups: []
#     }
# }
- Returns
Dict with information on all available trained models.
- mllaunchpad.order_columns(obj: Union[pandas.core.frame.DataFrame, numpy.ndarray, Dict])[source]¶
Order the columns of a DataFrame, a dict, or a Numpy structured array. Use this on your training data right before passing it into the model. This will guarantee that the model is trained with a reproducible column order.
Do the same in your test code.
Most importantly, use this also in your predict method, as the incoming args_dict does not have a deterministic order.
- Params:
obj: a DataFrame, a dict, or a Numpy structured array
- Returns:
The obj with columns ordered lexicographically
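For illustration, a small usage sketch (the column names are arbitrary):

import pandas as pd
from mllaunchpad import order_columns

df = pd.DataFrame({"b": [1, 2], "a": [3, 4], "c": [5, 6]})
df = order_columns(df)  # column order is now ["a", "b", "c"]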
- mllaunchpad.predict(complete_conf: Dict, arg_dict: Optional[Dict] = None, cache: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None, use_live_code: bool = False)[source]¶
Carry out prediction for the model specified in the configuration.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
arg_dict (optional Dict, default: None) – Arguments dict for the prediction (analogous to what it would get from a web API)
model (optional object implementing ModelInterface, default: None) – Test this model instead of loading it from model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
use_live_code (optional bool, default: False) – Use the current predict function instead of the one persisted with the model in the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
model’s prediction output
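For illustration, a sketch of calling predict from code, e.g. while debugging (the config file name and the argument key are placeholders):

import mllaunchpad as mllp

cfg = mllp.get_validated_config("./my_config_file.yml")
output = mllp.predict(cfg, arg_dict={"petal_length": "1.4"}, use_live_code=True)
print(output)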
- mllaunchpad.report(name: str, value) None ¶
Add a piece of information to the train report during training.
The train report is part of the model’s metadata that is saved to the model store. Use mllaunchpad.list_models() to query metadata from the model store.
This function is supposed to be called from your create_trained_model() or test_trained_model() implementation. You can pass any values that are JSON-able, same as with test_trained_model()’s returned metrics. However, if the value is a DataFrame, it will be summarized (using pd.describe()). You can use this, for example, to improve the traceability of your trained models and for some basic sanity checks of the training data distribution.
- Parameters
name (str) – Key to save the information under (e.g. “meaning_of_life”)
value (str, number, list, dict, Numpy Array or Pandas DataFrame) – Value to save. Any JSON-able value or structure will work. Pandas DataFrames will be summarized instead of saved.
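For illustration, a sketch of the kind of calls you might make from within your create_trained_model() implementation (the report keys and the DataFrame are stand-ins for your own values and training data):

import pandas as pd
import mllaunchpad

# train_df stands in for your actual training data:
train_df = pd.DataFrame({"petal_length": [1.4, 4.7, 5.1], "species": ["setosa", "versicolor", "virginica"]})
mllaunchpad.report("training_rows", len(train_df))
mllaunchpad.report("hyperparameters", {"max_depth": 5})
mllaunchpad.report("train_data_summary", train_df)  # DataFrames are summarized via describe()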
- mllaunchpad.retest(complete_conf: Dict, cache: bool = True, persist: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]¶
Retest a model as specified in the configuration and persist its test metrics in the model store.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
persist (optional bool, default: True) – Whether to update the model in model_cache: with the test metrics. This parameter exists mainly for making debugging and unit testing your model’s code easier.
model (optional object implementing ModelInterface, default: None) – Test this model instead of loading it from model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
test_metrics
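For illustration, a sketch of retesting from code, e.g. in a notebook (the config file name is a placeholder; persist=False avoids touching the model store while experimenting):

import mllaunchpad as mllp

cfg = mllp.get_validated_config("./my_config_file.yml")
test_metrics = mllp.retest(cfg, persist=False)
print(test_metrics)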
- mllaunchpad.train_model(complete_conf: Dict, cache: bool = True, persist: bool = True, test: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]¶
Train and test a model as specified in the configuration and persist it in the model store.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
persist (optional bool, default: True) – Whether to store the trained model in the location configured by model_cache:. This parameter exists mainly for making debugging and unit testing your model’s code easier.
test (optional bool, default: True) – Whether to test the model after training. This parameter exists mainly for making debugging and unit testing your model’s code easier.
model (optional object implementing ModelInterface, default: None) – Use this model as the previous model instead of trying to load it from the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
Tuple of (object implementing ModelInterface, metrics)
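For illustration, a sketch of training from code instead of the command line (the config file name is a placeholder):

import mllaunchpad as mllp

cfg = mllp.get_validated_config("./my_config_file.yml")
model_wrapper, metrics = mllp.train_model(cfg)
print(metrics)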
Submodules¶
mllaunchpad.api module¶
This module contains functionality for generic creation and handling of RESTful APIs for Machine Learning Models. Among other things, it handles parsing the RAML definition and validating parameters.
- class mllaunchpad.api.GetByIdResource(model_api_obj, parser, id_name)[source]¶
Bases:
flask_restful.Resource
- methods: Optional[List[str]] = {'GET'}¶
A list of methods this view can handle.
- class mllaunchpad.api.ModelApi(config, application, debug=False)[source]¶
Bases:
object
Class to plug a Data-Scientist-created model into.
This class handles the heavy lifting of APIs for the model.
The model is a delegate which inherits from (=implements) ModelInterface. It needs to provide a predict function.
For details, see the documentation in the module model_interface
- class mllaunchpad.api.QueryOrFileUploadResource(model_api_obj, query_parser=None, file_parser=None)[source]¶
Bases:
flask_restful.Resource
- methods: Optional[List[str]] = {'GET', 'POST'}¶
A list of methods this view can handle.
- class mllaunchpad.api.QueryResource(model_api_obj, parser)[source]¶
Bases:
flask_restful.Resource
- methods: Optional[List[str]] = {'GET'}¶
A list of methods this view can handle.
mllaunchpad.cli module¶
This module provides the command line interface for ML Launchpad
mllaunchpad.config module¶
This module contains functionality for reading and validating the configuration.
- mllaunchpad.config.get_validated_config(filename: str = './LAUNCHPAD_CFG.yml') dict [source]¶
Read the configuration from file and return it as a dict object.
- Parameters
filename (optional str, default: environment variable LAUNCHPAD_CFG or file ./LAUNCHPAD_CFG.yml) – Path to configuration file
- Returns
dict with configuration
- Return type
dict
- mllaunchpad.config.get_validated_config_str(io: Union[AnyStr, TextIO]) dict [source]¶
Read the configuration from a string or open file and return it as a dict object. This function exists mainly for making debugging and unit testing your model’s code easier.
- Parameters
io (str or open text file handle) – Configuration as a unicode string, a byte string, or an open text file to read from
- Returns
configuration
- Return type
dict
mllaunchpad.datasources module¶
- class mllaunchpad.datasources.FileDataSink(identifier: str, datasink_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSink
DataSink for putting data into files.
See serves for the available types.
Configuration example:

datasinks:
  # ... (other datasinks)
  my_datasink:
    type: euro_csv  # `euro_csv` changes separators to ";" and decimals to "," w.r.t. `csv`
    path: /some/file.csv  # Can be URL, uses `df.to_csv` internally
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when writing the data using `df.to_csv`
    dtypes_path: ./some/file.dtypes  # optional: location for saving the csv's column dtypes info
  my_raw_datasink:
    type: text_file  # raw files can also be of type `binary_file`
    path: /some/file.txt  # Can be URL
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when writing the data using `fh.write`
When saving csv or euro_csv type formats, you can use the setting dtypes_path to specify a location where to save a dtypes description for the csv (which you can use later with FileDataSource’s dtypes_path setting). These dtypes will be enforced when reading the csv, which helps avoid problems when pandas.read_csv interprets data differently than you do. Use dtypes_path to enforce dtype parity between csv datasinks and datasources.
Using the raw formats binary_file and text_file, you can persist arbitrary data, as long as it can be represented as a bytes or a str object, respectively. text_file uses UTF-8 encoding. Please note that while possible, it is not recommended to persist DataFrames this way, because by adding format-specific code to your model, you’re giving up your code’s independence from the type of DataSource/DataSink. Here’s an example for pickling an arbitrary object:
# config fragment:
datasinks:
  # ...
  my_pickle_datasink:
    type: binary_file
    path: /some/file.pickle
    tags: [train]
    options: {}

# code fragment:
import pickle
# ...
# in predict/test/train code:
my_pickle = pickle.dumps(my_object)
data_sinks["my_pickle_datasink"].put_raw(my_pickle)
- put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Write a pandas dataframe to file and optionally the dtypes if included in the configuration. The default is not to save the dataframe’s row index. Configure the DataSink’s options dict to pass keyword arguments to my_df.to_csv. If the directory path leading to the file does not exist, it will be created.
Example:
data_sinks["my_datasink"].put_dataframe(my_df)
- Parameters
dataframe (pandas DataFrame) – The pandas dataframe to save
params (optional dict) – Currently not implemented
chunksize (optional int) – Currently not implemented
- put_raw(raw_data: Union[str, bytes], params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Write raw (unstructured) data to file. If the directory path leading to the file does not exist, it will be created.
Example:
data_sinks["my_raw_datasink"].put_raw(my_data)
- Parameters
raw_data (bytes or str) – The data to save (bytes for binary, string for text file)
params (optional dict) – Currently not implemented
chunksize (optional int) – Currently not implemented
- serves: List[str] = ['csv', 'euro_csv', 'text_file', 'binary_file']¶
- class mllaunchpad.datasources.FileDataSource(identifier: str, datasource_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSource
DataSource for fetching data from files.
See serves for the available types.
Configuration example:

datasources:
  # ... (other datasources)
  my_datasource:
    type: euro_csv  # `euro_csv` changes separators to ";" and decimals to "," w.r.t. `csv`
    path: /some/file.csv  # Can be URL, uses `pandas.read_csv` internally
    expires: 0  # generic parameter, see documentation on DataSources
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when fetching the data using `pandas.read_csv`
    dtypes_path: ./some/file.dtypes  # optional: location with the csv's column dtypes info
  my_raw_datasource:
    type: text_file  # raw files can also be of type `binary_file`
    path: /some/file.txt  # Can be URL
    expires: 0  # generic parameter, see documentation on DataSources
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when fetching the data using `fh.read`
When loading csv or euro_csv type formats, you can use the setting dtypes_path to specify a location with a dtypes description for the csv (usually generated earlier by using FileDataSink’s dtypes_path setting). These dtypes will be enforced when reading the csv, which helps avoid problems when pandas.read_csv interprets data differently than you do. Use dtypes_path to enforce dtype parity between csv datasinks and datasources.
Using the raw formats binary_file and text_file, you can read arbitrary data, as long as it can be represented as a bytes or a str object, respectively. text_file uses UTF-8 encoding. Please note that while possible, it is not recommended to persist DataFrames this way, because by adding format-specific code to your model, you’re giving up your code’s independence from the type of DataSource/DataSink. Here’s an example for unpickling an arbitrary object:
# config fragment:
datasources:
  # ...
  my_pickle_datasource:
    type: binary_file
    path: /some/file.pickle
    tags: [train]
    options: {}

# code fragment:
import pickle
# ...
# in predict/test/train code:
my_pickle = data_sources["my_pickle_datasource"].get_raw()
my_object = pickle.loads(my_pickle)
- get_dataframe(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Get data as a pandas dataframe.
Example:
data_sources["my_datasource"].get_dataframe()
- Parameters
params (optional dict) – Currently not implemented
chunksize (optional int) – Return an iterator where chunksize is the number of rows to include in each chunk.
- Returns
DataFrame object, possibly cached according to config value of expires:
- get_raw(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Get data as raw (unstructured) data.
Example:
data_sources["my_raw_datasource"].get_raw()
- Parameters
params (optional dict) – Currently not implemented
chunksize (optional int) – Currently not implemented
- Returns
The file’s bytes (binary) or string (text) contents, possibly cached according to config value of expires:
- Return type
bytes or str
- serves: List[str] = ['csv', 'euro_csv', 'text_file', 'binary_file']¶
- class mllaunchpad.datasources.OracleDataSink(identifier: str, datasink_config: Dict, dbms_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSink
DataSink for Oracle database connections.
Creates a long-living connection on initialization.
Configuration example:
dbms:
  # ... (other connections)
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: oracle
    host: host.example.com
    port: 1251
    user_var: MY_USER_ENV_VAR
    password_var: MY_PW_ENV_VAR  # optional
    service_name: servicename.example.com
    options: {}  # used as **kwargs when initializing the DB connection
# ...
datasinks:
  # ... (other datasinks)
  my_datasink:
    type: dbms.my_connection
    table: somewhere.my_table
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when storing the table using `my_df.to_sql`
- put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Store the pandas dataframe as a table. The default is not to store the dataframe’s row index. Configure the DataSink’s options dict to pass keyword arguments to df.to_sql.
Example:
data_sinks["my_datasink"].put_dataframe(my_df)
- Parameters
dataframe (pandas DataFrame) – The pandas dataframe to store
params (optional dict) – Currently not implemented
chunksize (optional int) – Currently not implemented
- put_raw(raw_data, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Not implemented.
- Raises
NotImplementedError – Raw/blob format currently not supported.
- serves: List[str] = ['dbms.oracle']¶
- class mllaunchpad.datasources.OracleDataSource(identifier: str, datasource_config: Dict, dbms_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSource
DataSource for Oracle database connections.
Creates a long-living connection on initialization.
Configuration example:
dbms:
  # ... (other connections)
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: oracle
    host: host.example.com
    port: 1251
    user_var: MY_USER_ENV_VAR
    password_var: MY_PW_ENV_VAR  # optional
    service_name: servicename.example.com
    options: {}  # used as **kwargs when initializing the DB connection
# ...
datasources:
  # ... (other datasources)
  my_datasource:
    type: dbms.my_connection
    query: SELECT * FROM somewhere.my_table WHERE id = :id  # fill `:params` by calling `get_dataframe` with a `dict`
    expires: 0  # generic parameter, see documentation on DataSources
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when fetching the query using `pandas.read_sql`
- get_dataframe(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Get the data as pandas dataframe.
Null values are replaced by numpy.nan.
Example:
data_sources["my_datasource"].get_dataframe({"id": 387})
- Parameters
params (optional dict) – Query parameters to fill in query (e.g. replace query’s :id parameter with value 387)
chunksize (optional int) – Return an iterator where chunksize is the number of rows to include in each chunk.
- Returns
DataFrame object, possibly cached according to config value of expires:
- get_raw(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Not implemented.
- Raises
NotImplementedError – Raw/blob format currently not supported.
- serves: List[str] = ['dbms.oracle']¶
- class mllaunchpad.datasources.SqlDataSink(identifier: str, datasource_config: Dict, dbms_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSink
DataSink for RedShift, Postgres, MySQL, SQLite, Oracle, Microsoft SQL (ODBC), and their dialects.
Uses SQLAlchemy under the hood, and as such, manages a connection pool automatically.
Please configure the dbms:<name>:connection_string:, which is a standard RFC-1738 URL with the syntax dialect[+driver]://[user:password@][host]/[dbname][?key=value..]. The exact URL is specific to the database you want to connect to. Find examples for all supported database dialects here.
Depending on the dialect you want to use, you might need to install additional drivers and packages. For example, for connecting to a kerberized Impala instance via ODBC, you need to:
- Install Impala ODBC drivers for your OS,
- pip install winkerberos thrift_sasl pyodbc sqlalchemy  # use pykerberos for non-Windows systems
If you are tasked with connecting to a particular database system and don’t know where to start, researching how to connect to it from SQLAlchemy will serve as a good starting point.
Other configuration in the dbms: section (besides connection_string:) is optional, but can be provided if deemed necessary:
- Any dbms:-level settings other than type:, connection_string: and options: will be passed as additional keyword arguments to SQLAlchemy’s create_engine.
- Any key-value pairs inside dbms:<name>:options: {} will be passed to SQLAlchemy as connect_args. If you append _var to the end of an argument key, its value will be interpreted as an environment variable name from which ML Launchpad will attempt to get the value. This can be useful for information like passwords which you do not want to store in the configuration file.
Configuration example:
dbms:
  # ... (other connections)
  # Example for connecting to a kerberized Impala instance via ODBC:
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: sql
    connection_string: mssql+pyodbc:///default?&driver=Cloudera+ODBC+Driver+for+Impala&host=servername.somedomain.com&port=21050&authmech=1&krbservicename=impala&ssl=1&usesasl=1&ignoretransactions=1&usesystemtruststore=1
    # pyodbc alternative: mssql+pyodbc:///?odbc_connect=DRIVER%3D%7BCloudera+ODBC+Driver+for+Impala%7D%3BHOST%3Dservername.somedomain.com%3BPORT%3D21050%3BAUTHMECH%3D1%3BKRBSERVICENAME%3Dimpala%3BSSL%3D1%3BUSESASL%3D1%3BIGNORETRANSACTIONS%3D1%3BUSESYSTEMTRUSTSTORE%3D1
    echo: True  # example for an additional SQLAlchemy keyword argument (logs the SQL) -- these are optional
    options: {}  # used as `connect_args` when creating the SQLAlchemy engine
# ...
datasinks:
  # ... (other datasinks)
  my_datasink:
    type: dbms.my_connection
    table: somewhere.my_table
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when storing the table using `my_df.to_sql`
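For illustration, a hypothetical fragment showing the _var mechanism described above (the connection string, option key and environment variable name are placeholders; which connect_args keys are accepted depends on your DBAPI driver):

dbms:
  my_connection:
    type: sql
    connection_string: postgresql://my_user@db.example.com/my_db
    options:
      password_var: MY_DB_PASSWORD  # value is read from the MY_DB_PASSWORD environment variable and passed via connect_args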
- put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Store the pandas dataframe as a table. The default is not to store the dataframe’s row index. Configure the DataSink’s options dict to pass keyword arguments to df.to_sql.
Example:
data_sinks["my_datasink"].put_dataframe(my_df)
- Parameters
dataframe (pandas DataFrame) – The pandas dataframe to store
params (optional dict) – Currently not implemented
chunksize (optional int) – Currently not implemented
- put_raw(raw_data, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Not implemented.
- Raises
NotImplementedError – Raw/blob format currently not supported.
- serves: List[str] = ['dbms.sql']¶
- class mllaunchpad.datasources.SqlDataSource(identifier: str, datasource_config: Dict, dbms_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSource
DataSource for RedShift, Postgres, MySQL, SQLite, Oracle, Microsoft SQL (ODBC), and their dialects.
Uses SQLAlchemy under the hood, and as such, manages a connection pool automatically.
Please configure the dbms:<name>:connection_string:, which is a standard RFC-1738 URL with the syntax dialect[+driver]://[user:password@][host]/[dbname][?key=value..]. The exact URL is specific to the database you want to connect to. Find examples for all supported database dialects here.
Depending on the dialect you want to use, you might need to install additional drivers and packages. For example, for connecting to a kerberized Impala instance via ODBC, you need to:
- Install Impala ODBC drivers for your OS,
- pip install winkerberos thrift_sasl pyodbc sqlalchemy  # use pykerberos for non-Windows systems
If you are tasked with connecting to a particular database system and don’t know where to start, researching how to connect to it from SQLAlchemy will serve as a good starting point.
Other configuration in the dbms: section (besides connection_string:) is optional, but can be provided if deemed necessary:
- Any dbms:-level settings other than type:, connection_string: and options: will be passed as additional keyword arguments to SQLAlchemy’s create_engine.
- Any key-value pairs inside dbms:<name>:options: {} will be passed to SQLAlchemy as connect_args. If you append _var to the end of an argument key, its value will be interpreted as an environment variable name from which ML Launchpad will attempt to get the value. This can be useful for information like passwords which you do not want to store in the configuration file.
Configuration example:
dbms:
  # ... (other connections)
  # Example for connecting to a kerberized Impala instance via ODBC:
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: sql
    connection_string: mssql+pyodbc:///default?&driver=Cloudera+ODBC+Driver+for+Impala&host=servername.somedomain.com&port=21050&authmech=1&krbservicename=impala&ssl=1&usesasl=1&ignoretransactions=1&usesystemtruststore=1
    # pyodbc alternative: mssql+pyodbc:///?odbc_connect=DRIVER%3D%7BCloudera+ODBC+Driver+for+Impala%7D%3BHOST%3Dservername.somedomain.com%3BPORT%3D21050%3BAUTHMECH%3D1%3BKRBSERVICENAME%3Dimpala%3BSSL%3D1%3BUSESASL%3D1%3BIGNORETRANSACTIONS%3D1%3BUSESYSTEMTRUSTSTORE%3D1
    echo: True  # example for an additional SQLAlchemy keyword argument (logs the SQL) -- these are optional
    options: {}  # used as `connect_args` when creating the SQLAlchemy engine
# ...
datasources:
  # ... (other datasources)
  my_datasource:
    type: dbms.my_connection
    query: SELECT * FROM somewhere.my_table WHERE id = :id  # fill `:params` by calling `get_dataframe` with a `dict`
    expires: 0  # generic parameter, see documentation on DataSources
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when fetching the query using `pandas.read_sql`
- get_dataframe(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Get the data as pandas dataframe.
Null values are replaced by numpy.nan.
Example:
my_df = data_sources["my_datasource"].get_dataframe({"id": 387})
- Parameters
params (optional dict) – Query parameters to fill in query (e.g. replace query’s :id parameter with value 387)
chunksize (optional int) – Return an iterator where chunksize is the number of rows to include in each chunk.
- Returns
DataFrame object, possibly cached according to config value of expires:
- get_raw(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Not implemented.
- Raises
NotImplementedError – Raw/blob format currently not supported.
- serves: List[str] = ['dbms.sql']¶
mllaunchpad.logutil module¶
mllaunchpad.model_actions module¶
Convenience functions for executing training, testing and prediction
- mllaunchpad.model_actions.predict(complete_conf: Dict, arg_dict: Optional[Dict] = None, cache: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None, use_live_code: bool = False)[source]¶
Carry out prediction for the model specified in the configuration.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
arg_dict (optional Dict, default: None) – Arguments dict for the prediction (analogous to what it would get from a web API)
model (optional object implementing ModelInterface, default: None) – Test this model instead of loading it from model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
use_live_code (optional bool, default: False) – Use the current predict function instead of the one persisted with the model in the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
model’s prediction output
- mllaunchpad.model_actions.retest(complete_conf: Dict, cache: bool = True, persist: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]¶
Retest a model as specified in the configuration and persist its test metrics in the model store.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
persist (optional bool, default: True) – Whether to update the model in model_cache: with the test metrics. This parameter exists mainly for making debugging and unit testing your model’s code easier.
model (optional object implementing ModelInterface, default: None) – Test this model instead of loading it from model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
test_metrics
- mllaunchpad.model_actions.train_model(complete_conf: Dict, cache: bool = True, persist: bool = True, test: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]¶
Train and test a model as specified in the configuration and persist it in the model store.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
persist (optional bool, default: True) – Whether to store the trained model in the location configured by model_cache:. This parameter exists mainly for making debugging and unit testing your model’s code easier.
test (optional bool, default: True) – Whether to test the model after training. This parameter exists mainly for making debugging and unit testing your model’s code easier.
model (optional object implementing ModelInterface, default: None) – Use this model as the previous model instead of trying to load it from the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
Tuple of (object implementing ModelInterface, metrics)
mllaunchpad.model_interface module¶
- class mllaunchpad.model_interface.ModelInterface(contents=None)[source]¶
Bases:
abc.ABC
Abstract model interface for Data-Scientist-created models. Please inherit from this class when creating your model to make it usable for ModelApi.
You don’t need to create this object yourself when training. It is created automatically and the model/info returned from create_trained_model is made accessible to you through the self.contents attribute.
- abstract predict(model_conf, data_sources, data_sinks, model, args_dict)[source]¶
Implement this method, including data prep/feature creation based on args_dict. args_dict can also contain an id which the model can use to fetch data from any of the data_sources. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources
data_sinks: dict containing the data sinks, as configured in the config file
model: your model object (whatever you returned in create_trained_model)
args_dict: parameters the API was called with, as a dict of strings (any type conversion needs to be done by you)
- Return:
Prediction result as a dictionary/list structure which will be automatically turned into JSON.
- class mllaunchpad.model_interface.ModelMakerInterface[source]¶
Bases:
abc.ABC
Abstract model factory interface for Data-Scientist-created models. Please inherit from this class and put your training code into the method “create_trained_model”. This method will be called by the framework when your model needs to be (re)trained.
- Why not simply use static methods?
We want to make it possible for create_trained_model to pass extra info to test_trained_model without extending the latter with optional keyword arguments that might be confusing for the 90% of cases where they are not needed. So we rely on the smarts of the person inheriting from this class to find a solution or shortcut if they want to do more involved things, e.g. perform the train/test split themselves.
- abstract create_trained_model(model_conf, data_sources, data_sinks, old_model=None)[source]¶
Implement this method, including data prep/feature creation. No need to test your model here. Put testing code in test_trained_model, which will be called automatically after training. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file. Usually unused when training.
old_model: contains an old model, if it exists, which can be used for incremental training. Default: None
- Return:
The trained model/data/anything which you want to use in the predict() function. This is usually simply a fitted model object, but it can be anything, e.g. a dict of several models, or a model with some extra info. Whatever you return here gets automatically stuffed into your ModelInterface-inherited object and is accessible there via predict’s model parameter (or the self.contents attribute).
- abstract test_trained_model(model_conf, data_sources, data_sinks, model)[source]¶
Implement this method, including data prep/feature creation. This method will be called to re-test a model, e.g. to check whether it has to be re-trained. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file. Usually unused when testing.
model: your model object (whatever you returned in create_trained_model)
- Return:
Return a dict of metrics (like ‘accuracy’, ‘f1’, ‘confusion_matrix’, etc.)
mllaunchpad.resource module¶
- class mllaunchpad.resource.CachedDataSource(name, bases, dct)[source]¶
Bases:
type
Metaclass to auto-apply the @cached decorator to data getters. https://stackoverflow.com/questions/10067262/automatically-decorating-every-instance-method-in-a-class
- class mllaunchpad.resource.DataSink(identifier: str, datasink_config: Dict, sub_config: Optional[Dict] = None)[source]¶
Bases:
object
Interface, used by the Data Scientist’s model to persist data (usually prediction results). Concrete DataSinks (for files, databases, etc.) need to inherit from this class.
- abstract put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
- abstract put_raw(raw_data: Union[str, bytes], params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
- serves: List[str] = []¶
- class mllaunchpad.resource.DataSource(identifier: str, datasource_config: Dict, sub_config: Optional[Dict] = None)[source]¶
Bases:
object
Interface, used by the Data Scientist’s model to get its data from. Concrete DataSources (for files, databases, etc.) need to inherit from this class.
- serves: List[str] = []¶
- class mllaunchpad.resource.ModelStore(config: Union[Dict, str])[source]¶
Bases:
object
Deals with persisting and loading models, and with updating their metrics and metadata. Abstracts away how and where the model is kept.
TODO: Smarter querying like ‘get me the model with the currently (next) best metrics which serves a particular API.’
- dump_trained_model(complete_conf, model, metrics)[source]¶
Save a model object in the model store. Some metadata will also be saved along with the model, including the metrics passed in the metrics parameter.
- Params:
complete_conf: the complete configuration dict (including our model’s config)
model: the model object to store
metrics: metrics dictionary
- Returns:
Nothing
- list_models()[source]¶
Get information on all available versions of trained models.
Side note: This also includes backups of models that have been re-trained without changing the version number (they reside in the subdirectory previous). Please note that these backed-up models are listed for information only and are not available for loading (one would have to restore them by moving them up one directory level out of previous).
Example:
ms = mllp.ModelStore("./model_store")
all_models = ms.list_models()
# An example of what a ``list_models()``'s result would look like:
{
    iris: {
        1.0.0: { ... complete metadata of this version number ... },
        1.1.0: { ... },
        latest: { ... duplicate of metadata of highest available version number, here 1.1.0 ... },
        backups: [ {...}, {...}, ... ]
    },
    my_other_model: {
        1.0.1: { ... },
        2.0.0: { ... },
        latest: { ... },
        backups: []
    }
}
- Returns
Dict with information on all available trained models.
- mllaunchpad.resource.create_data_sources_and_sinks(config: Dict, tags: Optional[Iterable[str]] = None) Tuple[Dict[str, mllaunchpad.resource.DataSource], Dict[str, mllaunchpad.resource.DataSink]] [source]¶
Creates the data sources and data sinks as defined in the configuration dict. Filters them by tag.
- Params:
config: configuration dictionary
tags: optionally filter for only matching data sources/sinks; no value(s) = match all
- Returns:
Tuple of two dicts: a data sources dict (keys=datasource names, values=initialized DataSource objects) and a data sinks dict (keys=datasink names, values=initialized DataSink objects)
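For illustration, a usage sketch (the config file name, tag and datasource name are placeholders):

from mllaunchpad import get_validated_config
from mllaunchpad.resource import create_data_sources_and_sinks

cfg = get_validated_config("./my_config_file.yml")
data_sources, data_sinks = create_data_sources_and_sinks(cfg, tags=["train"])
train_df = data_sources["my_datasource"].get_dataframe()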
- mllaunchpad.resource.get_user_pw(user_var: str, password_var: str) Tuple[str, Optional[str]] [source]¶
- mllaunchpad.resource.order_columns(obj: Union[pandas.core.frame.DataFrame, numpy.ndarray, Dict])[source]¶
Order the columns of a DataFrame, a dict, or a Numpy structured array. Use this on your training data right before passing it into the model. This will guarantee that the model is trained with a reproducible column order.
Do the same in your test code.
Most importantly, use this also in your predict method, as the incoming args_dict does not have a deterministic order.
- Params:
obj: a DataFrame, a dict, or a Numpy structured array
- Returns:
The obj with columns ordered lexicographically
mllaunchpad.yaml_loader module¶
- class mllaunchpad.yaml_loader.SafeIncludeLoader(stream)[source]¶
Bases:
yaml.loader.SafeLoader
A subclass of SafeLoader which supports !include file references.
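For illustration, a hypothetical configuration fragment using such an !include reference (file and key names are placeholders; where an !include tag makes sense depends on your configuration):

datasources: !include my_datasources.yml
model:
  name: my_model
  version: 0.0.1
  module: my_model_module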
- yaml_constructors = {'tag:yaml.org,2002:null': <function SafeConstructor.construct_yaml_null>, 'tag:yaml.org,2002:bool': <function SafeConstructor.construct_yaml_bool>, 'tag:yaml.org,2002:int': <function SafeConstructor.construct_yaml_int>, 'tag:yaml.org,2002:float': <function SafeConstructor.construct_yaml_float>, 'tag:yaml.org,2002:binary': <function SafeConstructor.construct_yaml_binary>, 'tag:yaml.org,2002:timestamp': <function SafeConstructor.construct_yaml_timestamp>, 'tag:yaml.org,2002:omap': <function SafeConstructor.construct_yaml_omap>, 'tag:yaml.org,2002:pairs': <function SafeConstructor.construct_yaml_pairs>, 'tag:yaml.org,2002:set': <function SafeConstructor.construct_yaml_set>, 'tag:yaml.org,2002:str': <function SafeConstructor.construct_yaml_str>, 'tag:yaml.org,2002:seq': <function SafeConstructor.construct_yaml_seq>, 'tag:yaml.org,2002:map': <function SafeConstructor.construct_yaml_map>, None: <function SafeConstructor.construct_undefined>, '!include': <function SafeIncludeLoader.include>}¶