mllaunchpad package

Top-level package for ML Launchpad.

class mllaunchpad.ModelInterface(contents=None)[source]

Bases: abc.ABC

Abstract model interface for Data-Scientist-created models. Please inherit from this class when creating your model to make it usable for ModelApi.

You don’t need to create this object yourself when training. It is created automatically and the model/info returned from create_trained_model is made accessible to you through the self.contents attribute.

abstract predict(model_conf, data_sources, data_sinks, model, args_dict)[source]

Implement this method, including data prep/feature creation based on args_dict. args_dict can also contain an id which the model can use to fetch data from any of the data_sources. (Feel free to put common code for preparing data into another function, class, library, …)

Params:

model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources, as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file
model: your model object (whatever you returned in create_trained_model)
args_dict: parameters the API was called with, as a dict of strings (any type conversion needs to be done by you)

Return:

Prediction result as a dictionary/list structure which will be automatically turned into JSON.
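
For illustration, a minimal implementation could look like the following sketch (the class name, the feature names, and the scikit-learn-style model object are assumptions, not part of the API):

import pandas as pd
from mllaunchpad import ModelInterface, order_columns

class MyExampleModel(ModelInterface):
    def predict(self, model_conf, data_sources, data_sinks, model, args_dict):
        # args_dict values arrive as strings; any type conversion is up to you:
        X = order_columns(pd.DataFrame({
            "sepal_length": [float(args_dict["sepal_length"])],
            "sepal_width": [float(args_dict["sepal_width"])],
        }))
        # `model` is whatever create_trained_model returned (here assumed
        # to be a fitted scikit-learn estimator):
        return {"prediction": model.predict(X).tolist()}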

class mllaunchpad.ModelMakerInterface[source]

Bases: abc.ABC

Abstract model factory interface for Data-Scientist-created models. Please inherit from this class and put your training code into the method “create_trained_model”. This method will be called by the framework when your model needs to be (re)trained.

Why not simply use static methods?

We want to make it possible for create_trained_model to pass extra info to test_trained_model without extending the latter with optional keyword arguments that might be confusing in the 90% of cases where they are not needed. So we rely on the person inheriting from this class to find a solution/shortcut if they want to do more involved things, e.g. perform the train/test split themselves.

abstract create_trained_model(model_conf, data_sources, data_sinks, old_model=None)[source]

Implement this method, including data prep/feature creation. No need to test your model here. Put testing code in test_trained_model, which will be called automatically after training. (Feel free to put common code for preparing data into another function, class, library, …)

Params:

model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file; usually unused when training
old_model: an old model, if one exists, which can be used for incremental training; default: None

Return:

The trained model/data/anything which you want to use in the predict() function. Usually this is simply a fitted model object, but it can be anything, like a dict of several models, a model with some extra info, etc. Whatever you return here gets automatically stuffed into your ModelInterface-inherited object and is accessible there through predict’s model parameter (or the self.contents attribute).

abstract test_trained_model(model_conf, data_sources, data_sinks, model)[source]

Implement this method, including data prep/feature creation. This method will be called to re-test a model, e.g. to check whether it has to be re-trained. (Feel free to put common code for preparing data into another function, class, library, …)

Params:

model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file; usually unused when testing
model: your model object (whatever you returned in create_trained_model)

Return:

A dict of metrics (like ‘accuracy’, ‘f1’, ‘confusion_matrix’, etc.)
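
A minimal sketch of a complete implementation (the datasource names, the target column, and the scikit-learn estimator are assumptions for illustration):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from mllaunchpad import ModelMakerInterface, order_columns

class MyExampleModelMaker(ModelMakerInterface):
    def create_trained_model(self, model_conf, data_sources, data_sinks, old_model=None):
        df = data_sources["my_train_datasource"].get_dataframe()
        X = order_columns(df.drop(columns=["target"]))
        # The returned object becomes `model`/`self.contents` in ModelInterface:
        return LogisticRegression().fit(X, df["target"])

    def test_trained_model(self, model_conf, data_sources, data_sinks, model):
        df = data_sources["my_test_datasource"].get_dataframe()
        X = order_columns(df.drop(columns=["target"]))
        return {"accuracy": accuracy_score(df["target"], model.predict(X))}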

mllaunchpad.get_validated_config(filename: str = './LAUNCHPAD_CFG.yml') dict[source]

Read the configuration from file and return it as a dict object.

Parameters

filename (optional str, default: environment variable LAUNCHPAD_CFG or file ./LAUNCHPAD_CFG.yml) – Path to configuration file

Returns

dict with configuration

Return type

dict

mllaunchpad.get_validated_config_str(io: Union[AnyStr, TextIO]) dict[source]

Read the configuration from a string or open file and return it as a dict object. This function exists mainly for making debugging and unit testing your model’s code easier.

Parameters

io (str or open text file handle) – Configuration as a unicode string, a byte string, or an open text file to read from

Returns

configuration

Return type

dict
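
Example (the YAML fragment is a minimal made-up configuration for illustration; a real configuration needs all keys required by your setup):

import mllaunchpad as mllp

cfg_yaml = """
model_store:
  location: ./model_store
model:
  name: my_model
  version: 1.0.0
  module: my_model_module
"""
my_cfg = mllp.get_validated_config_str(cfg_yaml)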

mllaunchpad.list_models(model_store_location_or_config_dict: Union[Dict, str])[source]

Get information on all available versions of trained models.

Parameters

model_store_location_or_config_dict (Union[Dict, str]) – Location of the model store. If you have a config dict available, use that instead.

Side note: The return value includes backups of models that have been re-trained without changing the version number (they reside in the subdirectory previous). Please note that these backed-up models are listed for information only and are not available for loading (one would have to restore them by moving them up a directory level from previous).

Example:

import mllaunchpad as mllp
my_cfg = mllp.get_validated_config("./my_config_file.yml")
all_models = mllp.list_models(my_cfg)  # also accepts model store location string

# An example of what a ``list_models()``'s result would look like:
# {
#     iris: {
#         1.0.0: { ... complete metadata of this version number ... },
#         1.1.0: { ... },
#         latest: { ... duplicate of metadata of highest available version number, here 1.1.0 ... },
#         backups: [ {...}, {...}, ... ]
#     },
#     my_other_model: {
#         1.0.1: { ... },
#         2.0.0: { ... },
#         latest: { ... },
#         backups: []
#     }
# }
Returns

Dict with information on all available trained models.

mllaunchpad.order_columns(obj: Union[pandas.core.frame.DataFrame, numpy.ndarray, Dict])[source]

Order the columns of a DataFrame, a dict, or a Numpy structured array. Use this on your training data right before passing it into the model; this guarantees that the model is trained with a reproducible column order.

Do the same in your test code.

Most importantly, also use this in your predict method, as the incoming args_dict does not have a deterministic order.

Params:

obj: a DataFrame, a dict, or a Numpy structured array

Returns:

The obj with columns ordered lexicographically
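
Example:

import pandas as pd
import mllaunchpad as mllp

df = pd.DataFrame({"b": [1, 2], "a": [3, 4], "c": [5, 6]})
ordered = mllp.order_columns(df)
print(list(ordered.columns))  # ['a', 'b', 'c']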

mllaunchpad.predict(complete_conf: Dict, arg_dict: Optional[Dict] = None, cache: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None, use_live_code: bool = False)[source]

Carry out prediction for the model specified in the configuration.

Parameters
  • complete_conf (dict) – configuration dict

  • cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.

  • arg_dict (optional Dict, default: None) – Arguments dict for the prediction (analogous to what it would get from a web API)

  • model (optional object implementing ModelInterface, default: None) – Use this model instead of loading one from the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.

  • use_live_code (optional bool, default: False) – Use the current predict function instead of the one persisted with the model in the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.

Returns

model’s prediction output
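
Example (the config file name and arguments are placeholders):

import mllaunchpad as mllp

my_cfg = mllp.get_validated_config("./my_config_file.yml")
output = mllp.predict(my_cfg, arg_dict={"sepal_length": "5.1", "sepal_width": "3.5"}, use_live_code=True)
print(output)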

mllaunchpad.report(name: str, value) None

Add a piece of information to the train report during training.

The train report is part of the model’s metadata that is saved to the model store. Use mllaunchpad.list_models() to query metadata from the model store.

This function is supposed to be called from your create_trained_model() or test_trained_model() implementation. You can pass any values that are JSON-able, same as with test_trained_model()’s returned metrics.

If the value is a DataFrame, it will be summarized (using DataFrame.describe()). You can use this, for example, to improve the traceability of your trained models and for basic sanity checks of the training data distribution.

Parameters
  • name (str) – Key to save the information under (e.g. “meaning_of_life”)

  • value (str, number, list, dict, Numpy Array or Pandas DataFrame) – Value to save. Any JSON-able value or structure will work. Pandas DataFrames will be summarized instead of saved.
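
Example sketch (in real use, call this from within your create_trained_model() or test_trained_model() implementation):

import pandas as pd
import mllaunchpad as mllp

# e.g. inside create_trained_model():
mllp.report("meaning_of_life", 42)
df = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})
mllp.report("training_data_summary", df)  # DataFrames get summarized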

mllaunchpad.retest(complete_conf: Dict, cache: bool = True, persist: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]

Retest a model as specified in the configuration and persist its test metrics in the model store.

Parameters
  • complete_conf (dict) – configuration dict

  • cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.

  • persist (optional bool, default: True) – Whether to update the model in the model store with the test metrics. This parameter exists mainly for making debugging and unit testing your model’s code easier.

  • model (optional object implementing ModelInterface, default: None) – Test this model instead of loading it from model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.

Returns

test_metrics

mllaunchpad.train_model(complete_conf: Dict, cache: bool = True, persist: bool = True, test: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]

Train and test a model as specified in the configuration and persist it in the model store.

Parameters
  • complete_conf (dict) – configuration dict

  • cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.

  • persist (optional bool, default: True) – Whether to store the trained model in the location configured by model_store:. This parameter exists mainly for making debugging and unit testing your model’s code easier.

  • test (optional bool, default: True) – Whether to test the model after training. This parameter exists mainly for making debugging and unit testing your model’s code easier.

  • model (optional object implementing ModelInterface, default: None) – Use this model as the previous model instead of trying to load it from the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.

Returns

Tuple of (object implementing ModelInterface, metrics)
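
Example (the config file name is a placeholder):

import mllaunchpad as mllp

my_cfg = mllp.get_validated_config("./my_config_file.yml")
# During development, train and test without persisting to the model store:
model, metrics = mllp.train_model(my_cfg, persist=False)
print(metrics)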

Submodules

mllaunchpad.api module

This module contains functionality for generic creation and handling of RESTful APIs for Machine Learning models. Among other things, it handles parsing the RAML definition and validating parameters.

class mllaunchpad.api.GetByIdResource(model_api_obj, parser, id_name)[source]

Bases: flask_restful.Resource

get(some_resource_id)[source]
methods: Optional[List[str]] = {'GET'}

A list of methods this view can handle.

class mllaunchpad.api.ModelApi(config, application, debug=False)[source]

Bases: object

Class to plug a Data-Scientist-created model into.

This class handles the heavy lifting of APIs for the model.

The model is a delegate which inherits from (=implements) ModelInterface. It needs to provide a predict function.

For details, see the documentation in the module model_interface

predict_using_model(args_dict)[source]
class mllaunchpad.api.QueryOrFileUploadResource(model_api_obj, query_parser=None, file_parser=None)[source]

Bases: flask_restful.Resource

get()[source]
methods: Optional[List[str]] = {'GET', 'POST'}

A list of methods this view can handle.

post()[source]
class mllaunchpad.api.QueryResource(model_api_obj, parser)[source]

Bases: flask_restful.Resource

get()[source]
methods: Optional[List[str]] = {'GET'}

A list of methods this view can handle.

mllaunchpad.api.generate_raml(complete_conf, data_source_name=None, data_frame=None, resource_name='mythings')[source]
mllaunchpad.api.get_api_base_url(config)[source]

mllaunchpad.cli module

This module provides the command line interface for ML Launchpad

class mllaunchpad.cli.AliasedGroup(name: Optional[str] = None, commands: Optional[Union[Dict[str, click.core.Command], Sequence[click.core.Command]]] = None, **attrs: Any)[source]

Bases: click.core.Group

Commands can be abbreviated, e.g. t or tr for train, a for api, etc.

get_command(ctx, cmd_name)[source]

Given a context and a command name, this returns a Command object if it exists or returns None.

class mllaunchpad.cli.Settings[source]

Bases: object

property config

mllaunchpad.config module

This module contains functionality for reading and validating the configuration.

mllaunchpad.config.check_semantics(config_dict)[source]
mllaunchpad.config.get_validated_config(filename: str = './LAUNCHPAD_CFG.yml') dict[source]

Read the configuration from file and return it as a dict object.

Parameters

filename (optional str, default: environment variable LAUNCHPAD_CFG or file ./LAUNCHPAD_CFG.yml) – Path to configuration file

Returns

dict with configuration

Return type

dict

mllaunchpad.config.get_validated_config_str(io: Union[AnyStr, TextIO]) dict[source]

Read the configuration from a string or open file and return it as a dict object. This function exists mainly for making debugging and unit testing your model’s code easier.

Parameters

io (str or open text file handle) – Configuration as unicode string or b”byte string” or a open text file to read from

Returns

configuration

Return type

dict

mllaunchpad.config.validate_config(config_dict, required, path='')[source]

mllaunchpad.datasources module

class mllaunchpad.datasources.FileDataSink(identifier: str, datasink_config: Dict)[source]

Bases: mllaunchpad.resource.DataSink

DataSink for putting data into files.

See serves for the available types.

Configuration example:

datasinks:
  # ... (other datasinks)
  my_datasink:
    type: euro_csv  # `euro_csv` changes separators to ";" and decimals to "," w.r.t. `csv`
    path: /some/file.csv  # Can be URL, uses `df.to_csv` internally
    tags: [train] # generic parameter, see documentation on DataSources and DataSinks
    options: {}   # used as **kwargs when saving the data using `df.to_csv`
    dtypes_path: ./some/file.dtypes # optional: location for saving the csv's column dtypes info
  my_raw_datasink:
    type: text_file  # raw files can also be of type `binary_file`
    path: /some/file.txt  # Can be URL
    tags: [train] # generic parameter, see documentation on DataSources and DataSinks
    options: {}   # used as **kwargs when writing the data using `fh.write`

When saving csv or euro_csv type formats, you can use the setting dtypes_path to specify a location where to save a dtypes description for the csv (which you can use later with FileDataSource’s dtypes_path setting). These dtypes will be enforced when reading the csv back, which helps avoid problems when pandas.read_csv interprets data differently than you do, and ensures dtype parity between csv datasinks and datasources.

Using the raw formats binary_file and text_file, you can persist arbitrary data, as long as it can be represented as a bytes or a str object, respectively. text_file uses UTF-8 encoding. Please note that while possible, it is not recommended to persist DataFrames this way, because by adding format-specific code to your model, you’re giving up your code’s independence from the type of DataSource/DataSink. Here’s an example for pickling an arbitrary object:

# config fragment:
datasinks:
  # ...
  my_pickle_datasink:
    type: binary_file
    path: /some/file.pickle
    tags: [train]
    options: {}

# code fragment:
import pickle
# ...
# in predict/test/train code:
my_pickle = pickle.dumps(my_object)
data_sinks["my_pickle_datasink"].put_raw(my_pickle)
put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None[source]

Write a pandas dataframe to a file, and optionally its dtypes description if dtypes_path is included in the configuration. The default is not to save the dataframe’s row index. Configure the DataSink’s options dict to pass keyword arguments to df.to_csv. If the directory path leading to the file does not exist, it will be created.

Example:

data_sinks["my_datasink"].put_dataframe(my_df)
Parameters
  • dataframe (pandas DataFrame) – The pandas dataframe to save

  • params (optional dict) – Currently not implemented

  • chunksize (optional int) – Currently not implemented

put_raw(raw_data: Union[str, bytes], params: Optional[Dict] = None, chunksize: Optional[int] = None) None[source]

Write raw (unstructured) data to file. If the directory path leading to the file does not exist, it will be created.

Example:

data_sinks["my_raw_datasink"].put_raw(my_data)
Parameters
  • raw_data (bytes or str) – The data to save (bytes for binary, string for text file)

  • params (optional dict) – Currently not implemented

  • chunksize (optional int) – Currently not implemented

serves: List[str] = ['csv', 'euro_csv', 'text_file', 'binary_file']
class mllaunchpad.datasources.FileDataSource(identifier: str, datasource_config: Dict)[source]

Bases: mllaunchpad.resource.DataSource

DataSource for fetching data from files.

See serves for the available types.

Configuration example:

datasources:
  # ... (other datasources)
  my_datasource:
    type: euro_csv  # `euro_csv` changes separators to ";" and decimals to "," w.r.t. `csv`
    path: /some/file.csv  # Can be URL, uses `pandas.read_csv` internally
    expires: 0    # generic parameter, see documentation on DataSources
    tags: [train] # generic parameter, see documentation on DataSources and DataSinks
    options: {}   # used as **kwargs when fetching the data using `pandas.read_csv`
    dtypes_path: ./some/file.dtypes # optional: location with the csv's column dtypes info
  my_raw_datasource:
    type: text_file  # raw files can also be of type `binary_file`
    path: /some/file.txt  # Can be URL
    expires: 0    # generic parameter, see documentation on DataSources
    tags: [train] # generic parameter, see documentation on DataSources and DataSinks
    options: {}   # used as **kwargs when fetching the data using `fh.read`

When loading csv or euro_csv type formats, you can use the setting dtypes_path to specify a location with a dtypes description for the csv (usually generated earlier using FileDataSink’s dtypes_path setting). These dtypes will be enforced when reading the csv, which helps avoid problems when pandas.read_csv interprets data differently than you do, and ensures dtype parity between csv datasinks and datasources.

Using the raw formats binary_file and text_file, you can read arbitrary data, as long as it can be represented as a bytes or a str object, respectively. text_file uses UTF-8 encoding. Please note that while possible, it is not recommended to persist DataFrames this way, because by adding format-specific code to your model, you’re giving up your code’s independence from the type of DataSource/DataSink. Here’s an example for unpickling an arbitrary object:

# config fragment:
datasources:
  # ...
  my_pickle_datasource:
    type: binary_file
    path: /some/file.pickle
    tags: [train]
    options: {}

# code fragment:
import pickle
# ...
# in predict/test/train code:
my_pickle = data_sources["my_pickle_datasource"].get_raw()
my_object = pickle.loads(my_pickle)
get_dataframe(params: Optional[Dict] = None, chunksize: Optional[int] = None)

Get data as a pandas dataframe.

Example:

data_sources["my_datasource"].get_dataframe()
Parameters
  • params (optional dict) – Currently not implemented

  • chunksize (optional int) – Return an iterator where chunksize is the number of rows to include in each chunk.

Returns

DataFrame object, possibly cached according to config value of expires:

get_raw(params: Optional[Dict] = None, chunksize: Optional[int] = None)

Get data as raw (unstructured) data.

Example:

data_sources["my_raw_datasource"].get_raw()
Parameters
  • params (optional dict) – Currently not implemented

  • chunksize (optional int) – Currently not implemented

Returns

The file’s bytes (binary) or string (text) contents, possibly cached according to config value of expires:

Return type

bytes or str

serves: List[str] = ['csv', 'euro_csv', 'text_file', 'binary_file']
class mllaunchpad.datasources.OracleDataSink(identifier: str, datasink_config: Dict, dbms_config: Dict)[source]

Bases: mllaunchpad.resource.DataSink

DataSink for Oracle database connections.

Creates a long-lived connection on initialization.

Configuration example:

dbms:
  # ... (other connections)
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: oracle
    host: host.example.com
    port: 1251
    user_var: MY_USER_ENV_VAR
    password_var: MY_PW_ENV_VAR  # optional
    service_name: servicename.example.com
    options: {}  # used as **kwargs when initializing the DB connection
# ...
datasinks:
  # ... (other datasinks)
  my_datasink:
    type: dbms.my_connection
    table: somewhere.my_table
    tags: [train] # generic parameter, see documentation on DataSources and DataSinks
    options: {}   # used as **kwargs when storing the table using `my_df.to_sql`
put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None[source]

Store the pandas dataframe as a table. The default is not to store the dataframe’s row index. Configure the DataSink’s options dict to pass keyword arguments to df.to_sql.

Example:

data_sinks["my_datasink"].put_dataframe(my_df)
Parameters
  • dataframe (pandas DataFrame) – The pandas dataframe to store

  • params (optional dict) – Currently not implemented

  • chunksize (optional int) – Currently not implemented

put_raw(raw_data, params: Optional[Dict] = None, chunksize: Optional[int] = None) None[source]

Not implemented.

Raises

NotImplementedError – Raw/blob format currently not supported.

serves: List[str] = ['dbms.oracle']
class mllaunchpad.datasources.OracleDataSource(identifier: str, datasource_config: Dict, dbms_config: Dict)[source]

Bases: mllaunchpad.resource.DataSource

DataSource for Oracle database connections.

Creates a long-lived connection on initialization.

Configuration example:

dbms:
  # ... (other connections)
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: oracle
    host: host.example.com
    port: 1251
    user_var: MY_USER_ENV_VAR
    password_var: MY_PW_ENV_VAR  # optional
    service_name: servicename.example.com
    options: {}  # used as **kwargs when initializing the DB connection
# ...
datasources:
  # ... (other datasources)
  my_datasource:
    type: dbms.my_connection
    query: SELECT * FROM somewhere.my_table where id = :id  # fill `:params` by calling `get_dataframe` with a `dict`
    expires: 0    # generic parameter, see documentation on DataSources
    tags: [train] # generic parameter, see documentation on DataSources and DataSinks
    options: {}   # used as **kwargs when fetching the query using `pandas.read_sql`
get_dataframe(params: Optional[Dict] = None, chunksize: Optional[int] = None)

Get the data as pandas dataframe.

Null values are replaced by numpy.nan.

Example:

data_sources["my_datasource"].get_dataframe({"id": 387})
Parameters
  • params (optional dict) – Query parameters to fill in query (e.g. replace query’s :id parameter with value 387)

  • chunksize (optional int) – Return an iterator where chunksize is the number of rows to include in each chunk.

Returns

DataFrame object, possibly cached according to config value of expires:

get_raw(params: Optional[Dict] = None, chunksize: Optional[int] = None)

Not implemented.

Raises

NotImplementedError – Raw/blob format currently not supported.

serves: List[str] = ['dbms.oracle']
class mllaunchpad.datasources.SqlDataSink(identifier: str, datasource_config: Dict, dbms_config: Dict)[source]

Bases: mllaunchpad.resource.DataSink

DataSink for RedShift, Postgres, MySQL, SQLite, Oracle, Microsoft SQL (ODBC), and their dialects.

Uses SQLAlchemy under the hood, and as such, manages a connection pool automatically.

Please configure the dbms:<name>:connection_string:, which is a standard RFC-1738 URL with the syntax dialect[+driver]://[user:password@][host]/[dbname][?key=value..]. The exact URL is specific to the database you want to connect to; examples for all supported database dialects can be found in the SQLAlchemy documentation.

Depending on the dialect you want to use, you might need to install additional drivers and packages. For example, for connecting to a kerberized Impala instance via ODBC, you need to:

  1. Install Impala ODBC drivers for your OS,

  2. pip install winkerberos thrift_sasl pyodbc sqlalchemy # use pykerberos instead on non-Windows systems

If you are tasked with connecting to a particular database system and don’t know where to start, researching how to connect to it from SQLAlchemy is a good starting point.

Other configuration in the dbms: section (besides connection_string:) is optional, but can be provided if deemed necessary:

  • Any dbms:-level settings other than type:, connection_string: and options: will be passed as additional keyword arguments to SQLAlchemy’s create_engine.

  • Any key-value pairs inside dbms:<name>:options: {} will be passed to SQLAlchemy as connect_args. If you append _var to the end of an argument key, its value will be interpreted as an environment variable name which ML Launchpad will attempt to get a value from. This can be useful for information like passwords which you do not want to store in the configuration file.

Configuration example:

dbms:
  # ... (other connections)
  # Example for connecting to a kerberized Impala instance via ODBC:
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: sql
    connection_string: mssql+pyodbc:///default?&driver=Cloudera+ODBC+Driver+for+Impala&host=servername.somedomain.com&port=21050&authmech=1&krbservicename=impala&ssl=1&usesasl=1&ignoretransactions=1&usesystemtruststore=1
    # pyodbc alternative: mssql+pyodbc:///?odbc_connect=DRIVER%3D%7BCloudera+ODBC+Driver+for+Impala%7D%3BHOST%3Dservername.somedomain.com%3BPORT%3D21050%3BAUTHMECH%3D1%3BKRBSERVICENAME%3Dimpala%3BSSL%3D1%3BUSESASL%3D1%3BIGNORETRANSACTIONS%3D1%3BUSESYSTEMTRUSTSTORE%3D1
    echo: True  # example for an additional SQLAlchemy keyword argument (logs the SQL) -- these are optional
    options: {}  # used as `connect_args` when creating the SQLAlchemy engine
# ...
datasinks:
  # ... (other datasinks)
  my_datasink:
    type: dbms.my_connection
    table: somewhere.my_table
    tags: [train] # generic parameter, see documentation on DataSources and DataSinks
    options: {}   # used as **kwargs when storing the table using `my_df.to_sql`
put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None[source]

Store the pandas dataframe as a table. The default is not to store the dataframe’s row index. Configure the DataSink’s options dict to pass keyword arguments to df.to_sql.

Example:

data_sinks["my_datasink"].put_dataframe(my_df)
Parameters
  • dataframe (pandas DataFrame) – The pandas dataframe to store

  • params (optional dict) – Currently not implemented

  • chunksize (optional int) – Currently not implemented

put_raw(raw_data, params: Optional[Dict] = None, chunksize: Optional[int] = None) None[source]

Not implemented.

Raises

NotImplementedError – Raw/blob format currently not supported.

serves: List[str] = ['dbms.sql']
class mllaunchpad.datasources.SqlDataSource(identifier: str, datasource_config: Dict, dbms_config: Dict)[source]

Bases: mllaunchpad.resource.DataSource

DataSource for RedShift, Postgres, MySQL, SQLite, Oracle, Microsoft SQL (ODBC), and their dialects.

Uses SQLAlchemy under the hood, and as such, manages a connection pool automatically.

Please configure the dbms:<name>:connection_string:, which is a standard RFC-1738 URL with the syntax dialect[+driver]://[user:password@][host]/[dbname][?key=value..]. The exact URL is specific to the database you want to connect to; examples for all supported database dialects can be found in the SQLAlchemy documentation.

Depending on the dialect you want to use, you might need to install additional drivers and packages. For example, for connecting to a kerberized Impala instance via ODBC, you need to:

  1. Install Impala ODBC drivers for your OS,

  2. pip install winkerberos thrift_sasl pyodbc sqlalchemy # use pykerberos instead on non-Windows systems

If you are tasked with connecting to a particular database system and don’t know where to start, researching how to connect to it from SQLAlchemy is a good starting point.

Other configuration in the dbms: section (besides connection_string:) is optional, but can be provided if deemed necessary:

  • Any dbms:-level settings other than type:, connection_string: and options: will be passed as additional keyword arguments to SQLAlchemy’s create_engine.

  • Any key-value pairs inside dbms:<name>:options: {} will be passed to SQLAlchemy as connect_args. If you append _var to the end of an argument key, its value will be interpreted as an environment variable name which ML Launchpad will attempt to get a value from. This can be useful for information like passwords which you do not want to store in the configuration file.

Configuration example:

dbms:
  # ... (other connections)
  # Example for connecting to a kerberized Impala instance via ODBC:
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: sql
    connection_string: mssql+pyodbc:///default?&driver=Cloudera+ODBC+Driver+for+Impala&host=servername.somedomain.com&port=21050&authmech=1&krbservicename=impala&ssl=1&usesasl=1&ignoretransactions=1&usesystemtruststore=1
    # pyodbc alternative: mssql+pyodbc:///?odbc_connect=DRIVER%3D%7BCloudera+ODBC+Driver+for+Impala%7D%3BHOST%3Dservername.somedomain.com%3BPORT%3D21050%3BAUTHMECH%3D1%3BKRBSERVICENAME%3Dimpala%3BSSL%3D1%3BUSESASL%3D1%3BIGNORETRANSACTIONS%3D1%3BUSESYSTEMTRUSTSTORE%3D1
    echo: True  # example for an additional SQLAlchemy keyword argument (logs the SQL) -- these are optional
    options: {}  # used as `connect_args` when creating the SQLAlchemy engine
# ...
datasources:
  # ... (other datasources)
  my_datasource:
    type: dbms.my_connection
    query: SELECT * FROM somewhere.my_table WHERE id = :id  # fill `:params` by calling `get_dataframe` with a `dict`
    expires: 0    # generic parameter, see documentation on DataSources
    tags: [train] # generic parameter, see documentation on DataSources and DataSinks
    options: {}   # used as **kwargs when fetching the query using `pandas.read_sql`
get_dataframe(params: Optional[Dict] = None, chunksize: Optional[int] = None)

Get the data as pandas dataframe.

Null values are replaced by numpy.nan.

Example:

my_df = data_sources["my_datasource"].get_dataframe({"id": 387})
Parameters
  • params (optional dict) – Query parameters to fill in query (e.g. replace query’s :id parameter with value 387)

  • chunksize (optional int) – Return an iterator where chunksize is the number of rows to include in each chunk.

Returns

DataFrame object, possibly cached according to config value of expires:

get_raw(params: Optional[Dict] = None, chunksize: Optional[int] = None)

Not implemented.

Raises

NotImplementedError – Raw/blob format currently not supported.

serves: List[str] = ['dbms.sql']
mllaunchpad.datasources.ensure_dir_to(file_path)[source]
mllaunchpad.datasources.fill_nas(df: pandas.core.frame.DataFrame, as_generator: bool = False) Union[pandas.core.frame.DataFrame, Generator][source]
mllaunchpad.datasources.get_connection_args(dbms_config: Dict) Dict[source]

Fill “_var”-suffixed configuration items from environment variables
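
For illustration, a sketch of the intended behavior (the configuration keys, the environment variable name, and the exact return shape are assumptions based on the description above):

import os
from mllaunchpad.datasources import get_connection_args

os.environ["MY_PW_ENV_VAR"] = "secret"
dbms_config = {"type": "sql", "options": {"user": "me", "password_var": "MY_PW_ENV_VAR"}}
connection_args = get_connection_args(dbms_config)
# connection_args should now contain the password resolved from the
# environment, e.g. {"user": "me", "password": "secret"} (assumption)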

mllaunchpad.logutil module

mllaunchpad.logutil.init_logging(filename='./LAUNCHPAD_LOG.yml', verbose=False)[source]

Only called from the wsgi or cli module (i.e. when running ML Launchpad as an app). It is important not to change the logging/warnings configuration from library-only code.

mllaunchpad.model_actions module

Convenience functions for executing training, testing and prediction

mllaunchpad.model_actions.clear_caches()[source]
mllaunchpad.model_actions.predict(complete_conf: Dict, arg_dict: Optional[Dict] = None, cache: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None, use_live_code: bool = False)[source]

Carry out prediction for the model specified in the configuration.

Parameters
  • complete_conf (dict) – configuration dict

  • cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.

  • arg_dict (optional Dict, default: None) – Arguments dict for the prediction (analogous to what it would get from a web API)

  • model (optional object implementing ModelInterface, default: None) – Use this model instead of loading one from the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.

  • use_live_code (optional bool, default: False) – Use the current predict function instead of the one persisted with the model in the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.

Returns

model’s prediction output

mllaunchpad.model_actions.retest(complete_conf: Dict, cache: bool = True, persist: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]

Retest a model as specified in the configuration and persist its test metrics in the model store.

Parameters
  • complete_conf (dict) – configuration dict

  • cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.

  • persist (optional bool, default: True) – Whether to update the model in the model store with the test metrics. This parameter exists mainly for making debugging and unit testing your model’s code easier.

  • model (optional object implementing ModelInterface, default: None) – Test this model instead of loading it from model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.

Returns

test_metrics

mllaunchpad.model_actions.train_model(complete_conf: Dict, cache: bool = True, persist: bool = True, test: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]

Train and test a model as specified in the configuration and persist it in the model store.

Parameters
  • complete_conf (dict) – configuration dict

  • cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.

  • persist (optional bool, default: True) – Whether to store the trained model in the location configured by model_store:. This parameter exists mainly for making debugging and unit testing your model’s code easier.

  • test (optional bool, default: True) – Whether to test the model after training. This parameter exists mainly for making debugging and unit testing your model’s code easier.

  • model (optional object implementing ModelInterface, default: None) – Use this model as the previous model instead of trying to load it from the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.

Returns

Tuple of (object implementing ModelInterface, metrics)

mllaunchpad.model_actions.train_report() Iterator[Dict[str, Any]][source]

mllaunchpad.model_interface module

class mllaunchpad.model_interface.ModelInterface(contents=None)[source]

Bases: abc.ABC

Abstract model interface for Data-Scientist-created models. Please inherit from this class when creating your model to make it usable for ModelApi.

You don’t need to create this object yourself when training. It is created automatically and the model/info returned from create_trained_model is made accessible to you through the self.contents attribute.

abstract predict(model_conf, data_sources, data_sinks, model, args_dict)[source]

Implement this method, including data prep/feature creation based on args_dict. args_dict can also contain an id which the model can use to fetch data from any of the data_sources. (Feel free to put common code for preparing data into another function, class, library, …)

Params:

model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources, as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file
model: your model object (whatever you returned in create_trained_model)
args_dict: parameters the API was called with, as a dict of strings (any type conversion needs to be done by you)

Return:

Prediction result as a dictionary/list structure which will be automatically turned into JSON.

class mllaunchpad.model_interface.ModelMakerInterface[source]

Bases: abc.ABC

Abstract model factory interface for Data-Scientist-created models. Please inherit from this class and put your training code into the method “create_trained_model”. This method will be called by the framework when your model needs to be (re)trained.

Why not simply use static methods?

We want to make it possible for create_trained_model to pass extra info to test_trained_model without extending the latter with optional keyword arguments that might be confusing in the 90% of cases where they are not needed. So we rely on the person inheriting from this class to find a solution/shortcut if they want to do more involved things, e.g. perform the train/test split themselves.

abstract create_trained_model(model_conf, data_sources, data_sinks, old_model=None)[source]

Implement this method, including data prep/feature creation. No need to test your model here. Put testing code in test_trained_model, which will be called automatically after training. (Feel free to put common code for preparing data into another function, class, library, …)

Params:

model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file; usually unused when training
old_model: an old model, if one exists, which can be used for incremental training; default: None

Return:

The trained model/data/anything which you want to use in the predict() function. Usually this is simply a fitted model object, but it can be anything, like a dict of several models, a model with some extra info, etc. Whatever you return here gets automatically stuffed into your ModelInterface-inherited object and is accessible there through predict’s model parameter (or the self.contents attribute).

abstract test_trained_model(model_conf, data_sources, data_sinks, model)[source]

Implement this method, including data prep/feature creation. This method will be called to re-test a model, e.g. to check whether it has to be re-trained. (Feel free to put common code for preparing data into another function, class, library, …)

Params:

model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file; usually unused when testing
model: your model object (whatever you returned in create_trained_model)

Return:

A dict of metrics (like ‘accuracy’, ‘f1’, ‘confusion_matrix’, etc.)

mllaunchpad.resource module

class mllaunchpad.resource.CacheDict(*args, **kwds)[source]

Bases: collections.OrderedDict

class mllaunchpad.resource.CachedDataSource(name, bases, dct)[source]

Bases: type

Metaclass that automatically applies the @cached decorator to data getters. https://stackoverflow.com/questions/10067262/automatically-decorating-every-instance-method-in-a-class

classmethod cached(func)[source]

This decorator is automatically applied to get_dataframe and get_raw methods to enable caching.

class mllaunchpad.resource.DataSink(identifier: str, datasink_config: Dict, sub_config: Optional[Dict] = None)[source]

Bases: object

Interface used by the Data Scientist’s model to persist data (usually prediction results). Concrete DataSinks (for files, databases, etc.) need to inherit from this class.

abstract put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None[source]
abstract put_raw(raw_data: Union[str, bytes], params: Optional[Dict] = None, chunksize: Optional[int] = None) None[source]
serves: List[str] = []
class mllaunchpad.resource.DataSource(identifier: str, datasource_config: Dict, sub_config: Optional[Dict] = None)[source]

Bases: object

Interface used by the Data Scientist’s model to get its data from. Concrete DataSources (for files, databases, etc.) need to inherit from this class.

get_dataframe(params: Optional[Dict] = None, chunksize: Optional[int] = None)[source]
get_raw(params: Optional[Dict] = None, chunksize: Optional[int] = None)[source]
serves: List[str] = []
class mllaunchpad.resource.ModelStore(config: Union[Dict, str])[source]

Bases: object

Deals with persisting and loading models and updating their metrics metadata. Abstracts away how and where the model is kept.

TODO: Smarter querying like ‘get me the model with the currently (next) best metrics which serves a particular API.’

add_to_train_report(name: str, value)[source]
dump_trained_model(complete_conf, model, metrics)[source]

Save a model object in the model store. Some metadata is saved along with the model, including the metrics passed in the metrics parameter.

Params:

complete_conf: the configuration dict
model: the model object to store
metrics: metrics dictionary

Returns:

Nothing

list_models()[source]

Get information on all available versions of trained models.

Side note: This also includes backups of models that have been re-trained without changing the version number (they reside in the subdirectory previous). Please note that these backed-up models are listed for information only and are not available for loading (one would have to restore them by moving them up a directory level from previous).

Example:

from mllaunchpad.resource import ModelStore
ms = ModelStore("./model_store")
all_models = ms.list_models()

# An example of what a ``list_models()``'s result would look like:
# {
#     iris: {
#         1.0.0: { ... complete metadata of this version number ... },
#         1.1.0: { ... },
#         latest: { ... duplicate of metadata of highest available version number, here 1.1.0 ... },
#         backups: [ {...}, {...}, ... ]
#     },
#     my_other_model: {
#         1.0.1: { ... },
#         2.0.0: { ... },
#         latest: { ... },
#         backups: []
#     }
# }
Returns

Dict with information on all available trained models.

load_trained_model(model_conf)[source]

Load a model object from the model store. Some metadata is also loaded along with the model.

Params:

model_conf: the config dict of our model

Returns:

Tuple of model object and metadata dictionary

update_model_metrics(model_conf, metrics)[source]

Update the test metrics for a previously stored model

mllaunchpad.resource.create_data_sources_and_sinks(config: Dict, tags: Optional[Iterable[str]] = None) Tuple[Dict[str, mllaunchpad.resource.DataSource], Dict[str, mllaunchpad.resource.DataSink]][source]

Creates the data sources and data sinks as defined in the configuration dict, filtered by tag.

Params:

config: configuration dictionary
tags: optionally filter for only matching datasources/datasinks; no value(s) = match all

Returns:

Tuple of two dicts (data sources and data sinks) with keys = datasource/datasink names and values = initialized DataSource/DataSink objects
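
Example (the config file name, tag, and datasource name are placeholders):

import mllaunchpad as mllp
from mllaunchpad.resource import create_data_sources_and_sinks

my_cfg = mllp.get_validated_config("./my_config_file.yml")
data_sources, data_sinks = create_data_sources_and_sinks(my_cfg, tags=["train"])
df = data_sources["my_datasource"].get_dataframe()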

mllaunchpad.resource.get_user_pw(user_var: str, password_var: str) Tuple[str, Optional[str]][source]
mllaunchpad.resource.order_columns(obj: Union[pandas.core.frame.DataFrame, numpy.ndarray, Dict])[source]

Order the columns of a DataFrame, a dict, or a Numpy structured array. Use this on your training data right before passing it into the model; this guarantees that the model is trained with a reproducible column order.

Do the same in your test code.

Most importantly, also use this in your predict method, as the incoming args_dict does not have a deterministic order.

Params:

obj: a DataFrame, a dict, or a Numpy structured array

Returns:

The obj with columns ordered lexicographically

mllaunchpad.resource.to_plain_python_obj(possible_ndarray)[source]

mllaunchpad.yaml_loader module

class mllaunchpad.yaml_loader.SafeIncludeLoader(stream)[source]

Bases: yaml.loader.SafeLoader

A subclass of SafeLoader which supports !include file references.
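
For illustration, a config file might reference another file like this (file names are made up; the exact tag syntax is that of the !include constructor registered by this loader):

# config fragment:
model:
  name: my_model
datasources: !include my_datasources.yml  # parsed contents of that file are inserted here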

include(node)[source]
yaml_constructors = {'tag:yaml.org,2002:null': <function SafeConstructor.construct_yaml_null>, 'tag:yaml.org,2002:bool': <function SafeConstructor.construct_yaml_bool>, 'tag:yaml.org,2002:int': <function SafeConstructor.construct_yaml_int>, 'tag:yaml.org,2002:float': <function SafeConstructor.construct_yaml_float>, 'tag:yaml.org,2002:binary': <function SafeConstructor.construct_yaml_binary>, 'tag:yaml.org,2002:timestamp': <function SafeConstructor.construct_yaml_timestamp>, 'tag:yaml.org,2002:omap': <function SafeConstructor.construct_yaml_omap>, 'tag:yaml.org,2002:pairs': <function SafeConstructor.construct_yaml_pairs>, 'tag:yaml.org,2002:set': <function SafeConstructor.construct_yaml_set>, 'tag:yaml.org,2002:str': <function SafeConstructor.construct_yaml_str>, 'tag:yaml.org,2002:seq': <function SafeConstructor.construct_yaml_seq>, 'tag:yaml.org,2002:map': <function SafeConstructor.construct_yaml_map>, None: <function SafeConstructor.construct_undefined>, '!include': <function SafeIncludeLoader.include>}