mllaunchpad package¶
Top-level package for ML Launchpad.
- class mllaunchpad.ModelInterface(contents=None)[source]¶
Bases:
abc.ABC
Abstract model interface for Data-Scientist-created models. Please inherit from this class when creating your model to make it usable for ModelApi.
You don’t need to create this object yourself when training. It is created automatically and the model/info returned from create_trained_model is made accessible to you through the self.contents attribute.
- abstract predict(model_conf, data_sources, data_sinks, model, args_dict)[source]¶
Implement this method, including data prep/feature creation based on args_dict. args_dict can also contain an id which the model can use to fetch data from any of the data_sources. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources
data_sinks: dict containing the data sinks, as configured in the config file
model: your model object (whatever you returned in create_trained_model)
args_dict: parameters the API was called with, as a dict of strings (any type conversion needs to be done by you)
- Return:
Prediction result as a dictionary/list structure which will be automatically turned into JSON.
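For illustration, a minimal sketch of a ModelInterface implementation (the class name, the petal_length argument and the scikit-learn-style model object are assumptions for this example, not part of ML Launchpad):

import mllaunchpad

class MyExampleModel(mllaunchpad.ModelInterface):
    def predict(self, model_conf, data_sources, data_sinks, model, args_dict):
        # args_dict values arrive as strings; any type conversion is up to you
        petal_length = float(args_dict["petal_length"])
        # `model` is whatever create_trained_model returned, here assumed to be
        # a fitted scikit-learn-style estimator
        prediction = model.predict([[petal_length]])
        return {"prediction": prediction.tolist()}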
- class mllaunchpad.ModelMakerInterface[source]¶
Bases:
abc.ABC
Abstract model factory interface for Data-Scientist-created models. Please inherit from this class and put your training code into the method “create_trained_model”. This method will be called by the framework when your model needs to be (re)trained.
- Why not simply use static methods?
We want to make it possible for create_trained_model to pass extra info to test_trained_model without extending the latter with optional keyword arguments that might be confusing for the 90% of cases where they are not needed. So we rely on the smarts of the person inheriting from this class to find a solution or shortcut if they want to do more involved things, e.g. perform the train/test split themselves.
- abstract create_trained_model(model_conf, data_sources, data_sinks, old_model=None)[source]¶
Implement this method, including data prep/feature creation. No need to test your model here. Put testing code in test_trained_model, which will be called automatically after training. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file. Usually unused when training.
old_model: contains an old model, if it exists, which can be used for incremental training. Default: None
- Return:
The trained model/data/anything which you want to use in the predict() function. This is usually simply a fitted model object, but it can be anything, e.g. a dict of several models, or a model with some extra info. Whatever you return here gets automatically stuffed into your ModelInterface-inherited object and is accessible there via predict’s model parameter (or the self.contents attribute).
- abstract test_trained_model(model_conf, data_sources, data_sinks, model)[source]¶
Implement this method, including data prep/feature creation. This method will be called to re-test a model, e.g. to check whether it has to be re-trained. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file. Usually unused when testing.
model: your model object (whatever you returned in create_trained_model)
- Return:
Return a dict of metrics (like ‘accuracy’, ‘f1’, ‘confusion_matrix’, etc.)
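For illustration, a minimal sketch of a ModelMakerInterface implementation (the class name, the datasource names, the column names and the use of scikit-learn are assumptions for this example):

import mllaunchpad
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

class MyExampleModelMaker(mllaunchpad.ModelMakerInterface):
    def create_trained_model(self, model_conf, data_sources, data_sinks, old_model=None):
        df = data_sources["my_train_datasource"].get_dataframe()  # hypothetical datasource name
        clf = DecisionTreeClassifier().fit(df[["petal_length"]], df["species"])
        return clf  # becomes the `model` argument of predict/test_trained_model

    def test_trained_model(self, model_conf, data_sources, data_sinks, model):
        df = data_sources["my_test_datasource"].get_dataframe()  # hypothetical datasource name
        predictions = model.predict(df[["petal_length"]])
        return {"accuracy": accuracy_score(df["species"], predictions)}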
- mllaunchpad.get_validated_config(filename: str = './LAUNCHPAD_CFG.yml') dict [source]¶
Read the configuration from file and return it as a dict object.
- Parameters
filename (optional str, default: environment variable LAUNCHPAD_CFG or file ./LAUNCHPAD_CFG.yml) – Path to configuration file
- Returns
dict with configuration
- Return type
dict
- mllaunchpad.get_validated_config_str(io: Union[AnyStr, TextIO]) dict [source]¶
Read the configuration from a string or open file and return it as a dict object. This function exists mainly for making debugging and unit testing your model’s code easier.
- Parameters
io (str or open text file handle) – Configuration as a unicode string, a byte string, or an open text file to read from
- Returns
configuration
- Return type
dict
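For illustration, a sketch of how this could be used in a unit test (the file name is a placeholder; a plain YAML string works as well):

import mllaunchpad as mllp

with open("tests/test_config.yml", encoding="utf-8") as f:
    cfg = mllp.get_validated_config_str(f)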
- mllaunchpad.list_models(model_store_location_or_config_dict: Union[Dict, str])[source]¶
Get information on all available versions of trained models.
- Parameters
model_store_location_or_config_dict (Union[Dict, str]) – Location of the model store. If you have a config dict available, use that instead.
Side note: The return value includes backups of models that have been re-trained without changing the version number (they reside in the subdirectory previous). Please note that these backed-up models are listed for information only and are not available for loading (one would have to restore them by moving them up one directory level out of previous).
Example:
import mllaunchpad as mllp

my_cfg = mllp.get_validated_config("./my_config_file.yml")
all_models = mllp.list_models(my_cfg)  # also accepts model store location string
# An example of what a ``list_models()``'s result would look like:
# {
#     iris: {
#         1.0.0: { ... complete metadata of this version number ... },
#         1.1.0: { ... },
#         latest: { ... duplicate of metadata of highest available version number, here 1.1.0 ... },
#         backups: [ {...}, {...}, ... ]
#     },
#     my_other_model: {
#         1.0.1: { ... },
#         2.0.0: { ... },
#         latest: { ... },
#         backups: []
#     }
# }
- Returns
Dict with information on all available trained models.
- mllaunchpad.order_columns(obj: Union[pandas.core.frame.DataFrame, numpy.ndarray, Dict])[source]¶
Order the columns of a DataFrame, a dict, or a Numpy structured array. Use this on your training data right before passing it into the model. This will guarantee that the model is trained with a reproducible column order.
Do the same in your test code.
Most importantly, use this also in your predict method, as the incoming args_dict does not have a deterministic order.
- Params:
obj: a DataFrame, a dict, or a Numpy structured array
- Returns:
The obj with columns ordered lexicographically
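For illustration, a small usage sketch (the column names are arbitrary):

import pandas as pd
from mllaunchpad import order_columns

df = pd.DataFrame({"b": [1, 2], "a": [3, 4], "c": [5, 6]})
df = order_columns(df)  # column order is now ["a", "b", "c"]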
- mllaunchpad.predict(complete_conf: Dict, arg_dict: Optional[Dict] = None, cache: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None, use_live_code: bool = False)[source]¶
Carry out prediction for the model specified in the configuration.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
arg_dict (optional Dict, default: None) – Arguments dict for the prediction (analogous to what it would get from a web API)
model (optional object implementing ModelInterface, default: None) – Test this model instead of loading it from model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
use_live_code (optional bool, default: False) – Use the current predict function instead of the one persisted with the model in the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
model’s prediction output
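For illustration, a sketch of calling predict from code, e.g. while debugging (the config file name and the argument key are placeholders):

import mllaunchpad as mllp

cfg = mllp.get_validated_config("./my_config_file.yml")
output = mllp.predict(cfg, arg_dict={"petal_length": "1.4"}, use_live_code=True)
print(output)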
- mllaunchpad.report(name: str, value) None ¶
Add a piece of information to the train report during training.
The train report is part of the model’s metadata that is saved to the model store. Use mllaunchpad.list_models() to query metadata from the model store.
This function is supposed to be called from your create_trained_model() or test_trained_model() implementation. You can pass any values that are JSON-able, same as with test_trained_model()’s returned metrics. However, if the value is a DataFrame, it will be summarized (using pd.describe()). You can use this, for example, to improve the traceability of your trained models and for some basic sanity checks of the training data distribution.
- Parameters
name (str) – Key to save the information under (e.g. “meaning_of_life”)
value (str, number, list, dict, Numpy Array or Pandas DataFrame) – Value to save. Any JSON-able value or structure will work. Pandas DataFrames will be summarized instead of saved.
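For illustration, a sketch of the kind of calls you might make from within your create_trained_model() implementation (the report keys and the DataFrame are stand-ins for your own values and training data):

import pandas as pd
import mllaunchpad

# train_df stands in for your actual training data:
train_df = pd.DataFrame({"petal_length": [1.4, 4.7, 5.1], "species": ["setosa", "versicolor", "virginica"]})
mllaunchpad.report("training_rows", len(train_df))
mllaunchpad.report("hyperparameters", {"max_depth": 5})
mllaunchpad.report("train_data_summary", train_df)  # DataFrames are summarized via describe()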
- mllaunchpad.retest(complete_conf: Dict, cache: bool = True, persist: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]¶
Retest a model as specified in the configuration and persist its test metrics in the model store.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
persist (optional bool, default: True) – Whether to update the model in model_cache: with the test metrics. This parameter exists mainly for making debugging and unit testing your model’s code easier.
model (optional object implementing ModelInterface, default: None) – Test this model instead of loading it from model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
test_metrics
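For illustration, a sketch of retesting from code, e.g. in a notebook (the config file name is a placeholder; persist=False avoids touching the model store while experimenting):

import mllaunchpad as mllp

cfg = mllp.get_validated_config("./my_config_file.yml")
test_metrics = mllp.retest(cfg, persist=False)
print(test_metrics)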
- mllaunchpad.train_model(complete_conf: Dict, cache: bool = True, persist: bool = True, test: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]¶
Train and test a model as specified in the configuration and persist it in the model store.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
persist (optional bool, default: True) – Whether to store the trained model in the location configured by model_cache:. This parameter exists mainly for making debugging and unit testing your model’s code easier.
test (optional bool, default: True) – Whether to test the model after training. This parameter exists mainly for making debugging and unit testing your model’s code easier.
model (optional object implementing ModelInterface, default: None) – Use this model as the previous model instead of trying to load it from the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
Tuple of (object implementing ModelInterface, metrics)
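For illustration, a sketch of training from code instead of the command line (the config file name is a placeholder):

import mllaunchpad as mllp

cfg = mllp.get_validated_config("./my_config_file.yml")
model_wrapper, metrics = mllp.train_model(cfg)
print(metrics)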
Submodules¶
mllaunchpad.api module¶
This module contains functionality for generic creation and handling of RESTful APIs for Machine Learning Models. Among other things, it handles parsing the RAML definition and validating parameters.
- class mllaunchpad.api.GetByIdResource(model_api_obj, parser, id_name)[source]¶
Bases:
flask_restful.Resource
- methods: Optional[List[str]] = {'GET'}¶
A list of methods this view can handle.
- class mllaunchpad.api.ModelApi(config, application, debug=False)[source]¶
Bases:
object
Class to plug a Data-Scientist-created model into.
This class handles the heavy lifting of APIs for the model.
The model is a delegate which inherits from (=implements) ModelInterface. It needs to provide a predict function.
For details, see the documentation in the module model_interface
- class mllaunchpad.api.QueryOrFileUploadResource(model_api_obj, query_parser=None, file_parser=None)[source]¶
Bases:
flask_restful.Resource
- methods: Optional[List[str]] = {'GET', 'POST'}¶
A list of methods this view can handle.
- class mllaunchpad.api.QueryResource(model_api_obj, parser)[source]¶
Bases:
flask_restful.Resource
- methods: Optional[List[str]] = {'GET'}¶
A list of methods this view can handle.
mllaunchpad.cli module¶
This module provides the command line interface for ML Launchpad
mllaunchpad.config module¶
This module contains functionality for reading and validating the configuration.
- mllaunchpad.config.get_validated_config(filename: str = './LAUNCHPAD_CFG.yml') dict [source]¶
Read the configuration from file and return it as a dict object.
- Parameters
filename (optional str, default: environment variable LAUNCHPAD_CFG or file ./LAUNCHPAD_CFG.yml) – Path to configuration file
- Returns
dict with configuration
- Return type
dict
- mllaunchpad.config.get_validated_config_str(io: Union[AnyStr, TextIO]) dict [source]¶
Read the configuration from a string or open file and return it as a dict object. This function exists mainly for making debugging and unit testing your model’s code easier.
- Parameters
io (str or open text file handle) – Configuration as a unicode string, a byte string, or an open text file to read from
- Returns
configuration
- Return type
dict
mllaunchpad.datasources module¶
- class mllaunchpad.datasources.FileDataSink(identifier: str, datasink_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSink
DataSink for putting data into files.
See serves for the available types.
Configuration example:

datasinks:
  # ... (other datasinks)
  my_datasink:
    type: euro_csv  # `euro_csv` changes separators to ";" and decimals to "," w.r.t. `csv`
    path: /some/file.csv  # Can be URL, uses `df.to_csv` internally
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when writing the data using `df.to_csv`
    dtypes_path: ./some/file.dtypes  # optional: location for saving the csv's column dtypes info
  my_raw_datasink:
    type: text_file  # raw files can also be of type `binary_file`
    path: /some/file.txt  # Can be URL
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when writing the data using `fh.write`
When saving csv or euro_csv type formats, you can use the setting dtypes_path to specify a location where to save a dtypes description for the csv (which you can use later with FileDataSource’s dtypes_path setting). These dtypes will be enforced when reading the csv, which helps avoid problems when pandas.read_csv interprets data differently than you do. Use dtypes_path to enforce dtype parity between csv datasinks and datasources.
Using the raw formats binary_file and text_file, you can persist arbitrary data, as long as it can be represented as a bytes or a str object, respectively. text_file uses UTF-8 encoding. Please note that while possible, it is not recommended to persist DataFrames this way, because by adding format-specific code to your model, you’re giving up your code’s independence from the type of DataSource/DataSink. Here’s an example for pickling an arbitrary object:
# config fragment:
datasinks:
  # ...
  my_pickle_datasink:
    type: binary_file
    path: /some/file.pickle
    tags: [train]
    options: {}

# code fragment:
import pickle
# ...
# in predict/test/train code:
my_pickle = pickle.dumps(my_object)
data_sinks["my_pickle_datasink"].put_raw(my_pickle)
- put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Write a pandas dataframe to file and optionally the dtypes if included in the configuration. The default is not to save the dataframe’s row index. Configure the DataSink’s options dict to pass keyword arguments to my_df.to_csv. If the directory path leading to the file does not exist, it will be created.
Example:
data_sinks["my_datasink"].put_dataframe(my_df)
- Parameters
dataframe (pandas DataFrame) – The pandas dataframe to save
params (optional dict) – Currently not implemented
chunksize (optional int) – Currently not implemented
- put_raw(raw_data: Union[str, bytes], params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Write raw (unstructured) data to file. If the directory path leading to the file does not exist, it will be created.
Example:
data_sinks["my_raw_datasink"].put_raw(my_data)
- Parameters
raw_data (bytes or str) – The data to save (bytes for binary, string for text file)
params (optional dict) – Currently not implemented
chunksize (optional int) – Currently not implemented
- serves: List[str] = ['csv', 'euro_csv', 'text_file', 'binary_file']¶
- class mllaunchpad.datasources.FileDataSource(identifier: str, datasource_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSource
DataSource for fetching data from files.
See serves for the available types.
Configuration example:

datasources:
  # ... (other datasources)
  my_datasource:
    type: euro_csv  # `euro_csv` changes separators to ";" and decimals to "," w.r.t. `csv`
    path: /some/file.csv  # Can be URL, uses `pandas.read_csv` internally
    expires: 0  # generic parameter, see documentation on DataSources
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when fetching the data using `pandas.read_csv`
    dtypes_path: ./some/file.dtypes  # optional: location with the csv's column dtypes info
  my_raw_datasource:
    type: text_file  # raw files can also be of type `binary_file`
    path: /some/file.txt  # Can be URL
    expires: 0  # generic parameter, see documentation on DataSources
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when fetching the data using `fh.read`
When loading csv or euro_csv type formats, you can use the setting dtypes_path to specify a location with a dtypes description for the csv (usually generated earlier by using FileDataSink’s dtypes_path setting). These dtypes will be enforced when reading the csv, which helps avoid problems when pandas.read_csv interprets data differently than you do. Use dtypes_path to enforce dtype parity between csv datasinks and datasources.
Using the raw formats binary_file and text_file, you can read arbitrary data, as long as it can be represented as a bytes or a str object, respectively. text_file uses UTF-8 encoding. Please note that while possible, it is not recommended to persist DataFrames this way, because by adding format-specific code to your model, you’re giving up your code’s independence from the type of DataSource/DataSink. Here’s an example for unpickling an arbitrary object:
# config fragment:
datasources:
  # ...
  my_pickle_datasource:
    type: binary_file
    path: /some/file.pickle
    tags: [train]
    options: {}

# code fragment:
import pickle
# ...
# in predict/test/train code:
my_pickle = data_sources["my_pickle_datasource"].get_raw()
my_object = pickle.loads(my_pickle)
- get_dataframe(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Get data as a pandas dataframe.
Example:
data_sources["my_datasource"].get_dataframe()
- Parameters
params (optional dict) – Currently not implemented
chunksize (optional int) – Return an iterator where chunksize is the number of rows to include in each chunk.
- Returns
DataFrame object, possibly cached according to config value of expires:
- get_raw(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Get data as raw (unstructured) data.
Example:
data_sources["my_raw_datasource"].get_raw()
- Parameters
params (optional dict) – Currently not implemented
chunksize (optional int) – Currently not implemented
- Returns
The file’s bytes (binary) or string (text) contents, possibly cached according to config value of expires:
- Return type
bytes or str
- serves: List[str] = ['csv', 'euro_csv', 'text_file', 'binary_file']¶
- class mllaunchpad.datasources.OracleDataSink(identifier: str, datasink_config: Dict, dbms_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSink
DataSink for Oracle database connections.
Creates a long-living connection on initialization.
Configuration example:
dbms:
  # ... (other connections)
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: oracle
    host: host.example.com
    port: 1251
    user_var: MY_USER_ENV_VAR
    password_var: MY_PW_ENV_VAR  # optional
    service_name: servicename.example.com
    options: {}  # used as **kwargs when initializing the DB connection
# ...
datasinks:
  # ... (other datasinks)
  my_datasink:
    type: dbms.my_connection
    table: somewhere.my_table
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when storing the table using `my_df.to_sql`
- put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Store the pandas dataframe as a table. The default is not to store the dataframe’s row index. Configure the DataSink’s options dict to pass keyword arguments to df.to_sql.
Example:
data_sinks["my_datasink"].put_dataframe(my_df)
- Parameters
dataframe (pandas DataFrame) – The pandas dataframe to store
params (optional dict) – Currently not implemented
chunksize (optional int) – Currently not implemented
- put_raw(raw_data, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Not implemented.
- Raises
NotImplementedError – Raw/blob format currently not supported.
- serves: List[str] = ['dbms.oracle']¶
- class mllaunchpad.datasources.OracleDataSource(identifier: str, datasource_config: Dict, dbms_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSource
DataSource for Oracle database connections.
Creates a long-living connection on initialization.
Configuration example:
dbms:
  # ... (other connections)
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: oracle
    host: host.example.com
    port: 1251
    user_var: MY_USER_ENV_VAR
    password_var: MY_PW_ENV_VAR  # optional
    service_name: servicename.example.com
    options: {}  # used as **kwargs when initializing the DB connection
# ...
datasources:
  # ... (other datasources)
  my_datasource:
    type: dbms.my_connection
    query: SELECT * FROM somewhere.my_table WHERE id = :id  # fill `:params` by calling `get_dataframe` with a `dict`
    expires: 0  # generic parameter, see documentation on DataSources
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when fetching the query using `pandas.read_sql`
- get_dataframe(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Get the data as pandas dataframe.
Null values are replaced by numpy.nan.
Example:
data_sources["my_datasource"].get_dataframe({"id": 387})
- Parameters
params (optional dict) – Query parameters to fill in query (e.g. replace query’s :id parameter with value 387)
chunksize (optional int) – Return an iterator where chunksize is the number of rows to include in each chunk.
- Returns
DataFrame object, possibly cached according to config value of expires:
- get_raw(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Not implemented.
- Raises
NotImplementedError – Raw/blob format currently not supported.
- serves: List[str] = ['dbms.oracle']¶
- class mllaunchpad.datasources.SqlDataSink(identifier: str, datasource_config: Dict, dbms_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSink
DataSink for RedShift, Postgres, MySQL, SQLite, Oracle, Microsoft SQL (ODBC), and their dialects.
Uses SQLAlchemy under the hood, and as such, manages a connection pool automatically.
Please configure the dbms:<name>:connection_string:, which is a standard RFC-1738 URL with the syntax dialect[+driver]://[user:password@][host]/[dbname][?key=value..]. The exact URL is specific to the database you want to connect to. Find examples for all supported database dialects here.
Depending on the dialect you want to use, you might need to install additional drivers and packages. For example, for connecting to a kerberized Impala instance via ODBC, you need to:
- Install Impala ODBC drivers for your OS,
- pip install winkerberos thrift_sasl pyodbc sqlalchemy  # use pykerberos for non-Windows systems
If you are tasked with connecting to a particular database system and don’t know where to start, researching how to connect to it from SQLAlchemy will serve as a good starting point.
Other configuration in the dbms: section (besides connection_string:) is optional, but can be provided if deemed necessary:
- Any dbms:-level settings other than type:, connection_string: and options: will be passed as additional keyword arguments to SQLAlchemy’s create_engine.
- Any key-value pairs inside dbms:<name>:options: {} will be passed to SQLAlchemy as connect_args. If you append _var to the end of an argument key, its value will be interpreted as an environment variable name from which ML Launchpad will attempt to get the value. This can be useful for information like passwords which you do not want to store in the configuration file.
Configuration example:
dbms:
  # ... (other connections)
  # Example for connecting to a kerberized Impala instance via ODBC:
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: sql
    connection_string: mssql+pyodbc:///default?&driver=Cloudera+ODBC+Driver+for+Impala&host=servername.somedomain.com&port=21050&authmech=1&krbservicename=impala&ssl=1&usesasl=1&ignoretransactions=1&usesystemtruststore=1
    # pyodbc alternative: mssql+pyodbc:///?odbc_connect=DRIVER%3D%7BCloudera+ODBC+Driver+for+Impala%7D%3BHOST%3Dservername.somedomain.com%3BPORT%3D21050%3BAUTHMECH%3D1%3BKRBSERVICENAME%3Dimpala%3BSSL%3D1%3BUSESASL%3D1%3BIGNORETRANSACTIONS%3D1%3BUSESYSTEMTRUSTSTORE%3D1
    echo: True  # example for an additional SQLAlchemy keyword argument (logs the SQL) -- these are optional
    options: {}  # used as `connect_args` when creating the SQLAlchemy engine
# ...
datasinks:
  # ... (other datasinks)
  my_datasink:
    type: dbms.my_connection
    table: somewhere.my_table
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when storing the table using `my_df.to_sql`
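For illustration, a hypothetical fragment showing the _var mechanism described above (the connection string, option key and environment variable name are placeholders; which connect_args keys are accepted depends on your DBAPI driver):

dbms:
  my_connection:
    type: sql
    connection_string: postgresql://my_user@db.example.com/my_db
    options:
      password_var: MY_DB_PASSWORD  # value is read from the MY_DB_PASSWORD environment variable and passed via connect_args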
- put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Store the pandas dataframe as a table. The default is not to store the dataframe’s row index. Configure the DataSink’s options dict to pass keyword arguments to df.to_sql.
Example:
data_sinks["my_datasink"].put_dataframe(my_df)
- Parameters
dataframe (pandas DataFrame) – The pandas dataframe to store
params (optional dict) – Currently not implemented
chunksize (optional int) – Currently not implemented
- put_raw(raw_data, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
Not implemented.
- Raises
NotImplementedError – Raw/blob format currently not supported.
- serves: List[str] = ['dbms.sql']¶
- class mllaunchpad.datasources.SqlDataSource(identifier: str, datasource_config: Dict, dbms_config: Dict)[source]¶
Bases:
mllaunchpad.resource.DataSource
DataSource for RedShift, Postgres, MySQL, SQLite, Oracle, Microsoft SQL (ODBC), and their dialects.
Uses SQLAlchemy under the hood, and as such, manages a connection pool automatically.
Please configure the dbms:<name>:connection_string:, which is a standard RFC-1738 URL with the syntax dialect[+driver]://[user:password@][host]/[dbname][?key=value..]. The exact URL is specific to the database you want to connect to. Find examples for all supported database dialects here.
Depending on the dialect you want to use, you might need to install additional drivers and packages. For example, for connecting to a kerberized Impala instance via ODBC, you need to:
- Install Impala ODBC drivers for your OS,
- pip install winkerberos thrift_sasl pyodbc sqlalchemy  # use pykerberos for non-Windows systems
If you are tasked with connecting to a particular database system and don’t know where to start, researching how to connect to it from SQLAlchemy will serve as a good starting point.
Other configuration in the dbms: section (besides connection_string:) is optional, but can be provided if deemed necessary:
- Any dbms:-level settings other than type:, connection_string: and options: will be passed as additional keyword arguments to SQLAlchemy’s create_engine.
- Any key-value pairs inside dbms:<name>:options: {} will be passed to SQLAlchemy as connect_args. If you append _var to the end of an argument key, its value will be interpreted as an environment variable name from which ML Launchpad will attempt to get the value. This can be useful for information like passwords which you do not want to store in the configuration file.
Configuration example:
dbms:
  # ... (other connections)
  # Example for connecting to a kerberized Impala instance via ODBC:
  my_connection:  # NOTE: You can use the same connection for several datasources and datasinks
    type: sql
    connection_string: mssql+pyodbc:///default?&driver=Cloudera+ODBC+Driver+for+Impala&host=servername.somedomain.com&port=21050&authmech=1&krbservicename=impala&ssl=1&usesasl=1&ignoretransactions=1&usesystemtruststore=1
    # pyodbc alternative: mssql+pyodbc:///?odbc_connect=DRIVER%3D%7BCloudera+ODBC+Driver+for+Impala%7D%3BHOST%3Dservername.somedomain.com%3BPORT%3D21050%3BAUTHMECH%3D1%3BKRBSERVICENAME%3Dimpala%3BSSL%3D1%3BUSESASL%3D1%3BIGNORETRANSACTIONS%3D1%3BUSESYSTEMTRUSTSTORE%3D1
    echo: True  # example for an additional SQLAlchemy keyword argument (logs the SQL) -- these are optional
    options: {}  # used as `connect_args` when creating the SQLAlchemy engine
# ...
datasources:
  # ... (other datasources)
  my_datasource:
    type: dbms.my_connection
    query: SELECT * FROM somewhere.my_table WHERE id = :id  # fill `:params` by calling `get_dataframe` with a `dict`
    expires: 0  # generic parameter, see documentation on DataSources
    tags: [train]  # generic parameter, see documentation on DataSources and DataSinks
    options: {}  # used as **kwargs when fetching the query using `pandas.read_sql`
- get_dataframe(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Get the data as pandas dataframe.
Null values are replaced by numpy.nan.
Example:
my_df = data_sources["my_datasource"].get_dataframe({"id": 387})
- Parameters
params (optional dict) – Query parameters to fill in query (e.g. replace query’s :id parameter with value 387)
chunksize (optional int) – Return an iterator where chunksize is the number of rows to include in each chunk.
- Returns
DataFrame object, possibly cached according to config value of expires:
- get_raw(params: Optional[Dict] = None, chunksize: Optional[int] = None)¶
Not implemented.
- Raises
NotImplementedError – Raw/blob format currently not supported.
- serves: List[str] = ['dbms.sql']¶
mllaunchpad.logutil module¶
mllaunchpad.model_actions module¶
Convenience functions for executing training, testing and prediction
- mllaunchpad.model_actions.predict(complete_conf: Dict, arg_dict: Optional[Dict] = None, cache: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None, use_live_code: bool = False)[source]¶
Carry out prediction for the model specified in the configuration.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
arg_dict (optional Dict, default: None) – Arguments dict for the prediction (analogous to what it would get from a web API)
model (optional object implementing ModelInterface, default: None) – Test this model instead of loading it from model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
use_live_code (optional bool, default: False) – Use the current predict function instead of the one persisted with the model in the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
model’s prediction output
- mllaunchpad.model_actions.retest(complete_conf: Dict, cache: bool = True, persist: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]¶
Retest a model as specified in the configuration and persist its test metrics in the model store.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
persist (optional bool, default: True) – Whether to update the model in model_cache: with the test metrics. This parameter exists mainly for making debugging and unit testing your model’s code easier.
model (optional object implementing ModelInterface, default: None) – Test this model instead of loading it from model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
test_metrics
- mllaunchpad.model_actions.train_model(complete_conf: Dict, cache: bool = True, persist: bool = True, test: bool = True, model: Optional[mllaunchpad.model_interface.ModelInterface] = None)[source]¶
Train and test a model as specified in the configuration and persist it in the model store.
- Parameters
complete_conf (dict) – configuration dict
cache (optional bool, default: True) – Whether to cache the data sources/sinks and helper objects (cache lookup is done by model name and model version). If in doubt, leave at default.
persist (optional bool, default: True) – Whether to store the trained model in the location configured by model_cache:. This parameter exists mainly for making debugging and unit testing your model’s code easier.
test (optional bool, default: True) – Whether to test the model after training. This parameter exists mainly for making debugging and unit testing your model’s code easier.
model (optional object implementing ModelInterface, default: None) – Use this model as the previous model instead of trying to load it from the model_store. This parameter exists mainly for making debugging and unit testing your model’s code easier.
- Returns
Tuple of (object implementing ModelInterface, metrics)
mllaunchpad.model_interface module¶
- class mllaunchpad.model_interface.ModelInterface(contents=None)[source]¶
Bases:
abc.ABC
Abstract model interface for Data-Scientist-created models. Please inherit from this class when creating your model to make it usable for ModelApi.
You don’t need to create this object yourself when training. It is created automatically and the model/info returned from create_trained_model is made accessible to you through the self.contents attribute.
- abstract predict(model_conf, data_sources, data_sinks, model, args_dict)[source]¶
Implement this method, including data prep/feature creation based on args_dict. args_dict can also contain an id which the model can use to fetch data from any of the data_sources. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources
data_sinks: dict containing the data sinks, as configured in the config file
model: your model object (whatever you returned in create_trained_model)
args_dict: parameters the API was called with, as a dict of strings (any type conversion needs to be done by you)
- Return:
Prediction result as a dictionary/list structure which will be automatically turned into JSON.
- class mllaunchpad.model_interface.ModelMakerInterface[source]¶
Bases:
abc.ABC
Abstract model factory interface for Data-Scientist-created models. Please inherit from this class and put your training code into the method “create_trained_model”. This method will be called by the framework when your model needs to be (re)trained.
- Why not simply use static methods?
We want to make it possible for create_trained_model to pass extra info to test_trained_model without extending the latter with optional keyword arguments that might be confusing for the 90% of cases where they are not needed. So we rely on the smarts of the person inheriting from this class to find a solution or shortcut if they want to do more involved things, e.g. perform the train/test split themselves.
- abstract create_trained_model(model_conf, data_sources, data_sinks, old_model=None)[source]¶
Implement this method, including data prep/feature creation. No need to test your model here. Put testing code in test_trained_model, which will be called automatically after training. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file. Usually unused when training.
old_model: contains an old model, if it exists, which can be used for incremental training. Default: None
- Return:
The trained model/data/anything which you want to use in the predict() function. This is usually simply a fitted model object, but it can be anything, e.g. a dict of several models, or a model with some extra info. Whatever you return here gets automatically stuffed into your ModelInterface-inherited object and is accessible there via predict’s model parameter (or the self.contents attribute).
- abstract test_trained_model(model_conf, data_sources, data_sinks, model)[source]¶
Implement this method, including data prep/feature creation. This method will be called to re-test a model, e.g. to check whether it has to be re-trained. (Feel free to put common code for preparing data into another function, class, library, …)
- Params:
model_conf: the model configuration dict from the config file
data_sources: dict containing the data sources (this includes your train/validation/test data), as configured in the config file
data_sinks: dict containing the data sinks, as configured in the config file. Usually unused when testing.
model: your model object (whatever you returned in create_trained_model)
- Return:
Return a dict of metrics (like ‘accuracy’, ‘f1’, ‘confusion_matrix’, etc.)
mllaunchpad.resource module¶
- class mllaunchpad.resource.CachedDataSource(name, bases, dct)[source]¶
Bases:
type
Metaclass to auto-apply the @cached decorator to data getters. https://stackoverflow.com/questions/10067262/automatically-decorating-every-instance-method-in-a-class
- class mllaunchpad.resource.DataSink(identifier: str, datasink_config: Dict, sub_config: Optional[Dict] = None)[source]¶
Bases:
object
Interface, used by the Data Scientist’s model to persist data (usually prediction results). Concrete DataSinks (for files, databases, etc.) need to inherit from this class.
- abstract put_dataframe(dataframe: pandas.core.frame.DataFrame, params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
- abstract put_raw(raw_data: Union[str, bytes], params: Optional[Dict] = None, chunksize: Optional[int] = None) None [source]¶
- serves: List[str] = []¶
- class mllaunchpad.resource.DataSource(identifier: str, datasource_config: Dict, sub_config: Optional[Dict] = None)[source]¶
Bases:
object
Interface, used by the Data Scientist’s model to get its data from. Concrete DataSources (for files, databases, etc.) need to inherit from this class.
- serves: List[str] = []¶
- class mllaunchpad.resource.ModelStore(config: Union[Dict, str])[source]¶
Bases:
object
Deals with persisting and loading models, and with updating their metrics and metadata. Abstracts away how and where the model is kept.
TODO: Smarter querying like ‘get me the model with the currently (next) best metrics which serves a particular API.’
- dump_trained_model(complete_conf, model, metrics)[source]¶
Save a model object in the model store. Some metadata will also be saved along with the model, including the metrics passed in the metrics parameter.
- Params:
complete_conf: the complete configuration dict (including our model’s config)
model: the model object to store
metrics: metrics dictionary
- Returns:
Nothing
- list_models()[source]¶
Get information on all available versions of trained models.
Side note: This also includes backups of models that have been re-trained without changing the version number (they reside in the subdirectory previous). Please note that these backed-up models are listed for information only and are not available for loading (one would have to restore them by moving them up one directory level out of previous).
Example:
ms = mllp.ModelStore("./model_store")
all_models = ms.list_models()
# An example of what a ``list_models()``'s result would look like:
{
    iris: {
        1.0.0: { ... complete metadata of this version number ... },
        1.1.0: { ... },
        latest: { ... duplicate of metadata of highest available version number, here 1.1.0 ... },
        backups: [ {...}, {...}, ... ]
    },
    my_other_model: {
        1.0.1: { ... },
        2.0.0: { ... },
        latest: { ... },
        backups: []
    }
}
- Returns
Dict with information on all available trained models.
- mllaunchpad.resource.create_data_sources_and_sinks(config: Dict, tags: Optional[Iterable[str]] = None) Tuple[Dict[str, mllaunchpad.resource.DataSource], Dict[str, mllaunchpad.resource.DataSink]] [source]¶
Creates the data sources and data sinks as defined in the configuration dict. Filters them by tag.
- Params:
config: configuration dictionary
tags: optionally filter for only matching data sources/sinks; no value(s) = match all
- Returns:
Tuple of two dicts: a data sources dict (keys=datasource names, values=initialized DataSource objects) and a data sinks dict (keys=datasink names, values=initialized DataSink objects)
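For illustration, a usage sketch (the config file name, tag and datasource name are placeholders):

from mllaunchpad import get_validated_config
from mllaunchpad.resource import create_data_sources_and_sinks

cfg = get_validated_config("./my_config_file.yml")
data_sources, data_sinks = create_data_sources_and_sinks(cfg, tags=["train"])
train_df = data_sources["my_datasource"].get_dataframe()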
- mllaunchpad.resource.get_user_pw(user_var: str, password_var: str) Tuple[str, Optional[str]] [source]¶
- mllaunchpad.resource.order_columns(obj: Union[pandas.core.frame.DataFrame, numpy.ndarray, Dict])[source]¶
Order the columns of a DataFrame, a dict, or a Numpy structured array. Use this on your training data right before passing it into the model. This will guarantee that the model is trained with a reproducible column order.
Do the same in your test code.
Most importantly, use this also in your predict method, as the incoming args_dict does not have a deterministic order.
- Params:
obj: a DataFrame, a dict, or a Numpy structured array
- Returns:
The obj with columns ordered lexicographically
mllaunchpad.yaml_loader module¶
- class mllaunchpad.yaml_loader.SafeIncludeLoader(stream)[source]¶
Bases:
yaml.loader.SafeLoader
A subclass of SafeLoader which supports !include file references.
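For illustration, a hypothetical configuration fragment using such an !include reference (file and key names are placeholders; where an !include tag makes sense depends on your configuration):

datasources: !include my_datasources.yml
model:
  name: my_model
  version: 0.0.1
  module: my_model_module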
- yaml_constructors = {'tag:yaml.org,2002:null': <function SafeConstructor.construct_yaml_null>, 'tag:yaml.org,2002:bool': <function SafeConstructor.construct_yaml_bool>, 'tag:yaml.org,2002:int': <function SafeConstructor.construct_yaml_int>, 'tag:yaml.org,2002:float': <function SafeConstructor.construct_yaml_float>, 'tag:yaml.org,2002:binary': <function SafeConstructor.construct_yaml_binary>, 'tag:yaml.org,2002:timestamp': <function SafeConstructor.construct_yaml_timestamp>, 'tag:yaml.org,2002:omap': <function SafeConstructor.construct_yaml_omap>, 'tag:yaml.org,2002:pairs': <function SafeConstructor.construct_yaml_pairs>, 'tag:yaml.org,2002:set': <function SafeConstructor.construct_yaml_set>, 'tag:yaml.org,2002:str': <function SafeConstructor.construct_yaml_str>, 'tag:yaml.org,2002:seq': <function SafeConstructor.construct_yaml_seq>, 'tag:yaml.org,2002:map': <function SafeConstructor.construct_yaml_map>, None: <function SafeConstructor.construct_undefined>, '!include': <function SafeIncludeLoader.include>}¶