Usage

Command Line Interface

ML Launchpad’s command line interface is usually only used when developing and preparing a machine learning application. When actually running the API in production, a WSGI server (e.g. Gunicorn or Waitress) is used to run mllaunchpad.wsgi:application instead (the config file is then provided via an environment variable).
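For example, a production start with Gunicorn could look like this (a minimal sketch; path, worker count and port are placeholders to adapt):

$ export LAUNCHPAD_CFG=/path/to/your_config.yml
$ gunicorn --workers 4 --bind 0.0.0.0:5000 mllaunchpad.wsgi:application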

All commands (train, retest, predict, api and generate-raml) can be abbreviated, so you can use e.g. mllaunchpad t or mllaunchpad pred to save some keystrokes.

mllaunchpad

Train, test or run a config file’s model.

mllaunchpad [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

-v, --verbose

Print debug messages.

-c, --config <config>

Use this configuration file. [default: look for env var LAUNCHPAD_CFG or ./LAUNCHPAD_CFG.yml]

-l, --log-config <log_config>

Use this log configuration file. [default: look for env var LAUNCHPAD_LOG or ./LAUNCHPAD_LOG.yml]

api

Run API server in unsafe debug mode.

mllaunchpad api [OPTIONS]

generate-raml

Generate and print RAML template from DATASOURCE_NAME.

The datasource named DATASOURCE_NAME in the config will be used to create the API’s query parameters (from columns), types, and examples.

mllaunchpad generate-raml [OPTIONS] DATASOURCE_NAME

Arguments

DATASOURCE_NAME

Required argument

predict

Run prediction on features from a JSON file (- for stdin).

Example JSON: { "petal.width": 1.4, "petal.length": 2.0,
"sepal.width": 1.8, "sepal.length": 4.0 }

mllaunchpad predict [OPTIONS] [JSON_FILE]

Arguments

JSON_FILE

Optional argument
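For example, to pipe features into predict via stdin (assuming a config file like the tutorial's tree_cfg.yml below):

$ echo '{"petal.width": 1.4, "petal.length": 2.0, "sepal.width": 1.8, "sepal.length": 4.0}' | mllaunchpad -c tree_cfg.yml predict -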

retest

Retest existing model, update metrics.

mllaunchpad retest [OPTIONS]

train

Run training, store created model and metrics.

mllaunchpad train [OPTIONS]

Environment variables

LAUNCHPAD_CFG

(Optional) path to configuration file

LAUNCHPAD_LOG

(Optional) path to logging configuration file

Configuration

See separate page Configuration.

What about support for R, Spark, <other technology>?

ML Launchpad is designed to be as technology-agnostic and flexible as possible. For machine learning technologies, this means that it does not care whether you use it with scikit-learn, PyTorch, spaCy, etc. Just import the Python packages you need and enjoy. See the tutorial in the next section for an example using scikit-learn.

For interfacing with the outside world (getting data, etc.), we created interfaces for extending this functionality. The most common kinds are already supported out of the box. For getting and persisting data, look into inheriting DataSources and DataSinks. For providing your model results in other ways than the provided WSGI API (events, Azure functions, etc.), look into the mllaunchpad API (particularly get_validated_config() and predict()).
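As a rough sketch of the DataSource route (the import path, the serves attribute, the config key and the method signatures below are assumptions for illustration; see the DataSources and DataSinks documentation for the actual contract):

import pandas as pd

from mllaunchpad.resource import DataSource  # assumed import path


class WebJsonDataSource(DataSource):
    """Hypothetical datasource that fetches a DataFrame from a JSON URL."""

    serves = ["web_json"]  # the `type:` value(s) this class handles (assumed mechanism)

    def get_dataframe(self, params=None, chunksize=None):
        # `url` is a hypothetical key in this datasource's config section.
        return pd.read_json(self.config["url"])

    def get_raw(self, params=None, chunksize=None):
        raise NotImplementedError("raw access not supported by this example")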

That said, we already accumulated some partial or complete solutions, and the one you need might already be there:

  • Oracle, Impala, Hive, etc. support is covered by SqlDataSource. It uses SQLAlchemy, which adds a lot of flexibility to the datasource configuration. Please see the SqlDataSource docs for more information. There are also some special classes like OracleDataSource and, in the examples, ImpalaDataSource, but those predate SqlDataSource, and we suggest trying SqlDataSource first.

  • R support works by using and adapting the r_example* files in the examples directory (experimental). Leave r_model.py as is and configure it as the model:module:; in the same section, point model:r_file and model:r_dependencies at your R script and its package requirements. You need to have R installed, as well as the Python package rpy2[pandas].
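    A sketch of the relevant config section, assuming the key names described above (the r_example* files document the exact format; the file names are hypothetical):

    model:
      name: MyRModel
      version: '0.0.1'
      module: r_model                     # keep the provided r_model.py as is
      r_file: my_r_script.R               # hypothetical name of your R script
      r_dependencies: r_requirements.txt  # hypothetical name of your R package list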

  • Spark support is available through the spark_datasource.py module in the examples (experimental). Copy it into your project and include it in your config using the plugins: directive. Its detailed use is documented in the module itself.
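    For example, assuming you copied the module into your project as spark_datasource.py:

    plugins:
      - spark_datasource  # importable module name of the copied file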

  • Containerization is straightforward to do – build an image that exposes the ML Launchpad REST API:

    # Example Dockerfile
    ARG PYTHON=3.7
    FROM python:${PYTHON}-slim-buster as mllp
    RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        vim \
        unixodbc-dev \
        unixodbc \
        libpq-dev \
     && apt-get clean \
     && apt-get autoremove -y \
     && rm -rf /var/lib/apt/lists/*
    WORKDIR /var/www/mllp/app
    # In your project, be selective in what you put into the image.
    COPY . .
    RUN pip install -r requirements.txt
    RUN pip install gunicorn
    RUN python -m mllaunchpad -c my_config.yml train  # If not pre-trained earlier.
    # Tell the WSGI app where to find the config at runtime (see Environment variables above).
    ENV LAUNCHPAD_CFG=my_config.yml
    EXPOSE 5000
    CMD gunicorn --workers 4 --bind 0.0.0.0:5000 mllaunchpad.wsgi
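
    Build and run the image like any other, for example:

    $ docker build -t my-mllp-app .
    $ docker run --rm -p 5000:5000 my-mllp-app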
    
  • Azure/Firebase/AWS serverless functions for prediction can easily be created using the mllaunchpad API:

    import json
    import azure.functions as func
    import mllaunchpad  # see https://mllaunchpad.readthedocs.io/en/stable/mllaunchpad.html
    
    conf = mllaunchpad.get_validated_config("my_cfg_file_or_stream_or_url.yml")  # pass None to use the LAUNCHPAD_CFG env var instead
    
    def main(req: func.HttpRequest) -> func.HttpResponse:
        # (you need to validate params yourself here, skipped in this example)
        result = mllaunchpad.predict(conf, arg_dict=req.params)
        return func.HttpResponse(json.dumps(result), mimetype="application/json")
    
  • For any other technology, there’s a good chance that you can tackle it with one of these mechanisms (extending DataSources/DataSinks or through the API). If you are unsure, please create an issue.

Tutorial

This tutorial will guide you through using ML Launchpad to publish a small machine learning project as a Web API.

Let’s assume that you have developed a Python script called tree_script.py which contains the code to train, test and apply your model from Python:

my_project/
    iris_train.csv
    iris_holdout.csv
    tree_script.py

Contents of tree_script.py:

import sys

import pandas as pd
from sklearn import tree
from sklearn.metrics import accuracy_score, confusion_matrix

def train():
    df = pd.read_csv('iris_train.csv')
    X = df.drop('variety', axis=1)
    y = df['variety']
    model = tree.DecisionTreeClassifier()
    model.fit(X, y)
    return model


def test(model):
    df = pd.read_csv('iris_holdout.csv')
    X_test = df.drop('variety', axis=1)
    y_test = df['variety']
    y_predict = model.predict(X_test)
    acc = accuracy_score(y_test, y_predict)
    conf = confusion_matrix(y_test, y_predict).tolist()
    metrics = {'accuracy': acc, 'confusion_matrix': conf}
    return metrics


def predict(model, args_dict):
    # Create DF explicitly. No guarantee that dict keys are in correct order,
    # so we have to make sure *manually* that they match the column order we used
    # when training the model:
    X = pd.DataFrame({
        'sepal.length': [args_dict['sepal.length']],
        'sepal.width': [args_dict['sepal.width']],
        'petal.length': [args_dict['petal.length']],
        'petal.width': [args_dict['petal.width']]
        })
    y = model.predict(X)[0]
    return {'prediction': y}


if __name__ == '__main__':
    args = dict(zip([n for n in sys.argv[1::2]], [float(v) for v in sys.argv[2::2]]))
    my_model = train()
    print('metrics:', test(my_model))
    pred = predict(my_model, args)
    print('prediction result:', pred)

    # Example:
    # $ python tree_script.py sepal.length 3 sepal.width 2.7 petal.length 4.5 petal.width 3.5
    # metrics: {'accuracy': 0.95, 'confusion_matrix': [[6, 0, 0], [0, 7, 0], [0, 1, 6]]}
    # prediction result: {'prediction': 'Virginica'}

This script can be called from the command line and guesses the variety of iris from some physical measurements provided as command line arguments. It somewhat wastefully trains a new model every time it is called, and does not check the validity of the arguments at all. Besides making the model available as a Web API, ML Launchpad will also solve these two problems.

To use ML Launchpad, install it first using:

$ pip install mllaunchpad

Now, we’ll create a new Python file called tree_model.py in which we will fill in the blanks:

my_project/
    iris_train.csv
    iris_holdout.csv
    tree_script.py
    tree_model.py

The file tree_model.py looks like this at first:

from mllaunchpad import ModelInterface, ModelMakerInterface
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import tree
import pandas as pd
import logging

logger = logging.getLogger(__name__)

class MyTreeModelMaker(ModelMakerInterface):
    """Creates a Iris prediction model"""

    def create_trained_model(self, model_conf, data_sources, data_sinks, old_model=None):
        ...

        return model

    def test_trained_model(self, model_conf, data_sources, data_sinks, model):
        ...

        return metrics


class MyTreeModel(ModelInterface):
    """Uses the created Iris prediction model"""

    def predict(self, model_conf, data_sources, data_sinks, model, args_dict):
        ...

        return output

You can find a template like this in ML Launchpad’s examples (download the examples, or copy-paste from TEMPLATE_model.py on GitHub).

The three methods create_trained_model(), test_trained_model() and predict() correspond to the three functions in our script above. We can essentially copy and paste the contents of our three functions into those, but we will need to change some details to make the code work with ML Launchpad.

Here, we’ll make use of the method arguments data_sources and model. See model_interface for details on all available arguments.

If we call our training DataSource petals and our test DataSource petals_test, our completed tree_model.py looks like this (changed code is highlighted with # comments):

from mllaunchpad import ModelInterface, ModelMakerInterface, order_columns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import tree
import pandas as pd
import logging

logger = logging.getLogger(__name__)

class MyTreeModelMaker(ModelMakerInterface):
    """Creates a Iris prediction model"""

    def create_trained_model(self, model_conf, data_sources, data_sinks, old_model=None):
        # use data_source instead of reading CSV ourselves:
        df_unordered = data_sources['petals'].get_dataframe()
        df = order_columns(df_unordered)  # make col order reproducible for API use
        X = df.drop('variety', axis=1)
        y = df['variety']
        model = tree.DecisionTreeClassifier()
        model.fit(X, y)
        return model

    def test_trained_model(self, model_conf, data_sources, data_sinks, model):
        # use data_source instead of reading CSV ourselves:
        df_unordered = data_sources['petals_test'].get_dataframe()
        df = order_columns(df_unordered)  # make col order reproducible for API use
        X_test = df.drop('variety', axis=1)
        y_test = df['variety']
        y_predict = model.predict(X_test)
        acc = accuracy_score(y_test, y_predict)
        conf = confusion_matrix(y_test, y_predict).tolist()
        metrics = {'accuracy': acc, 'confusion_matrix': conf}
        return metrics


class MyTreeModel(ModelInterface):
    """Uses the created Iris prediction model"""

    def predict(self, model_conf, data_sources, data_sinks, model, args_dict):
        # No changes required, but instead of this clumsy construct here...
        # X = pd.DataFrame({
        #     'sepal.length': [args_dict['sepal.length']],
        #     'sepal.width': [args_dict['sepal.width']],
        #     'petal.length': [args_dict['petal.length']],
        #     'petal.width': [args_dict['petal.width']]
        #     })
        # ... we can use this much shorter method thanks to using
        # order_columns earlier, guaranteeing deterministic column ordering:
        X = order_columns(pd.DataFrame(args_dict, index=[0]))
        y = model.predict(X)[0]
        return {'prediction': y}

So we now get our data from the data_sources argument instead of directly from CSV files, and we get our model object passed in as an argument, same as before.

The three methods return the same things as our own functions:

  • create_trained_model() returns a trained model object (can be pretty much anything),

  • test_trained_model() returns a dict with metrics (can also contain lists, numpy arrays or pandas DataFrames), and

  • predict() returns a prediction (usually a dict, but can also contain lists, numpy arrays or pandas DataFrames).

Sidenote: To save additional information while training for traceability’s sake, use mllaunchpad.report() in your train and test code. The metadata thus saved resides in the model store together with the model. By default, it includes basic info such as the configuration (see below), some system info, and the test metrics. When done with training, you can retrieve metadata of all models in the model store from Python by using mllaunchpad.list_models().
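For example (a minimal sketch; report() is assumed here to take a dict of metadata, and the keys are illustrative; check the mllaunchpad API reference for the exact signature):

import mllaunchpad

# Somewhere inside create_trained_model() or test_trained_model():
mllaunchpad.report({"training_rows": len(X), "feature_columns": list(X.columns)})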

Next, we will configure some extra info about our model, as well as tell ML Launchpad where to find the petals and petals_test DataSources.

Create a file called tree_cfg.yml:

my_project/
    iris_train.csv
    iris_holdout.csv
    tree_model.py
    tree_cfg.yml

(We're done with our original tree_script.py, so I've removed it.)

Contents of tree_cfg.yml:

datasources:
  petals:
    type: csv
    path: ./iris_train.csv  # The string can also be a URL. Valid URL schemes include http, ftp, s3, and file.
    expires: 0  # -1: never (=cached forever), 0: immediately (=no caching), >0: time in seconds.
    options: {}
    tags: train
  petals_test:
    type: csv
    path: ./iris_holdout.csv
    expires: 3600
    options: {}
    tags: test

model_store:
  location: ./model_store  # Just in current directory for now

model:
  name: TreeModel
  version: '0.0.1'  # use semantic versioning (<breaking>.<adding>.<fix>), first segment will be used in API url as e.g. .../v1/...
  module: tree_model  # same as file name without .py
  train_options: {}
  predict_options: {}

api:
  name: iris  # name of the service api
  raml: tree.raml
  preload_datasources: False  # Load datasources into memory before any predictions. Only makes sense with caching.

Here, we define our datasources so ML Launchpad knows where to find the data we refer to from our model. Besides CSV files, other types of DataSources are supported, and extending DataSources is also possible (see DataSources and DataSinks for more information on the supported built-in DataSources).

The model_store is just a directory where all trained models will be stored together with their metrics.

The model section gives our model a name and version which will be used to uniquely identify it when saving/loading. Here, we also provide the importable name of our tree_model.py, which is just tree_model. If it were in a package (directory) called something, we would write something.tree_model instead. It's a good idea to make sure our model module is on Python's path (sys.path or PYTHONPATH) so it can be found when ML Launchpad imports it.
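For example (adapt the path to where tree_model.py lives):

$ export PYTHONPATH=/path/to/my_project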

The api section provides details on the Web API we want to publish. This section is perhaps surprisingly sparse. The reason is that the API definition is off-loaded into a RESTful API Modeling Language (RAML) file.

You can generate a RAML file using the command line tool that was installed along with ML Launchpad:

$ mllaunchpad --config tree_cfg.yml generate-raml petals >tree.raml

This creates the API definition file tree.raml using the columns and their types in the petals datasource for defining parameters. We still need to adapt this file a little because it also lists our target variable variety as an input parameter, which we don’t want, so we edit the file and remove these lines:

variety:
  displayName: Friendly Name of variety
  type: string
  description: Description of what variety really is
  example: 'Versicolor'
  required: true

This is the only change that is technically necessary. Feel free to read the RAML file and improve the template descriptions, rename mythings to something that makes sense (like varieties), adapt the output format to what you want to use, and so on.

Our model is done! Let’s try it out.

$ mllaunchpad --config tree_cfg.yml train

Now we have a trained model in our model_store. Let’s run a test Web API (only for debug purposes, see here for running production APIs):

$ mllaunchpad --config tree_cfg.yml api

We can find a test URL in our generated tree.raml. Just remove the &variety=... part, and open the link http://127.0.0.1:5000/iris/v0/mythings?sepal.length=5.6&sepal.width=2.7&petal.length=4.2&petal.width=1.3 e.g. in Chrome. You can see the result of our model’s prediction immediately:

{
    "prediction": "Versicolor"
}

Automatic input validation is included for free. Try changing the URL to provide a string value instead of a number, or remove one of the parameters, and you get a message explaining what is wrong.

What we have now is what is called a RESTful API. Web APIs like this make it easy for other systems or web sites to include your model's predictions in their functionality.

Here’s a quick hacked-together HTML page which makes the predictions available to an end user:

<!DOCTYPE html>
<html><body>
    <h2>Iris Tree Demo</h2>
    <p>
        Sepal Length: <input id="sl" type="range" min="0.1" max="7" step="0.1"><br>
        Sepal Width: <input id="sw" type="range" min="0.1" max="7" step="0.1"><br>
        Petal Length: <input id="pl" type="range" min="0.1" max="7" step="0.1"><br>
        Petal Width: <input id="pw" type="range" min="0.1" max="7" step="0.1"><br>
    </p>
    <p id="output"></p>
    <script>
        function predict() {
            let sl = document.querySelector('#sl').value;
            let sw = document.querySelector('#sw').value;
            let pl = document.querySelector('#pl').value;
            let pw = document.querySelector('#pw').value;
            fetch(`http://127.0.0.1:5000/iris/v0/mythings?sepal.length=${sl}&sepal.width=${sw}&petal.length=${pl}&petal.width=${pw}`)
            .then(function(response) {
                console.log(response);
                return response.json();
            })
            .then(function(myJson) {
                console.log(myJson);
                document.querySelector('#output').innerHTML =
                  `This is an example of the ${myJson.prediction} variety`;
            });
        }
        let inputs = document.querySelectorAll('input');
        for (let input of inputs) {
            input.addEventListener('change', predict, false);
        }
    </script>
</body></html>

If you put prototype HTML interfaces like this in a static subfolder, then they will be accessible at e.g. http://127.0.0.1:5000/static/tree.html. Keep in mind that this is only for demo/debug usage, not for production. The position of the static subfolder is governed by the api:root_path key (with a default value of .) in your config file.
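For example, in tree_cfg.yml:

api:
  name: iris
  raml: tree.raml
  root_path: .  # folder that contains the static subfolder (default shown)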

You can find this and other examples here (download). To run the tree example from this tutorial:

$ cd examples
$ mllaunchpad --config tree_cfg.yml train
$ mllaunchpad --config tree_cfg.yml api

Then open http://127.0.0.1:5000/static/tree.html in your browser.

To learn more, have a look at the examples provided in mllaunchpad’s GitHub repository (examples as zip file).