Configuration

The configuration file is the glue that holds your ML-Launchpad-based application together. It links the things on the “inside”, that is, your model’s implementation, to the things on the “outside”, such as data connections (DataSources and DataSinks) and the API configuration.

Sidenote: You can use this to your advantage when developing and testing your machine learning algorithm by using different configuration files for different stages of your development life cycle. That way, you can cleanly separate environments like development/testing/production, switching between them without touching your code (the same build can be used everywhere).

For ML Launchpad to know how to do its job, it always needs a configuration. To accommodate different ways of using ML Launchpad, there are several options for providing the configuration. From most common to least common:

  • Provide the path to the config file on the command line (--config or -c option).

  • Set the environment variable LAUNCHPAD_CFG to the path of the config file.

  • Put a config file named LAUNCHPAD_CFG.yml in the current working directory.

  • Call get_validated_config() with the path of the config file to get a config dict and provide it as an argument when calling predict(), retest() and/or train_model().
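
For example, the last (“in code”) option could look like the following minimal sketch. The config file name and the prediction arguments are invented for illustration, and argument names and return values may vary between versions (see the package's examples for complete code):

import mllaunchpad

# Hypothetical config file path; point this at your own configuration.
config = mllaunchpad.get_validated_config("./my_config.yml")

# Train the model and persist it to the configured model store ...
mllaunchpad.train_model(config)

# ... then get a prediction; the keys correspond to your RAML's parameters.
prediction = mllaunchpad.predict(config, {"sepal.length": 1.3, "sepal.width": 3.0})
print(prediction)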

Way of providing config          When to use
------------------------------   ----------------------------------------------
mllaunchpad --config <file>      when developing; in some cases training
LAUNCHPAD_CFG environment var    developing; when deployed (e.g. in production)
./LAUNCHPAD_CFG.yml              deployed
in code                          when using mllaunchpad functionality from
                                 another app instead of WSGI, e.g. as an
                                 Azure function

Note: Besides LAUNCHPAD_CFG, there is also the LAUNCHPAD_LOG environment variable which, if set, will be used as the path of the logging configuration file.

Config File

The configuration file is written in YAML (.yml) format and uses UTF-8 encoding. Internally, it is parsed into a Python dict.

Here’s an example configuration with comments:

plugins:  # Optionally specify any additional imports (only external DataSources/-Sinks for now, cf. ``DataSources``)
  - bogusdatasource
  - records_datasource

datasources:  # This section is optional. Places to get data from, and how.
  petals:  # Name by which you want to refer to the datasource, e.g. using ``data_sources["petals"]``
    # The properties ``type``, ``expires``, ``options`` and ``tags`` are present
    # in all types of datasources/datasinks.
    # All other properties are specific to the datasource type.
    type: csv  # Generic; the type of the datasource
    path: ./iris_train.csv  # Can also be a URL. Valid URL schemes: http, ftp, s3, and file
    expires: 0  # -1: never (=cached forever), 0: immediately (=no caching), >0: time in seconds
    cache_size: 10  # Optional: maximum number of different results to keep in memory, default=32
    options: {}  # Special kwargs to pass to the datasource's implementation
    tags: train  # String or list of strings. Valid are "train", "test" and/or "predict".
  petals_test:
    type: csv
    path: ./iris_holdout.csv
    expires: 3600
    options: {}
    tags: test
  # You can define as many datasources and datasinks as you like.
  # The tags "train", "test" and/or "predict" will determine which datasources/datasinks
  # will be provided to which functions in your model implementation.
  # Any combination of tags with datasources/datasinks is valid.

# datasinks:  # This section is optional. Places to put data. NOT needed for prediction outputs, unless you require batch output, special file formats, etc.
  # The configuration structure of datasinks is equivalent to that of datasources.

model_store:  # Required. Where your model and metadata are persisted.
  location: ./model_store  # Directory on file system (local or remote).

model:  # Required. Details about your model's implementation.
  name: TreeModel
  version: '0.0.1'  # Use semantic versioning (<breaking>.<adding>.<fix>), first segment will be used in API url as e.g. .../v1/...
  module: tree_model  # Main module of your functionality. Same as source code file name without .py
  # Put custom properties for your implementation here.
  # For example, to configure NLP-related aspects of your model (language, etc.),
  # to perform fewer iterations for testing purposes, etc.
  # It is not recommended to put low-level hyperparameters here.

api:  # Optional. Details about your API. The API will start with /<api:name>/v<model:version[major]>/
  # If you don't specify the api property, you cannot use mllaunchpad's WSGI API.
  # You would eschew mllaunchpad's WSGI API if you want to make it available as
  # part of another service framework, e.g. AWS Lambda or Azure Functions.
  name: iris  # Name of the service API
  raml: tree.raml  # Path to the API's RAML definition (see two sections below)
  preload_datasources: False  # Load datasources into memory before any predictions. Only makes sense with caching (expires != 0).
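
As mentioned above, the parsed configuration is just a Python dict, so your code can access any of its properties directly, including custom properties you put in the model: section. With the example configuration above (the config file name is again invented):

import mllaunchpad

config = mllaunchpad.get_validated_config("./my_config.yml")
print(config["model"]["name"])     # -> "TreeModel"
print(config["model"]["version"])  # -> "0.0.1"
print(config["api"]["name"])       # -> "iris"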

Details on how to configure specific types of DataSources and DataSinks can be found on the page DataSources and DataSinks.

Plugins

In your Config File, you can optionally use a top-level plugins: key to specify (a list of) modules that should be imported by ML Launchpad (currently only used while initializing the DataSources and DataSinks). If any of these plugins conflict with other plugins or built-ins, the last-imported one takes precedence over the previous ones.

For example, if several DataSource plugins offer to serve the same type (e.g. csv), the last one in the plugins: list will be chosen as the designated csv handler, overruling both the built-in FileDataSource as well as any other csv-serving DataSources listed before the one in question.
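
A plugin module is an ordinary Python module on your import path that defines one or more DataSource/DataSink subclasses. The following is only a rough sketch modeled on the bogusdatasource plugin named above; the base class location, the serves attribute and the get_dataframe() signature are assumptions here, so consult the DataSources and DataSinks page for the actual interface:

# bogusdatasource.py -- sketch of a plugin module (interface details assumed).
import pandas as pd
from mllaunchpad.resource import DataSource

class BogusDataSource(DataSource):
    serves = ["bogus"]  # the `type:` value(s) this plugin wants to handle

    def get_dataframe(self, params=None, chunksize=None):
        # Return made-up data instead of reading from a real source.
        return pd.DataFrame({"sepal.length": [1.3], "sepal.width": [3.0]})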

RAML API Definition

The API will be prefixed with /<api:name>/v<model:version[major]>/ from your configuration file (/iris/v0/ in above example). How the API actually looks beyond that is governed by your RAML file.

The RAML specification language has been chosen for specifying the API because it is compatible with common tools (such as MuleSoft). Other specification languages do exist, and contributions to support them are welcome.

The RAML is the contract between you and your service API’s clients. How to write a valid RAML is beyond the scope of this documentation. But to help you get started, there are various examples, and you can generate a basic Query Parameters-based RAML using mllaunchpad generate-raml.

ML Launchpad understands a subset of RAML in order to automatically create APIs for the (currently) three most common use cases (please note that they support GET as well as POST):

Query Parameters

These are named parameters with a value.

E.g. in our “iris” example, in an API call like /iris/v0/varieties?sepal.width=3&sepal.length=1.3[...], these would be sepal.width, sepal.length etc., each with one value:

/varieties:  # the resource name that comes after /iris/v0
  get:  # can also be post
    description: Get a prediction for the variety of iris flower based on measurements of physical petal and sepal dimensions
    queryParameters:
      sepal.length:
        displayName: Sepal Length
        type: number
        description: Measured length of iris flower sepals (flower leaves)
        example: 3.14
        required: false  # test, should be true
        minimum: 0
        repeat: false  # set to true to get param's list of values in your args_dict
      sepal.width:
        displayName: Sepal Width
        type: number
        description: Measured width of iris flower sepals (flower leaves)
        example: 3.14
        required: false  # test, should be true
        minimum: 0
      # ...

The displayName, type, required, example, and minimum/maximum properties are used by ML Launchpad for validation and logging. The optional repeat property turns the parameter into an array if you need it to support multiple values (RAML 0.8 standard). As a half-baked implementation of RAML 1.0 arrays, you can alternatively specify the type with brackets (number[], string[], etc.). Use enum to specify a list of allowed values (for categorical data). Other RAML properties are ignored.

Your model’s predict() method will get passed an args_dict with a key for each query parameter, by which you can access the values.
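
Accessing these values inside predict() could look as follows (the parameter names are the ones from the RAML above):

# Inside your model's predict() implementation:
sepal_length = args_dict["sepal.length"]  # declared as number, validated by ML Launchpad
sepal_width = args_dict["sepal.width"]

# With `repeat: true` (or a bracketed type such as number[]), the value
# would instead be a list of all values passed for that parameter.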

Query parameters may be combined with URL Parameters (see tree example).

Sidenote: The web framework that ML Launchpad uses under the hood also supports requests with arbitrary JSON bodies, which might work with ML Launchpad for providing more complex values, but this is not officially supported at this point in time.

URL Parameters

A string in your API’s URL, e.g. /iris/v0/varieties/12, which usually identifies one record in the set of resources.

Example RAML:

/varieties:
  /{my_url_param_name}:  # parameter name to use
    get:  # post also possible
      queryParameters:  # Optional, just to demonstrate that this can be used in conjunction with query parameters.
        hallo:
          description: some demo query parameter in addition to the uri param
          type: string
          required: true
          enum: ['metric', 'imperial']
    # ...

The args_dict passed to your model’s predict() method will contain the value under whatever name you gave it (here: “my_url_param_name”), in addition to any other query parameters.
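
For the RAML above, both kinds of parameters are read the same way inside predict():

# Inside predict(), for a call like /iris/v0/varieties/12?hallo=metric:
variety = args_dict["my_url_param_name"]  # -> "12" (URL parameters are strings)
units = args_dict["hallo"]                # -> "metric" (one of the enum values)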

URL Parameters may be combined with Query Parameters (see tree example).

Files

Handling files (using multipart/form-data) is also possible.

Example RAML:

/topics:
  post:
    description: Upload a PDF file to predict the topic for.
    body:
      multipart/form-data:
        formParameters:
          text:
            displayName: Optional alternative text of a client message
            type: string
            description: The plain text of a client's letter, email, etc. (uncleaned)
            required: false
        properties:
          file:
            description: The PDF file containing the client message, to be uploaded
            required: false
            type: file
            fileTypes: ["application/pdf"]
    # ...

The args_dict passed to your model’s predict() method will contain a parameter named “file” with a FileStorage object. You can get its file name using args_dict["file"].filename and access its contents using args_dict["file"].stream. See the FileStorage documentation for more details.
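
For example (variable names invented; FileStorage is the uploaded-file type of the web framework ML Launchpad uses under the hood):

# Inside predict(), with the RAML above:
upload = args_dict["file"]        # a FileStorage object
name = upload.filename            # file name as sent by the client
contents = upload.stream.read()   # raw bytes of the uploaded PDF
text = args_dict.get("text")      # the optional "text" form parameter, if sent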

As can be seen in the example, a file can be combined with Query Parameters. But it cannot currently be combined with URL Parameters in ML Launchpad.