Configuration
The configuration file is the glue that holds your ML-Launchpad-based application together. It links the things on the “inside”, that is, your model’s implementation, to the things on the “outside”, such as the data connection (DataSources and DataSinks), as well as the API configuration.
Sidenote: You can use this to your advantage when developing and testing your machine learning algorithm by using different configuration files for different stages of your development life cycle. That way, you can cleanly separate environments such as development, testing and production, and switch between them with the same build, without touching your code.
For ML Launchpad to know how to do its job, it always needs a configuration. To accommodate different ways of using ML Launchpad, you have different options of providing the configuration. From most common to least common:
1. Provide the path to the config file on the command line (--config or -c option).
2. Set the environment variable LAUNCHPAD_CFG to the path of the config file.
3. Put a config file named LAUNCHPAD_CFG.yml in the current working directory.
4. Call get_validated_config() with the path of the config file to get a config dict and provide it as an argument when calling predict(), retest() and/or train_model().
Way of providing config | When to use |
---|---|
--config / -c option | when developing; in some cases training |
LAUNCHPAD_CFG environment variable | developing; when deployed (e.g. in production) |
LAUNCHPAD_CFG.yml in working directory | deployed |
in code | when using |
Note: Besides LAUNCHPAD_CFG, there is also the LAUNCHPAD_LOG environment variable, which, if provided, will be used as the logging configuration file.
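The first three options can be pictured as a simple lookup chain. The following is only a sketch: resolve_config_path is a hypothetical helper, not part of ML Launchpad, and it assumes that the options are tried in the order listed above, which this page does not state explicitly.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_path(
    cli_path: Optional[str],
    environ: Mapping[str, str],
    cwd: Path,
) -> Optional[Path]:
    """Return the first config path found, or None (hypothetical helper)."""
    # 1. Explicit --config / -c value from the command line.
    if cli_path:
        return Path(cli_path)
    # 2. LAUNCHPAD_CFG environment variable.
    env_path = environ.get("LAUNCHPAD_CFG")
    if env_path:
        return Path(env_path)
    # 3. LAUNCHPAD_CFG.yml in the current working directory.
    default_file = cwd / "LAUNCHPAD_CFG.yml"
    if default_file.is_file():
        return default_file
    # Nothing found: the config would have to be provided in code instead.
    return None
```

You could call it as resolve_config_path(None, os.environ, Path.cwd()). Passing the environment and working directory as arguments keeps the function easy to test.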
Config File
The configuration file is written in YAML (.yml) format (used internally as a Python dict). It uses UTF-8 encoding.
Here’s an example configuration with comments:
plugins:  # Optionally specify any additional imports (only external DataSources/-Sinks for now, cf. ``DataSources``)
  - bogusdatasource
  - records_datasource
datasources:  # This section is optional. Places to get data from, and how.
  petals:  # Name by which you want to refer to the datasource, e.g. using ``data_sources["petals"]``
    # The properties ``type``, ``expires``, ``options`` and ``tags`` are present
    # in all types of datasources/datasinks.
    # All other properties are specific to the datasource type.
    type: csv  # Generic; the type of the datasource
    path: ./iris_train.csv  # Can also be a URL. Valid URL schemes: http, ftp, s3, and file
    expires: 0  # -1: never (=cached forever), 0: immediately (=no caching), >0: time in seconds
    cache_size: 10  # Optional: maximum number of different results to keep in memory, default=32
    options: {}  # Special kwargs to pass to the datasource's implementation
    tags: train  # String or list of strings. Valid are "train", "test" and/or "predict".
  petals_test:
    type: csv
    path: ./iris_holdout.csv
    expires: 3600
    options: {}
    tags: test
  # You can define as many datasources and datasinks as you like.
  # The tags "train", "test" and/or "predict" will determine which datasources/datasinks
  # will be provided to which functions in your model implementation.
  # Any combination of tags with datasources/datasinks is valid.
# datasinks:  # This section is optional. Places to put data. NOT needed for prediction outputs, unless you require batch output, special file formats, etc.
  # The configuration structure of datasinks is equivalent to that of datasources.
model_store:  # Required. Where your model and metadata is persisted.
  location: ./model_store  # Directory on file system (local or remote).
model:  # Required. Details about your model's implementation.
  name: TreeModel
  version: '0.0.1'  # Use semantic versioning (<breaking>.<adding>.<fix>), first segment will be used in API url as e.g. .../v1/...
  module: tree_model  # Main module of your functionality. Same as source code file name without .py
  # Put custom properties for your implementation here.
  # For example, to configure NLP-related aspects of your model (language, etc.),
  # to perform fewer iterations for testing purposes, etc.
  # It is not recommended to put low-level hyperparameters here.
api:  # Optional. Details about your API. The API will start with /<api:name>/v<model:version[major]>/
  # If you don't specify the api property, you cannot use mllaunchpad's WSGI API.
  # You would eschew mllaunchpad's WSGI API if you want to make it available as
  # part of another service framework, e.g. AWS Lambda or Azure Functions.
  name: iris  # Name of the service API
  raml: tree.raml  # Path to the API's RAML definition (see two sections below)
  preload_datasources: False  # Load datasources into memory before any predictions. Only makes sense with caching (expires != 0).
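To illustrate how the tags in the example decide which datasources go where, here is a minimal sketch that works on the datasources section as a plain Python dict. The helpers normalize_tags and select_by_tag are hypothetical names for illustration only, and the actual selection logic inside ML Launchpad may differ.

```python
from typing import Any, Dict, List, Union

# The "datasources" section of the example config, as a Python dict.
DATASOURCES: Dict[str, Dict[str, Any]] = {
    "petals": {"type": "csv", "path": "./iris_train.csv", "expires": 0, "tags": "train"},
    "petals_test": {"type": "csv", "path": "./iris_holdout.csv", "expires": 3600, "tags": "test"},
}


def normalize_tags(tags: Union[str, List[str]]) -> List[str]:
    """Accept a single tag or a list of tags and always return a list."""
    return [tags] if isinstance(tags, str) else list(tags)


def select_by_tag(datasources: Dict[str, Dict[str, Any]], tag: str) -> List[str]:
    """Return the names of all entries whose tags include the given tag."""
    return [
        name
        for name, conf in datasources.items()
        if tag in normalize_tags(conf.get("tags", []))
    ]
```

With the example config, select_by_tag(DATASOURCES, "train") yields only petals, and asking for "predict" yields an empty list because no datasource carries that tag.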
Details on how to configure specific types of DataSources and DataSinks can be found on the page DataSources and DataSinks.
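The three cases of the expires property from the example config (-1, 0 and a positive number of seconds) can be spelled out as a small function. This is a sketch of the documented semantics only: is_cache_valid is a hypothetical name, and treating an entry as expired exactly at the limit is an assumption.

```python
def is_cache_valid(expires: int, age_seconds: float) -> bool:
    """Decide whether a cached result of the given age may still be used."""
    if expires == -1:
        return True  # never expires: cached forever
    if expires == 0:
        return False  # expires immediately: no caching
    if expires < -1:
        raise ValueError("expires must be -1, 0 or a positive number of seconds")
    # Positive value: valid while younger than the limit (boundary is an assumption).
    return age_seconds < expires
```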
Plugins
In your Config File, you can optionally use a top-level plugins: key to specify (a list of) modules that should be imported by ML Launchpad (currently only used while initializing the DataSources and DataSinks). If any of these plugins are in conflict with other plugins or built-ins, the last-imported one has precedence over the previous ones.
For example, if several DataSource plugins offer to serve the same type (e.g. csv), the last one in the plugins: list will be chosen as the designated csv handler, overruling both the built-in FileDataSource as well as any other csv-serving DataSources listed before the one in question.
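The "last one wins" rule behaves like repeated assignment into a dict. The sketch below is not ML Launchpad's actual registration code. PluginA, PluginB and the records type are made-up names for illustration. Only FileDataSource and csv come from the text above.

```python
from typing import Dict, List, Tuple


def build_type_registry(handlers: List[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map each served type to the handler that was registered last for it."""
    registry: Dict[str, str] = {}
    for handler_name, served_types in handlers:
        for served_type in served_types:
            # A later registration silently replaces an earlier one.
            registry[served_type] = handler_name
    return registry


# Built-in first, then plugins in the order of the plugins: list (hypothetical names).
HANDLERS = [
    ("FileDataSource", ["csv"]),
    ("PluginA", ["csv"]),
    ("PluginB", ["csv", "records"]),
]
```

Here, build_type_registry(HANDLERS) assigns csv to PluginB, because it is listed after both FileDataSource and PluginA.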
RAML API Definition
The API will be prefixed with /<api:name>/v<model:version[major]>/ from your configuration file (/iris/v0/ in the above example). How the API actually looks beyond that is governed by your RAML file.
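As a quick illustration of how the prefix follows from the configuration, this sketch derives it from a config dict. api_prefix is a hypothetical helper and not a function of ML Launchpad. It assumes the version is a dotted string, as in the example config.

```python
from typing import Any, Dict


def api_prefix(config: Dict[str, Any]) -> str:
    """Build /<api:name>/v<model:version[major]>/ from a config dict."""
    name = config["api"]["name"]
    # The major version is the first segment of the semantic version string.
    major = str(config["model"]["version"]).split(".")[0]
    return f"/{name}/v{major}/"


EXAMPLE_CONFIG = {
    "model": {"name": "TreeModel", "version": "0.0.1", "module": "tree_model"},
    "api": {"name": "iris", "raml": "tree.raml"},
}
```

For EXAMPLE_CONFIG, the result is /iris/v0/. A model version of '1.2.3' would give /iris/v1/ instead.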
The RAML specification language has been chosen as the way to specify the API in a way that is compatible with common tools (such as MuleSoft). Other languages do exist, and contributions to support them are welcome.
The RAML is the contract between you and your service API’s clients. How to write valid RAML is beyond the scope of this documentation. But to help you get started, there are various examples, and you can generate a basic Query Parameters-based RAML using mllaunchpad generate-raml.
ML Launchpad understands a subset of RAML in order to automatically create APIs for the (currently) three most common use cases (please note that they support GET as well as POST):
Query Parameters
These are named parameters with a value.
For instance, in our “iris” example, an API call looks like
/iris/v0/varieties?sepal.width=3&sepal.length=1.3[...]
Here, the query parameters are sepal.width, sepal.length etc., each with one value:
/varieties:  # the resource name that comes after /iris/v0
  get:  # can also be post
    description: Get a prediction for the variety of iris flower based on measurements of physical petal and sepal dimensions
    queryParameters:
      sepal.length:
        displayName: Sepal Length
        type: number
        description: Measured length of iris flower sepals (flower leaves)
        example: 3.14
        required: false  # test, should be true
        minimum: 0
        repeat: false  # set to true to get param's list of values in your args_dict
      sepal.width:
        displayName: Sepal Width
        type: number
        description: Measured width of iris flower sepals (flower leaves)
        example: 3.14
        required: false  # test, should be true
        minimum: 0
      # ...
The displayName, type, required, example, and minimum/maximum properties are used by ML Launchpad for validation and logging. The optional repeat property turns the parameter into an array if you need it to support multiple values (RAML 0.8 standard). As a partial implementation of RAML 1.0 arrays, you can alternatively specify the type with brackets (number[], string[] etc.).
Use enum to specify a list of allowed values (for categorical data).
Other RAML properties are ignored.
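To give a feeling for what validation against these properties means, here is a simplified sketch that checks raw string parameters against a RAML-like spec. It is not ML Launchpad's validation code. validate_params and the unit parameter are hypothetical, and repeat as well as bracketed array types are left out for brevity.

```python
from typing import Any, Dict

SPEC: Dict[str, Dict[str, Any]] = {
    "sepal.length": {"type": "number", "required": True, "minimum": 0},
    # Hypothetical categorical parameter to demonstrate enum.
    "unit": {"type": "string", "required": False, "enum": ["metric", "imperial"]},
}


def validate_params(raw: Dict[str, str], spec: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return converted parameter values, raising ValueError on any violation."""
    result: Dict[str, Any] = {}
    for name, rules in spec.items():
        if name not in raw:
            if rules.get("required", False):
                raise ValueError(f"missing required parameter: {name}")
            continue
        value: Any = raw[name]
        if rules.get("type") == "number":
            value = float(value)  # raises ValueError for non-numeric input
            if "minimum" in rules and value < rules["minimum"]:
                raise ValueError(f"{name} is below the minimum of {rules['minimum']}")
            if "maximum" in rules and value > rules["maximum"]:
                raise ValueError(f"{name} is above the maximum of {rules['maximum']}")
        if "enum" in rules and value not in rules["enum"]:
            raise ValueError(f"{name} must be one of {rules['enum']}")
        result[name] = value
    return result
```

A call such as validate_params({"sepal.length": "1.3"}, SPEC) returns the value converted to a float, while a missing required parameter or a value outside the allowed range raises a ValueError.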
Your model’s predict() method will get passed an args_dict with a key for each query parameter, by which you can access the values.
Query parameters may be combined with URL Parameters (see tree example).
Sidenote: The technology that ML Launchpad uses under the hood also supports requests with arbitrary JSON bodies, which might work with ML Launchpad to provide more complex values. However, this is not officially supported at present.
URL Parameters
A string in your API’s URL, e.g. /iris/v0/varieties/12, which usually identifies one record in the set of resources.
Example RAML:
/varieties:
  /{my_url_param_name}:  # parameter name to use
    get:  # post also possible
      queryParameters:  # Optional, just to demonstrate that this can be used in conjunction with query parameters.
        hallo:
          description: some demo query parameter in addition to the uri param
          type: string
          required: true
          enum: ['metric', 'imperial']
      # ...
The args_dict passed to your model’s predict() method will contain the value under whatever name you gave it (here: “my_url_param_name”), in addition to any other query parameters.
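The following sketch shows how a URL parameter and query parameters could end up side by side in one args_dict. It only mimics the behaviour described above. match_url_template and build_args_dict are hypothetical helpers, and keeping the URL parameter as a string is an assumption.

```python
import re
from typing import Dict, Optional


def match_url_template(template: str, path: str) -> Optional[Dict[str, str]]:
    """Return the URL parameters if path matches template, else None."""
    # re.split with a capturing group alternates literal text and parameter names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>[^/]+)"
        for index, part in enumerate(parts)
    )
    match = re.fullmatch(pattern, path)
    return match.groupdict() if match else None


def build_args_dict(template: str, path: str, query: Dict[str, str]) -> Dict[str, str]:
    """Merge URL parameters and query parameters into a single dict."""
    url_params = match_url_template(template, path)
    if url_params is None:
        raise ValueError(f"{path} does not match {template}")
    return {**query, **url_params}
```

For the RAML above, build_args_dict("/varieties/{my_url_param_name}", "/varieties/12", {"hallo": "metric"}) produces a dict with the keys hallo and my_url_param_name.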
URL Parameters may be combined with Query Parameters (see tree example).
Files
Handling files (using multipart/form-data) is also possible.
Example RAML:
/topics:
  post:
    description: Upload a PDF file to predict the topic for.
    body:
      multipart/form-data:
        formParameters:
          text:
            displayName: Optional alternative text of a client message
            type: string
            description: The plain text of a client's letter, email, etc (uncleaned)
            required: false
        properties:
          file:
            description: The PDF file containing the client message, to be uploaded
            required: false
            type: file
            fileTypes: ["application/pdf"]
    # ...
The args_dict passed to your model’s predict() method will contain a parameter named “file” with a FileStorage object. You can get its file name using args_dict["file"].filename and access its contents using args_dict["file"].stream.
See the FileStorage documentation for more details.
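Here is a small sketch of reading such an upload inside your own code, using only the filename and stream attributes mentioned above. describe_upload is a hypothetical helper. For a self-contained demonstration, a SimpleNamespace stands in for the real FileStorage object, which is an assumption made purely for illustration.

```python
import io
from types import SimpleNamespace
from typing import Any, Dict


def describe_upload(args_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Read the uploaded file and return a few basic facts about it."""
    uploaded = args_dict["file"]
    content = uploaded.stream.read()  # reads the whole upload into memory
    return {
        "filename": uploaded.filename,
        "size": len(content),
        # PDF files start with the magic bytes "%PDF-".
        "looks_like_pdf": content.startswith(b"%PDF-"),
    }


def make_fake_upload() -> SimpleNamespace:
    """Stand-in with the same two attributes as the real upload object."""
    return SimpleNamespace(filename="letter.pdf", stream=io.BytesIO(b"%PDF-1.4 example"))
```

Calling describe_upload({"file": make_fake_upload()}) reports the file name, the number of bytes read and whether the content starts like a PDF. Note that reading the stream consumes it, so read it once and keep the bytes if you need them again.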
As can be seen in the example, a file can be combined with Query Parameters. But it cannot currently be combined with URL Parameters in ML Launchpad.