Targeted Users
ElasticDL targets two categories of users:
- Modelers, those who create new models, including deep learning researchers and engineers, and
- SQLFlow users.
The high-level API must meet the requirements of these users.
User Expectations
Modelers
Modelers usually craft their Keras models on their personal computers, test them with small datasets, and then want to submit a distributed training job with big datasets on the cloud.
Suppose that one is working on a model in the local directory
$HOME/work/fintech/*.py, where each .py file might contain one or more
Keras model classes. We would like the user to be able to submit an ElasticDL
training job from the command line, as follows, to train a model defined
as the class MyKerasModel.
elasticdl train \
--model_zoo=$HOME/work \
--model_def=fintech.MyKerasModel \
--input_fn=fintech.credit_data_processor \
--params="hidden_units=[10, 100, 20, 5], learning_rate=0.01" \
--data="gs://bucket-name/tony/imagenet/train/*.recordio" \
--output="gs://bucket-name/tony/my_trained_model"
The above command line
- builds a Docker image containing (1) $HOME/work mapped to /model_zoo/custom, (2) ElasticDL, (3) dependencies of ElasticDL,
- submits an ElasticDL job to the Kubernetes cluster as described in $HOME/.kube/config, and
- prints a URL to the dashboard so that users can inspect the progress and status of the job in the user’s Web browser.
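The flags in the command line above could be read with a small argparse front end. The following is a minimal sketch, not ElasticDL's actual client: the flag names come from the example, while the function name build_parser and the choice of which flags are required are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical `elasticdl train` subcommand."""
    parser = argparse.ArgumentParser(prog="elasticdl")
    subparsers = parser.add_subparsers(dest="command", required=True)

    train = subparsers.add_parser("train", help="submit a training job")
    train.add_argument("--model_zoo", required=True)
    train.add_argument("--model_def", required=True)
    # Assumed optional: when omitted, the model's default_input would be used.
    train.add_argument("--input_fn", default=None)
    train.add_argument("--params", default="")
    train.add_argument("--data", required=True)
    train.add_argument("--output", required=True)
    return parser


# Parse an argument list equivalent to the example, without --input_fn.
args = build_parser().parse_args([
    "train",
    "--model_zoo=/home/tony/work",
    "--model_def=fintech.MyKerasModel",
    "--params=hidden_units=[10, 100, 20, 5], learning_rate=0.01",
    "--data=gs://bucket-name/tony/imagenet/train/*.recordio",
    "--output=gs://bucket-name/tony/my_trained_model",
])
```

Every value arrives as a plain string, so a client built this way could forward the parsed options to the job-launching API unchanged.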
Please be aware that in the class fintech.MyKerasModel, in addition to
overriding the method call, we also need to provide methods like
- default_loss that returns a loss operator,
- default_optimizer that returns an optimizer operator,
- default_input that takes a record (string) as its input and returns something that can be batched and consumed by MyKerasModel.call.
In the above example, the user chooses an input function other than MyKerasModel.default_input.
Because the above example command line specifies --input_fn explicitly, the
training job uses fintech.credit_data_processor instead of
MyKerasModel.default_input. Similarly, the command-line options loss and
optimizer override MyKerasModel.default_loss and
MyKerasModel.default_optimizer.
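The precedence rule described above (an explicit option wins, otherwise the model's default_* method is used) can be sketched in a few lines. This is an illustration only: the helper resolve and the stand-in class DemoModel are hypothetical, and a real model class would subclass the Keras model class rather than a plain Python class.

```python
def resolve(model_cls, name, override=None):
    """Return `override` if given, else the model's `default_<name>` method."""
    if override is not None:
        return override
    return getattr(model_cls, "default_" + name)


class DemoModel:
    """Hypothetical stand-in for a Keras model class with default_* methods."""

    @staticmethod
    def default_loss():
        return "demo loss"

    @staticmethod
    def default_optimizer():
        return "demo optimizer"

    @staticmethod
    def default_input(record):
        # Turn one record (a string) into something that can be batched.
        return record.split(",")


def credit_data_processor(record):
    """Hypothetical user-supplied input function."""
    return [float(field) for field in record.split(",")]


# An explicit input function overrides the default; loss falls back.
input_fn = resolve(DemoModel, "input", credit_data_processor)
loss_fn = resolve(DemoModel, "loss")
```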
Another important command is the one that runs prediction.
elasticdl predict \
--data="gs://bucket-name/tony/imagenet/test/*.recordio" \
--trained_model="gs://bucket-name/tony/my_trained_model" \
--output="gs://bucket-name/tony/imagenet-eval/"
SQLFlow Users
SQLFlow users provide the information required by training or prediction by writing a SQL statement with extended syntax. The syntax for training extends the SELECT statement with the TRAIN clause. For example:
SELECT name, role, salary FROM employee
TRAIN regressor.DNN
WITH hidden_units=[10, 100, 20, 5], learning_rate=0.01
INTO my_trained_model;
Please be aware that to minimize the syntax extension, SQLFlow doesn’t allow
users to specify a directory of models; instead, users can only use pre-built
models – regressor.DNN in the above example.
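To make the mapping from the extended statement to a training job more concrete, here is a toy sketch that splits such a statement into its clauses. SQLFlow's real parser is much more complete; the regular expression and the function name split_train_statement are assumptions made purely for illustration.

```python
import re

# Matches: SELECT ... TRAIN <model> WITH <params> INTO <name>;
_TRAIN_STMT = re.compile(
    r"^\s*(?P<select>SELECT\b.+?)\s+TRAIN\s+(?P<model_def>[\w.]+)"
    r"\s+WITH\s+(?P<params>.+?)\s+INTO\s+(?P<output>\w+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def split_train_statement(statement):
    """Split an extended SELECT ... TRAIN statement into named clauses."""
    match = _TRAIN_STMT.match(statement)
    if match is None:
        raise ValueError("not a SELECT ... TRAIN ... WITH ... INTO statement")
    return match.groupdict()


clauses = split_train_statement("""
SELECT name, role, salary FROM employee
TRAIN regressor.DNN
WITH hidden_units=[10, 100, 20, 5], learning_rate=0.01
INTO my_trained_model;
""")
```

The model_def and params clauses are already in the string form that a job-launching function would accept, while the select clause is what would be sent to the SQL engine to fetch the training data.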
SQLFlow is a gRPC server that takes the above SQL statement and translates it
into a Python program known as a submitter. It is the responsibility of the
submitter to call kubectl to launch an ElasticDL job on a Kubernetes cluster.
SQLFlow often runs in Docker containers, and it is usually impractical to build
a Docker image from within a Docker container, so the submitter requires a
pre-built Docker image containing (1) /model_zoo, (2) ElasticDL, (3)
dependencies of ElasticDL. The class regressor.DNN is a class defined in
some Python source files in /model_zoo.
The submitter might send the statement SELECT name, role, salary FROM
employee to the SQL engine, pull the result, and convert it into one or
more RecordIO files in which each record is a serialized tf.Example
protobuf message. So, the input function that ElasticDL uses to parse the
records for a pre-built model class such as regressor.DNN could be a
standardized one, say, sqlflow.elasticdl_input_function.
To predict using a pre-trained model and write the results into a column of a table, we can write
SELECT name, role FROM testdata
PREDICT testdata.predicted_salary
USING my_trained_model;
Unified API
Both the command line tool elasticdl provided for modelers and the submitter
program generated by SQLFlow need to call an API that launches ElasticDL jobs.
Hence this design.
API
We hope the ElasticDL API supports not only batch learning, but also online learning, adversarial learning, reinforcement learning, and federated learning. For now, however, let us start with batch learning.
For Training
We propose a function elasticdl.train that can be called like the following:
elasticdl.train(
model_zoo="$HOME/work",
model_def="fintech.MyKerasModel",
input_fn="fintech.credit_data_processor",
params="hidden_units=[10, 100, 20, 5], learning_rate=0.01",
data="gs://bucket-name/tony/imagenet/train/*.recordio",
output="gs://bucket-name/tony/my_trained_model")
or
elasticdl.train(
model_zoo="https://github.com/sql-machine-learning/models",
model_def="regressor.DNN",
input_fn="sqlflow.elasticdl_input_function",
params="hidden_units=[10, 100, 20, 5], learning_rate=0.01",
data="gs://sqlflow/job-xxyyzz/train/*.recordio",
output="gs://sqlflow/job-xxyyzz/my_trained_model")
Please be aware that most parameters of elasticdl.train are strings
because command-line options and SQL statements are all strings.
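Because params arrives as a single string, something has to turn it back into typed values before the model class constructor is called. The sketch below shows one possible way, assuming the values are Python literals; the function name parse_params is hypothetical, and the parser does not handle commas inside quoted strings.

```python
import ast


def parse_params(params):
    """Parse 'k1=v1, k2=v2' into a dict, reading values as Python literals."""
    pieces = []
    depth = 0  # nesting depth inside [], (), {}
    start = 0
    for i, ch in enumerate(params):
        if ch in "[({":
            depth += 1
        elif ch in "])}":
            depth -= 1
        elif ch == "," and depth == 0:
            # Only a comma outside brackets separates two parameters.
            pieces.append(params[start:i])
            start = i + 1
    pieces.append(params[start:])

    result = {}
    for piece in pieces:
        if not piece.strip():
            continue
        key, _, value = piece.partition("=")
        result[key.strip()] = ast.literal_eval(value.strip())
    return result


parsed = parse_params("hidden_units=[10, 100, 20, 5], learning_rate=0.01")
```

Using ast.literal_eval rather than eval keeps the parser from executing arbitrary code that might be embedded in a command line or SQL statement.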
For Prediction
We propose a function elasticdl.predict that can be called like the following:
elasticdl.predict(
data='gs://bucket-name/tony/imagenet/test/*.recordio',
trained_model='gs://bucket-name/tony/my_trained_model',
output='gs://bucket-name/tony/imagenet-eval.recordio')
or
elasticdl.predict(
data="gs://sqlflow/job-xxyyzz/predict/*.recordio",
trained_model="gs://sqlflow/job-xxyyzz/my_trained_model",
output="gs://sqlflow/job-xxyyzz/predicted/")
Model Zoo
When the ElasticDL client or the SQLFlow server calls elasticdl.train, this
function calls the Docker API to build a Docker image and then submits the job.
The build process should add a model zoo to the Docker image. The function
elasticdl.train has a parameter model_zoo, which can be one of the following:
- A local directory, for example,
  elasticdl.train(model_zoo="a_local_directory", ...)
- A URL pointing to a Git repo, for example,
  elasticdl.train(model_zoo="https://git.company.com/sql-machine-learning/models", ...)
A model zoo is a plain Python source directory that’s added to /model_zoo in
the Docker image. Its root directory must contain a requirements.txt
file, so the image building process can install dependencies via
RUN pip install -r /model_zoo/requirements.txt
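As a rough illustration of the image-building step, the following sketch renders a Dockerfile as text. It is not ElasticDL's build logic: the function name render_dockerfile is hypothetical, and the base image is assumed to already contain ElasticDL and its dependencies.

```python
def render_dockerfile(base_image, model_zoo_dir="."):
    """Render Dockerfile text that adds a model zoo to an assumed base image."""
    lines = [
        # Assumption: the base image already has ElasticDL installed.
        "FROM " + base_image,
        # Copy the model zoo from the build context into /model_zoo.
        "COPY " + model_zoo_dir + " /model_zoo",
        # Install the model zoo's own dependencies.
        "RUN pip install -r /model_zoo/requirements.txt",
    ]
    return "\n".join(lines) + "\n"


# "elasticdl-base:latest" is a placeholder name, not a published image.
dockerfile = render_dockerfile("elasticdl-base:latest")
```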
Suppose that a Keras model class is referred to as regressor.DNN in
elasticdl.train(model_def="regressor.DNN", ...); the corresponding Python file
should be /model_zoo/regressor.py. Similarly, a class
regressor.wide_and_deep.MagicalWAD is in the Python file
/model_zoo/regressor/wide_and_deep.py.
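The naming convention above can be expressed as a small helper that maps a model_def string to a source file and a class name. This is a sketch of the convention only; the function name locate_model is hypothetical.

```python
import posixpath


def locate_model(model_def, model_zoo="/model_zoo"):
    """Split 'pkg.module.ClassName' into (source file path, class name)."""
    module, _, class_name = model_def.rpartition(".")
    if not module or not class_name:
        raise ValueError("model_def must look like 'module.ClassName'")
    # Dots in the module part become directory separators under the zoo.
    path = posixpath.join(model_zoo, *module.split(".")) + ".py"
    return path, class_name


dnn = locate_model("regressor.DNN")
wad = locate_model("regressor.wide_and_deep.MagicalWAD")
```

In practice, adding /model_zoo to sys.path and importing the module part with importlib.import_module would resolve the same names and would also cover classes exposed through a package's __init__.py.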
Trained Model
A call to elasticdl.predict looks like the following:
elasticdl.predict(
data='/filestore/yiwang/imagenet/test/*.recordio',
trained_model='/filestore/tony/my_keras_model',
output='/filestore/yiwang/imagenet-eval.recordio')
It needs to
- build and push a Docker image, and
- launch a distributed ElasticDL job of the type “predict”.
The Docker image must contain the model zoo used to train the model
trained_model='/filestore/tony/my_keras_model'.
A key question is what information must be in the directory
/filestore/tony/my_keras_model.
- A Docker image ID.
  We need this ID to refer to the Docker image built during the call of
  elasticdl.train. In this image, we have the model zoo used to train the model. Then, elasticdl.predict could build the Docker image for the distributed prediction job from this commit ID.
  This image ID must be a pullable ID so that ElasticDL command line tool can
  docker pull it as the base image. An example pullable ID is docker-pullable://reg.docker.alibaba-inc.com/asdi/aswf-py3@sha256:e8ca09705eed07cdfd060b6b9d27a802.
- Model class constructor parameters, like hidden_units=[10, 100, 20].
- Other parameters passed to elasticdl.train, including model_def, input_function, loss, and optimizer.
- Model parameters as a map from parameter name to parameter value tensors,
  defined in elasticdl.proto.
We define a new wrapper message:
message TrainedModel {
string docker_commit_id = 1;
string model_def = 2;
string model_def_params = 3;
string params_filename = 4;
string input_function = 5;
string loss = 6;
string optimizer = 7;
}
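To show how this metadata might be written during training and read back before prediction, here is a sketch that mirrors the message fields in a Python dataclass. It uses JSON only as a stand-in for protobuf serialization; the helper names dump_meta and load_meta, the empty-string defaults, and all field values in the example are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrainedModel:
    """Python mirror of the TrainedModel message; all fields are strings."""
    docker_commit_id: str
    model_def: str
    model_def_params: str
    params_filename: str
    input_function: str
    loss: str = ""       # assumed empty when the model default is used
    optimizer: str = ""  # assumed empty when the model default is used


def dump_meta(meta):
    """Serialize the metadata to text (JSON stands in for protobuf here)."""
    return json.dumps(asdict(meta), sort_keys=True)


def load_meta(text):
    """Rebuild the metadata from text produced by dump_meta."""
    return TrainedModel(**json.loads(text))


# Placeholder values; the image ID and file name are hypothetical.
meta = TrainedModel(
    docker_commit_id="registry.example.com/elasticdl/job@sha256:placeholder",
    model_def="fintech.MyKerasModel",
    model_def_params="hidden_units=[10, 100, 20, 5], learning_rate=0.01",
    params_filename="model_params.bin",
    input_function="fintech.credit_data_processor",
)
restored = load_meta(dump_meta(meta))
```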