Targeted Users
ElasticDL targets two categories of users:
- Modelers, those who create new models, including deep learning researchers and engineers, and
- SQLFlow users.
The high-level API must meet the requirements of these users.
User Expectations
Modelers
Modelers usually craft their Keras models on their personal computers, test the models with small datasets, and then want to submit a distributed training job with big datasets on the cloud.
Suppose that one is working on a model in the local source files $HOME/work/fintech/*.py, where each .py file might contain one or more Keras model classes. We would like to let the user submit an ElasticDL training job from the command line, as in the following example, to train a model defined as the class MyKerasModel.
elasticdl train \
--model_zoo=$HOME/work \
--model_def=fintech.MyKerasModel \
--input_fn=fintech.credit_data_processor \
--params="hidden_units=[10, 100, 20, 5], learning_rate=0.01" \
--data="gs://bucket-name/tony/imagenet/train/*.recordio" \
--output="gs://bucket-name/tony/my_trained_model"
The above command line
- builds a Docker image containing (1) $HOME/work mapped to /model_zoo/custom, (2) ElasticDL, and (3) the dependencies of ElasticDL,
- submits an ElasticDL job to the Kubernetes cluster described in $HOME/.kube/config, and
- prints a URL to the dashboard so that users can inspect the progress and status of the job in their Web browser.
Please be aware that the class fintech.MyKerasModel, in addition to overriding the method call, also needs to provide methods like default_loss, which returns a loss operator; default_optimizer, which returns an optimizer operator; and default_input, which takes a record (string) as its input and returns something that can be batched and consumed by MyKerasModel.call. In the above example, the user chooses an input function other than MyKerasModel.default_input.
Because the above example command line specifies --input_fn explicitly, the training job does not use MyKerasModel.default_input but uses fintech.credit_data_processor instead. Similarly, the command-line options loss and optimizer override MyKerasModel.default_loss and MyKerasModel.default_optimizer.
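The override rule above can be sketched in plain Python. This is only an illustration under stated assumptions, not ElasticDL's implementation: the helper resolve_component is hypothetical, the class below is a stand-in that does not subclass a real Keras model, and the methods return strings where a real model would return operators.

```python
class MyKerasModel:
    """Stand-in for a Keras model class (a real one would subclass a Keras model)."""

    @staticmethod
    def default_loss():
        return "default_loss"

    @staticmethod
    def default_optimizer():
        return "default_optimizer"

    @staticmethod
    def default_input():
        return "default_input"


def resolve_component(model_cls, name, override=None):
    """Return the command-line override if given, else model_cls.default_<name>."""
    if override is not None:
        return override
    return getattr(model_cls, "default_" + name)


def credit_data_processor():
    """Stand-in for fintech.credit_data_processor."""
    return "credit_data_processor"


# An input function was given on the command line, so it wins over default_input.
input_fn = resolve_component(MyKerasModel, "input", override=credit_data_processor)
# No loss option was given, so the model's default_loss is used.
loss_fn = resolve_component(MyKerasModel, "loss")
```

Keeping the defaults on the model class means that a command line naming only the model definition is still a complete job description.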
Another important command supports prediction:
elasticdl predict \
--data="gs://bucket-name/tony/imagenet/test/*.recordio" \
--trained_model="gs://bucket-name/tony/my_trained_model" \
--output="gs://bucket-name/tony/imagenet-eval/"
SQLFlow Users
SQLFlow users provide the information required by training or prediction by writing a SQL statement with extended syntax. The syntax for training extends the SELECT statement with the TRAIN clause. For example:
SELECT name, role, salary FROM employee
TRAIN regressor.DNN
WITH hidden_units=[10, 100, 20, 5], learning_rate=0.01
INTO my_trained_model;
Please be aware that, to minimize the syntax extension, SQLFlow doesn't allow users to specify a directory of models; instead, users can only use pre-built models, such as regressor.DNN in the above example.
SQLFlow is a gRPC server that takes the above SQL statement and translates it into a Python program known as a submitter. It is the responsibility of the submitter to call kubectl to launch an ElasticDL job on a Kubernetes cluster.
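To make the translation step concrete, the following sketch splits the example TRAIN statement into the pieces a submitter would need. It is a simplified illustration, not SQLFlow's parser: the regular expression and the function split_train_statement are hypothetical and cover only the exact statement shape shown above.

```python
import re

# Hypothetical pattern for the TRAIN form shown above; a real parser would
# need to handle far more SQL syntax than this.
TRAIN_STATEMENT = re.compile(
    r"^\s*(?P<select>SELECT\b.+?)\s+"
    r"TRAIN\s+(?P<model_def>[\w.]+)\s+"
    r"WITH\s+(?P<params>.+?)\s+"
    r"INTO\s+(?P<output>\w+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def split_train_statement(sql):
    """Split an extended SELECT ... TRAIN statement into its four parts."""
    match = TRAIN_STATEMENT.match(sql)
    if match is None:
        raise ValueError("not a SELECT ... TRAIN ... WITH ... INTO statement")
    return match.groupdict()


parts = split_train_statement(
    """SELECT name, role, salary FROM employee
TRAIN regressor.DNN
WITH hidden_units=[10, 100, 20, 5], learning_rate=0.01
INTO my_trained_model;"""
)
```

In this sketch, the select part is what the submitter would send to the SQL engine, while the other three parts correspond to the model definition, the model parameters, and the name of the trained model.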
SQLFlow often runs in Docker containers, and it is usually impractical to build a Docker image from within a Docker container, so the submitter requires a pre-built Docker image containing (1) /model_zoo, (2) ElasticDL, and (3) the dependencies of ElasticDL. The class regressor.DNN is defined in one of the Python source files in /model_zoo.
The submitter might submit the statement SELECT name, role, salary FROM employee to the SQL engine, pull the result, and convert the result into one or more RecordIO files in which each record is a serialized tf.Example protobuf message. So, the input function used by ElasticDL to parse the strings for DNNClassifer.class could be a standardized one, say, sqlflow.elasticdl_input_function.
To predict using a pre-trained model and to write the results into a column of a table, we can write
SELECT name, role FROM testdata
PREDICT testdata.predicted_salary
USING my_trained_model;
Unified API
Both the command-line tool elasticdl provided for modelers and the submitter program generated by SQLFlow need to call an API that launches ElasticDL jobs. Hence this design.
API
We hope the ElasticDL API supports not only batch learning but also online learning, adversarial learning, reinforcement learning, and federated learning. For now, however, let us start with batch learning.
For Training
We propose a function elasticdl.train that can be called like the following:
elasticdl.train(
model_zoo="$HOME/work",
model_def="fintech.MyKerasModel",
input_fn="fintech.credit_data_processor",
params="hidden_units=[10, 100, 20, 5], learning_rate=0.01",
data="gs://bucket-name/tony/imagenet/train/*.recordio",
output="gs://bucket-name/tony/my_trained_model")
or
elasticdl.train(
model_zoo="https://github.com/sql-machine-learning/models",
model_def="regressor.DNN",
input_fn="sqlflow.elasticdl_input_function",
params="hidden_units=[10, 100, 20, 5], learning_rate=0.01",
data="gs://sqlflow/job-xxyyzz/train/*.recordio",
output="gs://sqlflow/job-xxyyzz/my_trained_model")
Please be aware that most parameters of elasticdl.train are of string type because the command-line options and SQL statements are all strings.
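Because params arrives as a single string, the callee has to turn it back into typed values at some point. The following is a minimal sketch of one way to do that with the Python standard library; the function parse_params is hypothetical and not part of the proposed API, and it assumes that every value is a Python literal.

```python
import ast


def parse_params(params):
    """Parse 'k1=v1, k2=v2' into a dict, where each value is a Python literal."""
    # Wrap the string as the keyword arguments of a dummy call so that the
    # Python parser handles commas nested inside lists. Nothing is executed.
    call = ast.parse("f({})".format(params), mode="eval").body
    if not isinstance(call, ast.Call) or call.args:
        raise ValueError("expected only key=value pairs: {!r}".format(params))
    result = {}
    for keyword in call.keywords:
        if keyword.arg is None:  # reject '**something'
            raise ValueError("expected only key=value pairs: {!r}".format(params))
        # literal_eval accepts only literals, so arbitrary expressions are rejected.
        result[keyword.arg] = ast.literal_eval(keyword.value)
    return result


model_params = parse_params("hidden_units=[10, 100, 20, 5], learning_rate=0.01")
```

A naive split on commas would break hidden_units apart, which is why the sketch delegates to the Python parser instead.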
For Prediction
We propose a function elasticdl.predict that can be called like the following:
elasticdl.predict(
data='gs://bucket-name/tony/imagenet/test/*.recordio',
trained_model='gs://bucket-name/tony/my_trained_model',
output='gs://bucket-name/tony/imagenet-eval.recordio')
or
elasticdl.predict(
data="gs://sqlflow/job-xxyyzz/predict/*.recordio",
trained_model="gs://sqlflow/job-xxyyzz/my_trained_model",
output="gs://sqlflow/job-xxyyzz/predicted/")
Model Zoo
When the ElasticDL client or the SQLFlow server calls elasticdl.train, this function calls the Docker API to build a Docker image and then submits the job. The building process should add a model zoo into the Docker image. The function elasticdl.train has a parameter model_zoo, which can be one of the following:
- A local directory, for example,
elasticdl.train(model_zoo="a_local_directory", ...)
- A URL pointing to a Git repo, for example,
elasticdl.train(model_zoo="https://git.company.com/sql-machine-learning/models", ...)
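The two cases could be told apart as in the sketch below. The function model_zoo_kind is hypothetical, and the sketch assumes that Git repos are always given as http or https URLs; other forms, such as SSH remotes, are not handled.

```python
from urllib.parse import urlparse


def model_zoo_kind(model_zoo):
    """Return 'git' for an http(s) URL and 'local' for anything else."""
    scheme = urlparse(model_zoo).scheme
    return "git" if scheme in ("http", "https") else "local"
```

With this rule, the image building process would clone the repo in the first case and copy the directory in the second.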
A model zoo is a plain Python source directory that's added to /model_zoo in the Docker image. Its root directory must contain a requirements.txt file, so that the image building process can install dependencies via
RUN pip install -r /model_zoo/requirements.txt
Suppose that a Keras model class is referred to as regressor.DNN in elasticdl.train(model_def="regressor.DNN", ...); the corresponding Python file should be /model_zoo/regressor.py. Similarly, a class regressor.wide_and_deep.MagicalWAD is in the Python file /model_zoo/regressor/wide_and_deep.py.
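The naming rule above amounts to a small mapping from a model_def string to a source file and a class name. The following sketch shows that mapping; the function model_def_to_path is hypothetical and only restates the two examples given in the text.

```python
import posixpath


def model_def_to_path(model_def, model_zoo_root="/model_zoo"):
    """Map 'pkg.module.ClassName' to (source file path, class name)."""
    module_name, _, class_name = model_def.rpartition(".")
    if not module_name:
        raise ValueError("model_def must look like 'module.ClassName'")
    # Each dotted component of the module name becomes a directory level.
    path = posixpath.join(model_zoo_root, *module_name.split(".")) + ".py"
    return path, class_name
```

For example, model_def_to_path("regressor.DNN") yields the file /model_zoo/regressor.py and the class name DNN.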
Trained Model
A call to elasticdl.predict looks like the following:
elasticdl.predict(
data='/filestore/yiwang/imagenet/test/*.recordio',
trained_model='/filestore/tony/my_keras_model',
output='/filestore/yiwang/imagenet-eval.recordio')
It needs to
- build and push a Docker image, and
- launch a distributed ElasticDL job of the type “predict”.
The Docker image must contain the model zoo used to train the model trained_model='/filestore/tony/my_keras_model'.
A key question is what information must be in the directory /filestore/tony/my_keras_model.
- A Docker image ID.

  We need this ID to refer to the Docker image built during the call of elasticdl.train. In this image, we have the model zoo used to train the model. Then, elasticdl.predict could build the Docker image for the distributed prediction job from this ID.

  This image ID must be a pullable ID so that the ElasticDL command-line tool can docker pull it as the base image. An example pullable ID is docker-pullable://reg.docker.alibaba-inc.com/asdi/aswf-py3@sha256:e8ca09705eed07cdfd060b6b9d27a802.
- Model class constructor parameters, like hidden_units=[10, 100, 20].
- Other parameters passed to elasticdl.train, including
  - model_def
  - input_function
  - loss
  - optimizer
- Model parameters as a map from parameter name to parameter value tensors, defined in elasticdl.proto.
We define a new wrapper message:
message TrainedModel {
string docker_commit_id = 1;
string model_def = 2;
string model_def_params = 3;
string params_filename = 4;
string input_function = 5;
string loss = 6;
string optimizer = 7;
}
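To show how the arguments of elasticdl.train might populate this message, here is a plain-Python mirror of its fields. It is an illustration only: the dataclass is not generated protobuf code, the mapping from train arguments to fields is an assumption, and the image ID and file name below are placeholders rather than real values.

```python
from dataclasses import asdict, dataclass


@dataclass
class TrainedModel:
    """Plain-Python mirror of the TrainedModel message (illustration only)."""

    docker_commit_id: str
    model_def: str
    model_def_params: str
    params_filename: str
    # Empty strings stand for options that were not given at training time.
    input_function: str = ""
    loss: str = ""
    optimizer: str = ""


record = TrainedModel(
    docker_commit_id="<pullable-image-id>",  # placeholder, not a real ID
    model_def="fintech.MyKerasModel",
    model_def_params="hidden_units=[10, 100, 20, 5], learning_rate=0.01",
    params_filename="model_params.pb",  # hypothetical file name
    input_function="fintech.credit_data_processor",
)
```

With such a record stored next to the model parameters, a prediction job would have what it needs to pull the base image and reconstruct the model class before loading the parameters.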