Targeted Users

ElasticDL targets two categories of users:

  1. Modelers, those who create new models, including deep learning researchers and engineers, and
  2. SQLFlow users.

The high-level API must meet the requirements of these users.

User Expectations


Modelers usually craft their Keras models on their personal computers, test the model with small datasets, and would like to file a distributed training job with big datasets on the cloud.

Suppose that a modeler is working on a model in the local directory $HOME/work/fintech/*.py, where each .py file might contain one or more Keras model classes. We would like the user to be able to submit an ElasticDL training job from the command line, as in the following example, to train a model defined as a class MyKerasModel.

elasticdl train \
    --model_zoo=$HOME/work \
    --model_def=fintech.MyKerasModel \
    --input_fn=fintech.credit_data_processor \
    --params="hidden_units=[10, 100, 20, 5], learning_rate=0.01" \
    --data="gs://bucket-name/tony/imagenet/train/*.recordio"

The above command line

  1. builds a Docker image containing (1) $HOME/work mapped to /model_zoo/custom, (2) ElasticDL, (3) dependencies of ElasticDL,
  2. submits an ElasticDL job to the Kubernetes cluster as described in $HOME/.kube/config,
  3. prints a URL to the dashboard so that users can inspect the progress and status of the job in their Web browser.

Please be aware that in the class fintech.MyKerasModel, in addition to overriding the method call, we also need to provide methods like

  • default_loss that returns a loss operator,
  • default_optimizer that returns an optimizer operator,
  • default_input that takes a record (string) as its input and returns something that can be batched and consumed by the model.

Because the above example command line specifies --input_fn explicitly, the training job does not use MyKerasModel.default_input but fintech.credit_data_processor instead. Similarly, the command-line options loss and optimizer override MyKerasModel.default_loss and MyKerasModel.default_optimizer.
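The override rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: resolve_components is a hypothetical helper, not part of ElasticDL, and MyKerasModel here is a plain stand-in class rather than a real Keras model.

```python
class MyKerasModel:
    """Stand-in for a Keras model class; a real one would subclass tf.keras.Model."""

    @staticmethod
    def default_loss():
        return "model default loss"

    @staticmethod
    def default_optimizer():
        return "model default optimizer"

    @staticmethod
    def default_input(record):
        # Placeholder parsing of a record string.
        return record.strip()


def credit_data_processor(record):
    """Stand-in for fintech.credit_data_processor."""
    return record.split(",")


def resolve_components(model_cls, input_fn=None, loss=None, optimizer=None):
    """Hypothetical helper: use explicit overrides, else the model's defaults."""
    return {
        "input_fn": input_fn if input_fn is not None else model_cls.default_input,
        "loss": loss if loss is not None else model_cls.default_loss,
        "optimizer": optimizer if optimizer is not None else model_cls.default_optimizer,
    }


# Only the input function is overridden; loss and optimizer fall back to defaults.
components = resolve_components(MyKerasModel, input_fn=credit_data_processor)
```

With this sketch, components["input_fn"] is credit_data_processor, while the loss and optimizer entries still point at the defaults declared on the model class.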

Another important command supports prediction.

elasticdl predict \
    --data="gs://bucket-name/tony/imagenet/test/*.recordio" \
    --trained_model="gs://bucket-name/tony/my_trained_model"

SQLFlow Users

SQLFlow users provide the information required by training or prediction by writing a SQL statement with extended syntax. The syntax for training extends the SELECT statement with the TRAIN clause. For example:

SELECT name, role, salary FROM employee 
TRAIN regressor.DNN
WITH hidden_units=[10, 100, 20, 5], learning_rate=0.01
INTO my_trained_model;

Please be aware that to minimize the syntax extension, SQLFlow doesn’t allow users to specify a directory of models; instead, users can only use pre-built models – regressor.DNN in the above example.

SQLFlow is a gRPC server that takes the above SQL statement and translates it into a Python program known as a submitter. It is the responsibility of the submitter to call kubectl to launch an ElasticDL job on a Kubernetes cluster.

SQLFlow often runs in Docker containers, and it is usually intractable to build a Docker image from within a Docker container, so the submitter requires a pre-built Docker image containing (1) /model_zoo, (2) ElasticDL, (3) dependencies of ElasticDL. The class regressor.DNN is a class defined in some Python source files in /model_zoo.

The submitter might send the statement SELECT name, role, salary FROM employee to the SQL engine, pull the result, and convert it into one or more RecordIO files in which each record is a serialized tf.Example protobuf message. Hence, the input function that ElasticDL uses to parse these records for DNNClassifer.class could be a standardized one, say, sqlflow.elasticdl_input_function.

To predict using a pre-trained model and to write the results into a column of a table, we can do

SELECT name, role FROM testdata
PREDICT testdata.predicted_salary
USING my_trained_model;

Unified API

Both the command line tool elasticdl provided for modelers and the submitter program generated by SQLFlow need to call an API that launches ElasticDL jobs. Hence this design.


We hope the ElasticDL API supports not only batch learning, but also online learning, adversarial learning, reinforcement learning, and federated learning. However, for the moment, let us start with batch learning.

For Training

We propose a function elasticdl.train that can be called like the following:

    elasticdl.train(
        ...,
        params="hidden_units=[10, 100, 20, 5], learning_rate=0.01",
        ...)

Please be aware that most parameters of elasticdl.train are of string-type because the command line options and SQL statements are all strings.
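Because values such as params arrive as strings, the API has to turn them back into Python values at some point. The following is a minimal sketch of one way to do that, assuming the string uses Python keyword-argument syntax with literal values as in the examples above; parse_params is a hypothetical helper name, not an ElasticDL function.

```python
import ast


def parse_params(params: str) -> dict:
    """Parse 'a=[1, 2], b=0.01' into {'a': [1, 2], 'b': 0.01}.

    The string is wrapped in a dict(...) call and parsed (not executed);
    each value is then evaluated with ast.literal_eval, so only literals
    such as numbers, strings, lists, and dicts are accepted.
    """
    call = ast.parse(f"dict({params})", mode="eval").body
    if not isinstance(call, ast.Call) or call.args:
        raise ValueError("params must be a comma-separated list of name=value pairs")
    result = {}
    for keyword in call.keywords:
        if keyword.arg is None:  # rejects '**something'
            raise ValueError("params must not contain ** unpacking")
        result[keyword.arg] = ast.literal_eval(keyword.value)
    return result


parsed = parse_params("hidden_units=[10, 100, 20, 5], learning_rate=0.01")
```

Parsing with ast instead of splitting on commas keeps the list value hidden_units=[10, 100, 20, 5] intact, and avoiding eval means that arbitrary expressions in the string are rejected rather than executed.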

For Prediction

We propose a function elasticdl.predict that can be called like the following:




Model Zoo

When the ElasticDL client or the SQLFlow server calls elasticdl.train, this function calls the Docker API to build a Docker image and then submits the job. The build process should add a model zoo to the Docker image. The function elasticdl.train has a parameter model_zoo, which can be one of the following:

  1. A local directory, for example,

    elasticdl.train(model_zoo="a_local_directory", ...)
  2. A URL pointing to a Git repo

    elasticdl.train(model_zoo="", ...)

A model zoo is a plain Python source directory that is added to /model_zoo in the Docker image. Its root directory must contain a requirements.txt file so that the image build process can install dependencies via

RUN pip install -r /model_zoo/requirements.txt
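As a rough illustration of the image-building step, the sketch below renders a Dockerfile that copies a model zoo into /model_zoo and installs its requirements. It is an assumption-laden example: render_dockerfile and the base image name example/elasticdl-base:latest are hypothetical, and the build context is assumed to be the model zoo directory itself.

```python
def render_dockerfile(base_image: str, zoo_dest: str = "/model_zoo") -> str:
    """Hypothetical helper: return Dockerfile text that adds a model zoo.

    Assumes the Docker build context is the model zoo directory, so
    'COPY . <zoo_dest>' places its contents under zoo_dest.
    """
    lines = [
        f"FROM {base_image}",
        f"COPY . {zoo_dest}",
        f"RUN pip install -r {zoo_dest}/requirements.txt",
    ]
    return "\n".join(lines) + "\n"


# The base image name is a placeholder, not a real published image.
dockerfile_text = render_dockerfile("example/elasticdl-base:latest")
```

The generated text could then be passed to whatever Docker client the implementation uses; that part is left out here because the document does not specify it.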

Suppose that a Keras model class is referred to as regressor.DNN in elasticdl.train(model_def="regressor.DNN", ...); the corresponding Python file should then be under /model_zoo/, and a class regressor.wide_and_deep.MagicalWAD is in a Python file under /model_zoo/regressor/.
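One way to resolve such a dotted model_def into a class is to treat everything before the last dot as a module name and import it with the model zoo directory on the module search path. The sketch below assumes the standard Python package layout inside the model zoo; load_model_class is a hypothetical helper, not an ElasticDL function.

```python
import importlib
import sys


def load_model_class(model_zoo: str, model_def: str):
    """Hypothetical helper: import the class named by a dotted model_def.

    'regressor.wide_and_deep.MagicalWAD' is split into the module name
    'regressor.wide_and_deep' and the class name 'MagicalWAD'.
    """
    module_name, _, class_name = model_def.rpartition(".")
    if not module_name:
        raise ValueError("model_def must look like 'module.ClassName'")
    if model_zoo not in sys.path:
        # Make modules inside the model zoo importable.
        sys.path.insert(0, model_zoo)
    importlib.invalidate_caches()
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

Inside the Docker image, a call such as load_model_class("/model_zoo", "regressor.DNN") would then return the class object, provided that a matching module exists in the model zoo.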

Trained Model

A call to elasticdl.predict looks like the following:


It needs to

  1. build and push a Docker image, and
  2. launch a distributed ElasticDL job of the type “predict”.

The Docker image must contain the model zoo that was used to train the model referenced by trained_model='/filestore/tony/my_keras_model'.

A key question is what information must be in the directory /filestore/tony/my_keras_model.

  1. A Docker image ID.

    We need this ID to refer to the Docker image built during the call to elasticdl.train. This image contains the model zoo used to train the model. elasticdl.predict can then build the Docker image for the distributed prediction job from this ID.

    This image ID must be a pullable ID so that the ElasticDL command-line tool can docker pull it as the base image. An example pullable ID is docker-pullable://

  2. Model class constructor parameters, like hidden_units=[10, 100, 20].

  3. Other parameters passed to elasticdl.train, including
    • model_def
    • input_function
    • loss
    • optimizer
  4. Model parameters as a map from parameter name to parameter value tensors, defined in elasticdl.proto.

We define a new wrapper message:

message TrainedModel {
    string docker_commit_id = 1;
    string model_def = 2;
    string model_def_params = 3;
    string params_filename = 4;
    string input_function = 5;
    string loss = 6;
    string optimizer = 7;