Model Definition and Model Zoo


SQLFlow extends SQL syntax to allow SQL programmers, a.k.a., analysts, to invoke models defined by Python programmers, a.k.a., data scientists. For each deployment of SQLFlow service, we refer to the collection of model definitions accessible by analysts as a model zoo. A model zoo contains not only the model definitions but also the trained model parameters, as well as the hyperparameters and other information, which are necessary when we use the model for prediction and other analytics jobs.

This document is about how to define models and how to build a model zoo. For the conclusion, please go to the last section of this document.


The following example SQL statement shows the syntax for training a model.

SELECT * FROM employee WHERE onboard_year < 2019
TO TRAIN a_data_scientist/regressors:v0.2/MyDNNRegressor
WITH hidden_units=[10,50,5], lr=0.01
COLUMN gender, scale(age, 0, 1), bucketize(address, 100) 
LABEL salary 
INTO my_first_model;

The string a_data_scientist/regressors:v0.2 names a Docker image, inside which, there is MyDNNRegressor, a Python class derived from tf.keras.Model, and its dependencies. The WITH clause provides hyperparameters required by the constructor of MyDNNRegressor and the training process. The COLUMN clause specifies how to convert the SELECT result, a table, into model inputs in the form of tensors. LABEL identifies the field used as the label, in the case of supervised learning. The training process saves all the above information, plus the estimated model parameters, under the name my_first_model.

The following example fills in the column predicted_salary of the table employee for rows that represent employees recruited in and after 2019.

SELECT * FROM employee WHERE onboard_year >= 2019 
TO PREDICT employee.predicted_salary
USING my_first_model;

Users don’t have to write the COLUMN clause in the above example, as SQLFlow reuses the one in the training statement.


The above training and prediction example reveals some concepts related to models.

  1. A dataset is defined and generated by a SQL SELECT statement. For example, SELECT * FROM employee WHERE onboard_year < 2019.

  2. A data source is a table, for example, employee. Please be aware that we can cache and reuse a dataset until its data sources are changed.

  3. A model definition, for example, MyDNNRegressor, is a Python class or other kinds of code that specify a model.

  4. The SQL data to model inputs conversion, or data conversion, is the COLUMN and optionally LABEL clauses.

  5. Hyperparameters
    1. Model hyperparameters are some key-value pairs specified in the WITH clause. They are arguments to the constructor of the model definition.
    2. Training hyperparameters appear in WITH, as model constructors do. They affect the training process.
    3. Data conversion hyperparameters appear in COLUMN and optionally LABEL. For example, the scale range and bucket size in the first example.
  6. A trained model consists of all the above concepts and the estimated model parameters.

In SQLFlow SQL grammar, the identifiers after TRAIN, USING and INTO have different meanings:

  1. TRAIN IDENT: IDENT is the name of a model definition.
  2. USING IDENT: IDENT is the model ID referring to a trained model, please refer to the below sections for the definition of model ID.
  3. INTO IDENT: IDENT is the model ID referring to a trained model.

The Design

Versioning and Releasing

A key to this design is the representation of the above concepts. It is not a straightforward solution. For example, in prior work and previous discussions, engineers proposed to represent each model definition by a Python source code file hosted on an FTP service and downloaded when some users file a model training statement. This intuitive proposal could lead to inconsistencies. Suppose that an analyst trained a model using the definition in and got my_first_model; soon after that, a data scientist changed and the analyst re-trained the model into my_second_model with slightly modified hyperparameter settings. The analyst might expect that both models share the same definition; however, they don’t, and worse than that, there is no mechanism to remind the change of the model definition to the analyst.

Such design flaw roots from the ignoring of the fact that model definitions are code that has versions. Another important aspect is that we often build the source code into some release-ready form. Once we noticed these facts, we can use version management tools like Git and release engineering tools like Docker. Here follows our proposal.

  1. A collection of model definitions is a Git repository of source files.
  2. To describe dependencies, we require a Dockerfile at the root directory of the repository.
  3. To release a repository, we checkout the specific version and run docker build and docker push with the Dockerfile.

Submitter Programs

For a SELECT program using the SQLFlow syntax extension, the SQLFlow server converts it into a submitter program, usually in Python.

Consider the training statement at the beginning of this document. Suppose that the Python class MyDNNRegressor is in source file in the repository, which builds into Docker image a_data_scientist/regressors. Let us also assume that the submitter program is /var/sqlflow/submitters/an_analyst/ SQLFlow can run the submitter using the Docker image as follows:

docker run --rm -it \
  -v /var/sqlflow/submitters:/submitters \
  a_data_scientist/regressors \
    python /submitters/an_analyst/

To submit an ElasticDL training job, we need the following actions:

  1. The SQLFlow server invokes codegen_elasticdl.go to generates the submitter program
  2. calls the ElasticDL client API to submit a job.
  3. The SQLFlow server then runs in a Docker container (if SQLFlow service is not running in a Kubernetes cluster), or a Pod, which has ElasticDL client library installed.
  4. As this container needs to contain ElasticDL and the model definition, all model Docker images should derive from our base image, which installs ElasticDL by default:
FROM sqlflow/sqlflow_model_base

If we run SQLFlow in a Docker container, to allow it to run another Docker container, we must enable the Docker-in-Docker feature, say, following this blog post. Suppose that the SQLFlow server runs in a Kubernetes cluster, where we cannot enable Docker-in-Docker. In such a case, we can make the SQLFlow server call Kubernetes API to run a Pod that executes the above docker run command line.

The training submitter program, running in the model definition Docker container, should be able to submit a (distributed) training job to a preconfigured cluster. Once the job completes, the submitter program adds, or edits, a row in a model zoo table.

Model Zoo Data Schema

The model zoo table is in a database deployed as part of the SQLFlow service. This database might not be the one that holds data sources. The only requirement of the model zoo database is to have two tables: the trained models table and the evaluation result table.

Trained Models

Once a training job completes, the submitter program adds/updates a row of the the trained models table, which contains (at least) the following fields.

  1. The model ID, specified by the INTO clause, or my_first_model in the above example.
  2. The creator, as defined in the INTO clause, or an_analyst in the above example.
  3. The model zoo release, which is a Docker image commit ID, or a_data_scientist/regressors in the above example.
  4. The model definition, which is a Python class name, or DNNRegressor in the above example.
  5. The submitter program, the source code of the submitter program, in the above example, or its MD5 hash.
  6. The data converter, including the COLUMN and LABEL clauses.
  7. The model parameter file path, the path to the trained model parameters on the distributed filesystem of the cluster.

It is necessary to have the model ID so users can refer to the trained model when they want to use it. Suppose that the user typed the prediction SQL statement at the beginning of this document. SQLFlow server will convert it into a submitter program and run it with the Docker image used to train the model. Therefore, the Docker image ID is also required. The model parameter path allows the prediction submitter program to locate and load the trained models. The data converter helps the prediction submitter to use the conversion rules consistent with the ones used when training.

It is necessary to record the content or the MD5 hash of the training submitter program in the model zoo table for experiment management. Please be aware that the training submitter encodes all three categories of hyperparameters, as listed in the above sections. Suppose that the analyst re-trains the model with different hyperparameter settings, the training submitter changes accordingly, and SQLFlow should be able to remind the analyst to either uses a new model ID or overwrites the existing row in the model zoo table.

We recommend reusing the DBMS configured as the data source of SQLFlow for storing model zoo. Following this recommendation, users can query the trained models using SQL. For example, the following SQL statement lists all models trained by an_analyst:

SELECT * FROM sqlflow.trained_models WHERE creator="an_analyst"

The following statement queries all model zoos used by an_analyst to train his/her models.

SELECT DISTINCT model_zoo FROM sqlflow.trained_models WHERE creator="an_analyst"

Users can checkout the model list and saved models published by through the web site: Then the user can use a SQL statement like SELECT ... TO PREDICT ... USING to use that saved model, just by specifying a model from after USING clause. See below section for more details.

The Model Evaluation Table

Once an evaluation completes, the submitter program adds/updates a row of the evaluation result table, which contains the following fields:

  1. model ID
  2. evaluation dataset
  3. metrics

Different kinds of models might use various metrics, so the field metrics might be string-typed and saves a JSON, like

   "recall": 0.45,
   "precision": 0.734

Model Sharing

After all, what is a model zoo? A model zoo refers to all the model definitions and trained models accessible by a setup of the SQLFlow server. It contains one or more model definition Docker images and the source code repositories that build the images. It also includes the model zoo table configured to work with all submitter programs generated by the deployment.

Within a deployment, it is straightforward to share a trained model. If the analyst, an_analyst, in the above example wants to use her own trained model my_first_model for prediction, she could use the short name my_first_model.

SELECT ... TO PREDICT ... USING my_first_model

If another analyst wants to use the trained model, he would need to use the full name as an_analyst/my_first_model.

SELECT ... TO PREDICT ... USING an_analyst/my_first_model

Use a published model make some predictions on new data:

SELECT ... TO PREDICT employee.predicted_salary USING

Use a published model to train on a new dataset:

SELECT * FROM employee WHERE onboard_year < 2019
TO TRAIN an_analyst/regressors:v0.2/MyDNNRegressor

Model Publication

Publishing a model definition, we hope all users of all SQLFlow deployments can train the model using their datasets. To achieve this goal, data scientists need to set up a continuous deployment system to automatically build their model definition repositories into Docker images and push to a public Docker registry like In the above example, the Docker image a_data_scientist/regressors is one hosted on

It requires more steps to publish a trained model. We need a public registry like, but hosts rows from the model zoo table and corresponding model parameters. We propose the following extended SQL syntax for analysts an_analyst to publish her trained model.

SQLFLOW PUBLISH my_first_model

This statement uploads the model parameters and other information to the registry service, which defaults to, and under the account user_name, which defaults to an_analyst in the above example.

Then, another analyst should be able to use the trained model by referring to it in its full name.

SELECT ... TO PREDICT employee.predicted_salary USING


There are three roles in the ecosystem of SQLFlow:

  1. the tool developers who use Go/C++ to create SQLFlow,
  2. the model developers, or data scientists, who use Python to define, say, Keras models, and
  3. the analysts, who use SQL to train models, or to use trained models for prediction or model explanation.

Any data scientist can create an arbitrary number of model zoos, and in each model zoo, there could be any number of model definitions. There are some concepts from the perspective of a data scientist a_data_scientist:

  1. A model zoo is a Git repo, say,
  2. A model zoo is built and published in the form of a Docker image, say,

From the perspective of an analyst, say, an_analyst, who is going to use the model definition DNNRegressor defined in, s/he could refer to the model definition by

  •, which is the full name,
  •, if the Docker image tag is latest,
  • a_data_scientist/regressor/DNNRegressor, if is the default Docker registry, or
  • regressor/DNNRegressor, if the user is the model developer a_data_scientist.

The class DNNRegressor could be in any file in any directory of the repository The only requirement is that when the submitter program trains the model, it can run a Docker command, or an equivalent Docker API call, to train the model.

A trained model could have its name in any of the following forms:

  • my_first_model
  • an_analyst/my_first_model
  • a_database_project/an_analyst/my_first_model
  • defaults to the SQL engine that hosts the data source. a_database_project defaults to sqlflow. an_analyst and my_first_model are in a row in the database table trained_models.