SQLFlow extends SQL syntax to allow SQL programmers, a.k.a., analysts, to invoke models defined by Python programmers, a.k.a., data scientists. For each deployment of SQLFlow service, we refer to the collection of model definitions accessible by analysts as a model zoo. A model zoo contains not only the model definitions but also the trained model parameters, as well as the hyperparameters and other information, which are necessary when we use the model for prediction and other analytics jobs.
This document is about how to define models and how to build a model zoo. For the conclusion, please go to the last section of this document.
The following example SQL statement shows the syntax for training a model.
```sql
SELECT * FROM employee WHERE onboard_year < 2019
TO TRAIN a_data_scientist/regressors:v0.2/MyDNNRegressor
WITH hidden_units=[10,50,5], lr=0.01
COLUMN gender, scale(age, 0, 1), bucketize(address, 100)
LABEL salary
INTO my_first_model;
```
`a_data_scientist/regressors:v0.2` names a Docker image, inside which there is `MyDNNRegressor`, a Python class derived from `tf.keras.Model`, and its dependencies. The WITH clause provides hyperparameters required by the constructor of `MyDNNRegressor` and by the training process. The COLUMN clause specifies how to convert the SELECT result, a table, into model inputs in the form of tensors. LABEL identifies the field used as the label, in the case of supervised learning. The training process saves all the above information, plus the estimated model parameters, under the name `my_first_model`.
The following example fills in the column `predicted_salary` of the table `employee` for rows that represent employees recruited in and after 2019.
```sql
SELECT * FROM employee WHERE onboard_year >= 2019
TO PREDICT employee.predicted_salary
USING my_first_model;
```
Users don’t have to write the COLUMN clause in the above example, as SQLFlow reuses the one from the training statement.
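The fallback can be pictured in a few lines of Python. This is a minimal sketch with a hypothetical in-memory model store; the field names are illustrative, not SQLFlow's actual storage schema:

```python
# Metadata that a TO TRAIN statement would save along with the model.
# The store layout and key names here are hypothetical.
model_store = {
    "my_first_model": {
        "column_clause": "gender, scale(age, 0, 1), bucketize(address, 100)",
    },
}

def resolve_column_clause(stmt_column_clause, model_name):
    """Use the COLUMN clause from the statement if present; otherwise
    fall back to the clause saved with the trained model."""
    if stmt_column_clause:
        return stmt_column_clause
    return model_store[model_name]["column_clause"]

# A prediction statement without COLUMN reuses the training-time clause.
print(resolve_column_clause(None, "my_first_model"))
# gender, scale(age, 0, 1), bucketize(address, 100)
```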
The above training and prediction example reveals some concepts related to models.
- A *dataset* is defined and generated by a SQL SELECT statement, for example, `SELECT * FROM employee WHERE onboard_year < 2019`.
- A *data source* is a table, for example, `employee`. Note that we can cache and reuse a dataset until its data sources change.
- A *model definition*, for example, `MyDNNRegressor`, is a Python class or other kind of code that specifies a model.
- The *data conversion* from SQL data to model inputs is specified by the COLUMN and, optionally, LABEL clauses.
- *Model hyperparameters* are key-value pairs specified in the WITH clause. They are arguments to the constructor of the model definition.
- *Training hyperparameters* also appear in the WITH clause, like model hyperparameters do. They affect the training process.
- *Data conversion hyperparameters* appear in COLUMN and optionally LABEL, for example, the scale range and bucket size in the first example.
- A *trained model* consists of all the above, plus the estimated model parameters.
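The data conversions named in the COLUMN clause can be pictured in plain Python. The sketch below assumes, purely for illustration, an observed age range of [18, 65] and hash-based bucketing; SQLFlow's actual feature columns may behave differently:

```python
import hashlib

def scale(value, lo, hi):
    """Linearly rescale a raw value into the range [lo, hi].
    The observed range [18, 65] is an assumption for this example."""
    observed_min, observed_max = 18.0, 65.0
    unit = (value - observed_min) / (observed_max - observed_min)
    return lo + unit * (hi - lo)

def bucketize(value, num_buckets):
    """Map a categorical value (e.g. an address string) into one of
    num_buckets integer buckets via a deterministic hash."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

print(scale(18, 0, 1))   # 0.0
print(scale(65, 0, 1))   # 1.0
print(0 <= bucketize("1 Main St", 100) < 100)  # True
```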
In SQLFlow SQL grammar, the identifiers after TO TRAIN and after INTO have different meanings: the identifier after TO TRAIN is the name of a model definition, while the identifier after INTO is a model ID referring to a trained model. Please refer to the sections below for the definition of a model ID.
A key to this design is to recognize that both the model definition and the trained model have versions. Suppose that an analyst trained a model using the definition in `my_dnn_regressor.py` and got `my_first_model`; soon after that, a data scientist changed `my_dnn_regressor.py`, and the analyst re-trained the model into `my_second_model` with slightly modified hyperparameter settings. The analyst must be aware that these two models were trained with different versions of `my_dnn_regressor.py`. Then, if the analyst decides to publish a trained model for online prediction, he should choose to use `my_second_model` rather than `my_first_model`, and the serving system must load the version of `my_dnn_regressor.py` that trained it.
We can use version management tools like Git and release engineering tools like Docker, and save each trained model under a unique name. Here follows our proposal.
- A collection of model definitions is a Git repository of source files.
- To describe dependencies, we require a Dockerfile at the root directory of the repository.
- To release a repository, we check out the specific version and run `docker build` and `docker push` with the Dockerfile.
- Each trained model is saved under the current user’s namespace with a unique name.
Model developers can use their own Git repository, or another code version control system, to store the model definition Python files. The Python files should be organized into one directory as a Python package, with a `Dockerfile` at the root directory of the repository for building the image:

```
your_model_package/
    __init__.py
    some_model.py
    test_some_model.py
Dockerfile
```
Here is a template of what the `Dockerfile` should look like:

```dockerfile
# The base image sqlflow/modelzoo_base has the SQLFlow environment prepared.
FROM sqlflow/modelzoo_base
RUN pip install scikit-learn six
# Copy the model definition code to /sqlflow_models.
ADD your_model_package /sqlflow_models
# Add /sqlflow_models to the PYTHONPATH environment variable.
ENV PYTHONPATH /sqlflow_models
```
Model developers can do continuous development using this repository, forking branches, adding tags, running unit tests, etc.
When the model developer needs to publish the model for SQLFlow to use, they need to build the Docker image and push the image to a central repository.
```shell
docker build -t a_data_scientist/regressors:v0.2 .
docker push a_data_scientist/regressors:v0.2
```
You can publish the image to a public Docker registry like https://hub.docker.com/ or to a private Docker registry in your company that SQLFlow can access. Make sure that the published Docker image allows public access (anyone can pull the image without credentials) so that SQLFlow can pull the image and run the training steps.
For a SELECT program using the SQLFlow syntax extension, the SQLFlow server converts it into a workflow and submits the workflow job to a workflow engine like Argo or Tekton on Kubernetes. Each step in the workflow executes one SQL statement.
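The statement-per-step translation can be sketched as follows. The naive split on `;` and the step dictionaries are illustrative only; real SQLFlow parses statements with a proper parser:

```python
# Sketch: turn a multi-statement SQL program into workflow steps,
# one statement per step.  Splitting on ';' is a simplification.
def to_workflow_steps(sql_program):
    statements = [s.strip() for s in sql_program.split(";") if s.strip()]
    return [{"step": i + 1, "sql": stmt} for i, stmt in enumerate(statements)]

program = """
SELECT * FROM employee WHERE onboard_year < 2019
TO TRAIN a_data_scientist/regressors:v0.2/MyDNNRegressor
WITH hidden_units=[10,50,5], lr=0.01
LABEL salary INTO my_first_model;
SELECT * FROM employee WHERE onboard_year >= 2019
TO PREDICT employee.predicted_salary USING my_first_model;
"""

steps = to_workflow_steps(program)
print(len(steps))  # 2
```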
By default, SQLFlow runs the training, prediction, or explanation job in a default Docker image, which contains the pre-made TensorFlow estimator models, the Keras models defined in the sqlflow_models repository, and XGBoost. To use a custom model definition Docker image, write a SQL statement like the one mentioned above:
```sql
SELECT * FROM employee WHERE onboard_year < 2019
TO TRAIN a_data_scientist/regressors:v0.2/MyDNNRegressor
WITH model.hidden_units=[1024,256]
LABEL salary
INTO my_first_model;
```
Then the generated workflow will use the Docker image `a_data_scientist/regressors:v0.2` to run the statement. In this step, SQLFlow generates the training Python program inside the Docker container and executes it. Once the job completes, the trained model is saved together with the hyperparameters, the Docker image, the evaluation result, and the SQL statement used. So one trained model contains:
- The trained model weights.
- The Docker image, e.g. `a_data_scientist/regressors:v0.2`.
- The model class name, e.g. `MyDNNRegressor`.
- The hyperparameters, including model parameters, training parameters, and data conversion parameters.
- The model evaluation result JSON.
- The full SQL statement.
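Put together, the saved record might look like the following Python dictionary. All field names and the storage path are hypothetical, mirroring the bullet list above rather than SQLFlow's actual schema:

```python
# Hypothetical shape of one trained-model record.
trained_model = {
    "name": "my_first_model",
    "weights_path": "storage://bucket/my_first_model/weights",  # assumed location
    "docker_image": "a_data_scientist/regressors:v0.2",
    "model_class": "MyDNNRegressor",
    "hyperparameters": {
        "model": {"hidden_units": [10, 50, 5]},
        "training": {"lr": 0.01},
        "data_conversion": {"scale_range": [0, 1], "bucket_size": 100},
    },
    "evaluation": {},  # filled in from the evaluation result JSON
    "sql": "SELECT * FROM employee WHERE onboard_year < 2019 TO TRAIN ...",
}

print(trained_model["docker_image"])  # a_data_scientist/regressors:v0.2
```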
Publishing a trained model requires more steps. We need a public registry like DockerHub.com, and we also need a public server to store trained models so that data scientists can publish and share them.
We propose the following extended SQL syntax for an analyst, say `an_analyst`, to publish her trained model.
```sql
SQLFLOW PUBLISH my_first_model [TO https://models.sqlflow.org/user_name]
```
This statement uploads the model parameters and other information to the registry service, which defaults to https://models.sqlflow.org, under the account `user_name`, which defaults to `an_analyst` in the above example.
Then, another analyst should be able to use the trained model by referring to it by its full name:
```sql
SELECT ... TO PREDICT employee.predicted_salary
USING models.sqlflow.org/an_analyst/my_first_model
```
We describe the details of how to implement models.sqlflow.org in model_market.
After all, what is a model zoo? A model zoo refers to all the model definitions and trained models accessible by a deployment of the SQLFlow server. It contains one or more model definition Docker images, plus the source code repositories that build those images. It also includes the model zoo table configured to work with all submitter programs generated by the deployment.
Within a deployment, it is straightforward to share a trained model. If the analyst `an_analyst` in the above example wants to use her own trained model `my_first_model` for prediction, she can use the short name:
```sql
SELECT ... TO PREDICT ... USING my_first_model
```
If another analyst wants to use the trained model, he would need to use the name qualified with the owner's account:
```sql
SELECT ... TO PREDICT ... USING an_analyst/my_first_model
```
Use a published model to make predictions on new data:
```sql
SELECT ... TO PREDICT employee.predicted_salary
USING models.sqlflow.org/an_analyst/my_first_model
```
Use a published model to train on a new dataset:
```sql
SELECT * FROM employee WHERE onboard_year < 2019
TO TRAIN an_analyst/regressors:v0.2/MyDNNRegressor
USING models.sqlflow.org/an_analyst/my_first_model;
```
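The name resolution implied by the examples above can be sketched as follows. The default registry and the use of the current user as the default account are assumptions for illustration, not confirmed SQLFlow behavior:

```python
# Sketch: resolve the three forms of a trained-model name.
DEFAULT_REGISTRY = "models.sqlflow.org"  # assumed default

def resolve_trained_model(name, current_user):
    """Return (registry, user, model) for a short, user-qualified,
    or fully qualified trained-model name."""
    parts = name.split("/")
    if len(parts) == 1:            # my_first_model
        return DEFAULT_REGISTRY, current_user, parts[0]
    if len(parts) == 2:            # an_analyst/my_first_model
        return DEFAULT_REGISTRY, parts[0], parts[1]
    # models.sqlflow.org/an_analyst/my_first_model
    return parts[0], parts[1], "/".join(parts[2:])

print(resolve_trained_model("my_first_model", "an_analyst"))
# ('models.sqlflow.org', 'an_analyst', 'my_first_model')
```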
There are three roles in the ecosystem of SQLFlow:
- the tool developers who use Go/C++ to create SQLFlow,
- the model developers, or data scientists, who use Python to define, say, Keras models, and
- the analysts, who use SQL to train models, or to use trained models for prediction or model explanation.
Any data scientist can create an arbitrary number of model zoos, and each model zoo can contain any number of model definitions. There are some concepts from the perspective of a data scientist:

- A model zoo is a Git repo, say, github.com/a_data_scientist/regressors.
- A model zoo is built and published in the form of a Docker image, say, a_data_scientist/regressors:v0.2.
From the perspective of an analyst, say `an_analyst`, who is going to use the model definition `DNNRegressor` defined in github.com/a_data_scientist/regressors, s/he could refer to the model definition by

- dockerhub.com/a_data_scientist/regressors:v0.2/DNNRegressor, which is the full name,
- dockerhub.com/a_data_scientist/regressors/DNNRegressor, if the Docker image tag is the default, latest,
- a_data_scientist/regressors/DNNRegressor, if dockerhub.com is the default Docker registry, or
- regressors/DNNRegressor, if the current user is the model developer a_data_scientist.
`DNNRegressor` could be in any file in any directory of the repository github.com/a_data_scientist/regressors. The only requirement is that, when the submitter program trains the model, it can run a Docker command, or an equivalent Docker API call, to train the model.
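The four naming forms above can be resolved with a small parser. Here is a sketch, assuming dockerhub.com as the default registry and latest as the default tag; both defaults are assumptions of this example:

```python
# Sketch: parse a model-definition name into (registry, image, tag, class).
DEFAULT_REGISTRY = "dockerhub.com"  # assumed default
DEFAULT_TAG = "latest"              # assumed default

def resolve_model_definition(name, current_user):
    parts = name.split("/")
    class_name = parts[-1]
    image_parts = parts[:-1]
    if len(image_parts) == 1:            # regressors/DNNRegressor
        image_parts = [current_user] + image_parts
    if "." not in image_parts[0]:        # no registry host given
        image_parts = [DEFAULT_REGISTRY] + image_parts
    registry, user, repo = image_parts[0], image_parts[1], image_parts[2]
    if ":" in repo:
        repo, tag = repo.split(":")
    else:
        tag = DEFAULT_TAG
    return registry, f"{user}/{repo}", tag, class_name

print(resolve_model_definition(
    "dockerhub.com/a_data_scientist/regressors:v0.2/DNNRegressor", "someone"))
# ('dockerhub.com', 'a_data_scientist/regressors', 'v0.2', 'DNNRegressor')
```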
A trained model could have its name in any of the following forms, where `a_dbms_server.somewhere.com` defaults to the SQL engine that hosts the data source, and `a_database_project` defaults to the current database. The parameters and metadata of `my_first_model` are stored in a row of the model zoo database table.