Model Definition and Model Zoo

Introduction

SQLFlow extends the SQL syntax to allow SQL programmers, a.k.a. analysts, to invoke models defined by Python programmers, a.k.a. data scientists. For each deployment of the SQLFlow service, we refer to the collection of model definitions accessible by analysts as a model zoo. A model zoo contains not only the model definitions but also the trained model parameters, as well as the hyperparameters and other information necessary when we use a model for prediction and other analytics jobs.

This document explains how to define models and how to build a model zoo. The last section summarizes the conclusions.

Background

The following example SQL statement shows the syntax for training a model.

SELECT * FROM employee WHERE onboard_year < 2019
TO TRAIN a_data_scientist/regressors:v0.2/MyDNNRegressor
WITH hidden_units=[10,50,5], lr=0.01
COLUMN gender, scale(age, 0, 1), bucketize(address, 100) 
LABEL salary 
INTO my_first_model;

The string a_data_scientist/regressors:v0.2 names a Docker image that contains MyDNNRegressor, a Python class derived from tf.keras.Model, together with its dependencies. The WITH clause provides hyperparameters required by the constructor of MyDNNRegressor and by the training process. The COLUMN clause specifies how to convert the SELECT result, a table, into model inputs in the form of tensors. LABEL identifies the field used as the label in the case of supervised learning. The training process saves all the above information, plus the estimated model parameters, under the name my_first_model.
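
To make this concrete, here is a minimal sketch of what MyDNNRegressor might look like, assuming the hidden_units value from the WITH clause is passed to the constructor while lr configures the training process; the layer choices below are illustrative, not the actual implementation:

import tensorflow as tf


class MyDNNRegressor(tf.keras.Model):
    """A sketch of a model definition: a stack of fully-connected
    layers followed by a single-output regression layer."""

    def __init__(self, hidden_units=(10, 50, 5), **kwargs):
        super().__init__(**kwargs)
        # One Dense layer per entry of the hidden_units hyperparameter.
        self.hidden_layers = [
            tf.keras.layers.Dense(units, activation="relu")
            for units in hidden_units
        ]
        self.prediction_layer = tf.keras.layers.Dense(1)  # predicts the salary

    def call(self, inputs, training=False):
        x = inputs
        for layer in self.hidden_layers:
            x = layer(x)
        return self.prediction_layer(x)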

The following example fills in the column predicted_salary of the table employee for rows that represent employees recruited in and after 2019.

SELECT * FROM employee WHERE onboard_year >= 2019 
TO PREDICT employee.predicted_salary
USING my_first_model;

Users don’t have to write the COLUMN clause in the above example, as SQLFlow reuses the one in the training statement.

Concepts

The above training and prediction example reveals some concepts related to models.

  1. A dataset is defined and generated by a SQL SELECT statement. For example, SELECT * FROM employee WHERE onboard_year < 2019.

  2. A data source is a table, for example, employee. Note that a dataset can be cached and reused until its data sources change.

  3. A model definition, for example, MyDNNRegressor, is a Python class or other kinds of code that specify a model.

  4. The conversion from SQL data to model inputs, or data conversion for short, is specified by the COLUMN clause and, optionally, the LABEL clause (a concrete sketch follows this list).

  5. Hyperparameters
    1. Model hyperparameters are some key-value pairs specified in the WITH clause. They are arguments to the constructor of the model definition.
    2. Training hyperparameters also appear in the WITH clause, alongside the model hyperparameters. They affect the training process; lr in the first example is one of them.
    3. Data conversion hyperparameters appear in COLUMN and optionally LABEL. For example, the scale range and bucket size in the first example.
  6. A trained model comprises all of the above plus the estimated model parameters.
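
To make the data conversion concrete, here is one hypothetical way the COLUMN clause of the first example might translate into TensorFlow feature columns; the data statistics and encoding choices are assumptions for illustration, not SQLFlow's actual generated code:

import tensorflow as tf

# Assumed data statistics for scaling; a real implementation would
# derive them from the data source.
AGE_MIN, AGE_MAX = 18.0, 65.0

feature_columns = [
    # gender: a categorical column, one-hot encoded here.
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_hash_bucket(
            "gender", hash_bucket_size=8)),
    # scale(age, 0, 1): rescale age into the range [0, 1].
    tf.feature_column.numeric_column(
        "age", normalizer_fn=lambda x: (x - AGE_MIN) / (AGE_MAX - AGE_MIN)),
    # bucketize(address, 100): hash address strings into 100 buckets.
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_hash_bucket(
            "address", hash_bucket_size=100)),
]

# A DenseFeatures layer converts a dict of input tensors into the
# dense input tensor that the model consumes.
input_layer = tf.keras.layers.DenseFeatures(feature_columns)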

In the SQLFlow SQL grammar, the identifiers after TO TRAIN, USING, and INTO have different meanings:

  1. TO TRAIN IDENT: IDENT names a model definition.
  2. USING IDENT: IDENT is the model ID of an existing trained model; the sections below define model IDs.
  3. INTO IDENT: IDENT is the model ID under which to save the trained model.

The Design

Versioning and Releasing

A key point of this design is that both model definitions and trained models have versions. Suppose that an analyst trained a model using the definition in my_dnn_regressor.py and got my_first_model; soon after that, a data scientist changed my_dnn_regressor.py, and the analyst re-trained the model into my_second_model with slightly modified hyperparameter settings. The analyst must be aware that the two trained models come from different versions of my_dnn_regressor.py, and that using either model requires the same version of the definition that trained it. So if the analyst decides to publish my_second_model for online prediction, the serving system must pair it with the corresponding version of my_dnn_regressor.py, and must not confuse it with my_first_model.

We can use version management tools like Git and release engineering tools like Docker, and save each trained model under a unique name. Our proposal follows.

  1. A collection of model definitions is a Git repository of source files.
  2. To describe dependencies, we require a Dockerfile at the root directory of the repository.
  3. To release a repository, we check out the specific version and run docker build and docker push with the Dockerfile.
  4. Each trained model is saved under the current user’s namespace with a unique name.

Develop Custom Model Definition

Model developers can use their own Git repository, or another version control system, to store the Python source files of their model definitions. The Python files should go into one directory as a Python package, with a Dockerfile under the root directory for building the image:

- your_model_package/
-- __init__.py
-- some_model.py
-- test_some_model.py
- Dockerfile
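
Unit tests can live beside the model code, as test_some_model.py does in the layout above. A minimal sketch, assuming some_model.py defines SomeModel, a tf.keras.Model subclass whose constructor accepts hidden_units (all names here are the placeholders from the layout):

import unittest

import numpy as np
import tensorflow as tf

from some_model import SomeModel  # placeholder module and class from the layout


class TestSomeModel(unittest.TestCase):
    def test_forward_pass(self):
        # Build a tiny model and check the output shape of one forward pass.
        model = SomeModel(hidden_units=[8, 4])
        x = np.random.rand(2, 5).astype(np.float32)  # batch of 2, 5 features
        y = model(tf.constant(x))
        self.assertEqual(y.shape, (2, 1))


if __name__ == "__main__":
    unittest.main()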

Here is a template of what the Dockerfile should look like:

# The base image sqlflow/modelzoo_base has the SQLFlow environment prepared.
FROM sqlflow/modelzoo_base
# Install extra dependencies required by the model definitions.
RUN pip install scikit-learn six
# Copy the model definition code into /sqlflow_models.
ADD your_model_package /sqlflow_models
# Put /sqlflow_models on PYTHONPATH so the model classes can be imported.
ENV PYTHONPATH /sqlflow_models

Model developers can develop continuously in this repository: creating branches, adding tags, running unit tests, and so on.

Build and Publish the Docker Image

When model developers want to publish a model definition for SQLFlow to use, they build the Docker image and push it to a central registry.

docker build -t a_data_scientist/regressors:v0.2 .
docker push a_data_scientist/regressors:v0.2

You can publish the image to a public Docker registry like https://hub.docker.com/ or to a private Docker registry inside your company that SQLFlow can access. Make sure that the published Docker image is publicly accessible (anyone can pull the image without credentials) so that SQLFlow can pull the image and run the training steps.

Use the Model Definition in the Docker Image

For a SELECT program using the SQLFlow syntax extension, the SQLFlow server converts it into a workflow and submits the workflow job to a workflow engine like Argo/Tekton on Kubernetes. Each step in the workflow is one SQL statement.

By default, SQLFlow runs the training, prediction, or explanation job with a default Docker image, which contains the pre-made TensorFlow estimator models, the Keras models defined in the sqlflow_models repo, and XGBoost. To use a custom model definition Docker image, name it in the TO TRAIN clause:

SELECT * FROM employee WHERE onboard_year < 2019
TO TRAIN a_data_scientist/regressors:v0.2/MyDNNRegressor
WITH model.hidden_units=[1024,256]
LABEL salary
INTO my_first_model;

The generated workflow then uses the Docker image a_data_scientist/regressors:v0.2 to run the statement. In this step, SQLFlow generates the training Python program inside the Docker container and executes it. Once the job completes, SQLFlow saves the trained model together with the hyperparameters, the Docker image name, the evaluation result, and the SQL statement used. So one trained model comprises (see the sketch after this list):

  1. Trained model weights.
  2. The Docker image, e.g., a_data_scientist/regressors:v0.2.
  3. Model class name, like MyDNNRegressor.
  4. Hyperparameters including model parameters, training parameters, and data conversion parameters.
  5. The model evaluation result JSON.
  6. The full SQL statement.
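
Purely as an illustration, such a record can be pictured as the following structure; the field names are hypothetical, only the contents mirror the list above:

# A hypothetical record of a trained model; the field names are
# illustrative, not SQLFlow's actual storage schema.
trained_model = {
    "name": "my_first_model",
    "weights": "<serialized model weights>",             # 1. trained weights
    "image": "a_data_scientist/regressors:v0.2",         # 2. Docker image
    "class_name": "MyDNNRegressor",                      # 3. model class name
    "hyperparameters": {                                 # 4. model, training, and
        "model.hidden_units": [1024, 256],               #    data conversion params
    },
    "evaluation": {},                                    # 5. evaluation result JSON
    "sql": "SELECT * FROM employee WHERE onboard_year < 2019 "
           "TO TRAIN a_data_scientist/regressors:v0.2/MyDNNRegressor ...",  # 6. full SQL
}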

Publish Trained Model

Publishing a trained model requires more steps. We need a public registry like DockerHub.com, and we also need a public server to store trained models, so that data scientists can publish and share them.

We propose the following extended SQL syntax for an analyst, say an_analyst, to publish her trained model.

SQLFLOW PUBLISH my_first_model
    [TO https://models.sqlflow.org/user_name]

This statement uploads the model parameters and other information to the registry service, which defaults to https://models.sqlflow.org, under the account user_name, which defaults to the current user, an_analyst in the above example.

Then, another analyst should be able to use the trained model by referring to it by its full name.

SELECT ... TO PREDICT employee.predicted_salary USING models.sqlflow.org/an_analyst/my_first_model

We describe the details of how to implement models.sqlflow.org in model_market.

Sharing Models

After all, what is a model zoo? A model zoo refers to all the model definitions and trained models accessible by a deployment of the SQLFlow server. It contains one or more model definition Docker images, together with the source code repositories from which those images are built. It also includes the model zoo table configured to work with all submitter programs generated by the deployment.

Within a deployment, it is straightforward to share a trained model. If the analyst, an_analyst, in the above example wants to use her own trained model my_first_model for prediction, she could use the short name my_first_model.

SELECT ... TO PREDICT ... USING my_first_model

If another analyst wants to use the trained model, he would need to use the qualified name an_analyst/my_first_model.

SELECT ... TO PREDICT ... USING an_analyst/my_first_model

Use a published model to make predictions on new data:

SELECT ... TO PREDICT employee.predicted_salary USING models.sqlflow.org/an_analyst/my_first_model

Use a published model to train on a new dataset:

SELECT * FROM employee WHERE onboard_year < 2019
TO TRAIN an_analyst/regressors:v0.2/MyDNNRegressor
USING models.sqlflow.org/an_analyst/my_first_model;

Summary

There are three roles in the ecosystem of SQLFlow:

  1. the tool developers who use Go/C++ to create SQLFlow,
  2. the model developers, or data scientists, who use Python to define, say, Keras models, and
  3. the analysts, who use SQL to train models, or to use trained models for prediction or model explanation.

Any data scientist can create an arbitrary number of model zoos, and each model zoo can contain any number of model definitions. From the perspective of a data scientist, say a_data_scientist:

  1. A model zoo is a Git repo, say, github.com/a_data_scientist/regressors.
  2. A model zoo is built and published in the form of a Docker image, say, dockerhub.com/a_data_scientist/regressors:v0.2.

From the perspective of an analyst, say, an_analyst, who is going to use the model definition DNNRegressor defined in github.com/a_data_scientist/regressors, s/he could refer to the model definition by

  • dockerhub.com/a_data_scientist/regressors:v0.2/DNNRegressor, which is the full name,
  • dockerhub.com/a_data_scientist/regressors/DNNRegressor, if the Docker image tag is latest,
  • a_data_scientist/regressors/DNNRegressor, if dockerhub.com is the default Docker registry, or
  • regressors/DNNRegressor, if the user is the model developer a_data_scientist.

The class DNNRegressor could be in any file in any directory of the repository github.com/a_data_scientist/regressors. The only requirement is that, when the submitter program trains the model, it can run a Docker command, or an equivalent Docker API call, to do so.

A trained model could have its name in any of the following forms:

  • my_first_model
  • an_analyst/my_first_model
  • a_database_project/an_analyst/my_first_model
  • a_dbms_server.somewhere.com/a_database_project/an_analyst/my_first_model

a_dbms_server.somewhere.com defaults to the SQL engine that hosts the data source. a_database_project defaults to sqlflow. an_analyst and my_first_model identify a row in the database table trained_models.
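
The defaulting rules can be sketched in Python as follows; the function and its arguments are hypothetical, not SQLFlow's actual code:

def resolve_trained_model_id(ident, dbms_server, current_user, project="sqlflow"):
    """Expand a trained model name into its full four-part form by
    filling in the defaults described above (illustrative only)."""
    parts = ident.split("/")
    if len(parts) == 1:     # my_first_model
        parts.insert(0, current_user)
    if len(parts) == 2:     # an_analyst/my_first_model
        parts.insert(0, project)
    if len(parts) == 3:     # a_database_project/an_analyst/my_first_model
        parts.insert(0, dbms_server)
    return "/".join(parts)

# For example:
#   resolve_trained_model_id("my_first_model",
#                            "a_dbms_server.somewhere.com", "an_analyst")
# returns "a_dbms_server.somewhere.com/sqlflow/an_analyst/my_first_model".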