Auto Hyperparameter Tuning
SQLFlow allows users to specify hyperparameter values via the WITH
clause when training models. However, most users in our survey would prefer that SQLFlow estimate these hyperparameters automatically. This document describes automatic hyperparameter estimation in SQLFlow.
Katib
Katib is a Kubernetes-native system for hyperparameter tuning and neural architecture search. Inspired by Google Vizier, Katib supports multiple machine learning frameworks, for example, TensorFlow, Apache MXNet, PyTorch, and XGBoost. We compared Katib with some other auto hyperparameter tuning systems, and we prefer its Kubernetes-native architecture.
However, Katib, like hyperparameter tuning systems described in the academic literature, is not sufficient for our use case.
The Paradox
To define a training job, a.k.a. an experiment, in Katib, users need to specify the search range of each hyperparameter.
Ironically, specifying this information is an extra burden for users, whereas our goal is to free them from specifying hyperparameters at all.
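For illustration, the search space a Katib experiment requires looks roughly like the following; the field names follow Katib's v1beta1 Experiment API, and the concrete hyperparameters and ranges are only placeholders.

search_space = [
    # One entry per hyperparameter: name, type, and feasible range.
    {"name": "max_depth", "parameterType": "int",
     "feasibleSpace": {"min": "2", "max": "10"}},
    {"name": "num_round", "parameterType": "int",
     "feasibleSpace": {"min": "50", "max": "100"}},
]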
Untangle the Paradox
Boosting Tree Models
For boosting tree models, especially models trained with XGBoost, there is a small group of effective hyperparameters, and we can determine their ranges empirically. We noticed that the following two are the most important: max_depth in the range [2, 10], and num_round in the range [50, 100].
With the introduction of auto hyperparameter tuning, we hope that users no longer need to specify exact num_round and max_depth values; they can optionally give ranges instead, as in the following SQL statement.
SELECT * FROM train_data_table
TO TRAIN a_data_scientist/xgboost:v2/xgboost.gbtree
WITH
objective=multi:softmax,
eta=0.1,
range.num_round=[50, 100],
range.max_depth=[2, 8],
validation_dataset="SELECT * FROM test_data_table;"
LABEL class
INTO my_xgb_model;
Deep Learning Models
For deep learning models, the case is more complicated. Each model has its own set of hyperparameters, and their ranges might vary significantly. Our proposed solution is to utilize the model zoo. In particular, users might train a model defined in the zoo with various datasets, in various experiments, with manually specified hyperparameters. After training, some users might publish their trained models, including the estimated parameters and the specified hyperparameters. Given these published hyperparameter values, we can apply Bayesian optimization for hyperparameter tuning. We are working on such a Bayesian approach that doesn't require explicit specification of hyperparameter ranges, and we plan to contribute it to Katib.
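As a rough sketch of the idea, and assuming a hypothetical model zoo API that returns the hyperparameter values published with trained models, the search range could be derived from those values instead of being asked from the user:

import statistics

def infer_search_range(published_values, margin=0.2):
    # Derive a search range from hyperparameter values published in the model
    # zoo for the same model definition (hypothetical sketch, not the final design).
    lo, hi = min(published_values), max(published_values)
    span = (hi - lo) or abs(hi) or 1.0
    return lo - margin * span, hi + margin * span

published_eta = [0.05, 0.1, 0.1, 0.3]            # values published by other users
eta_range = infer_search_range(published_eta)    # roughly (0.0, 0.35)
eta_start = statistics.median(published_eta)     # a starting point for the search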
Trigger Hyperparameter Tuning
Each model definition has a specification listing its hyperparameters. If the user specified values for all of them, there is no need for tuning; otherwise, SQLFlow should call Katib.
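As a sketch of this check, assuming the model specification exposes its hyperparameter names as a list (the names below are hypothetical):

def need_tuning(spec_hyperparameters, with_attributes):
    # Tune only if the WITH clause leaves at least one declared hyperparameter unset.
    return any(name not in with_attributes for name in spec_hyperparameters)

# The earlier example sets objective and eta but not num_round or max_depth,
# so SQLFlow would call Katib.
need_tuning(["objective", "eta", "num_round", "max_depth"],
            {"objective": "multi:softmax", "eta": 0.1})   # returns True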
The System Design
From Submitter to Couler
So far, SQLFlow has worked by converting a SQL program into a Python program known as a submitter, and then executing the submitter. However, we recently realized that the submitter approach is insufficient for cloud deployment: since Kubernetes might preempt the SQLFlow server, the server could lose the execution status of submitters.
This observation urges us to make the following changes.
- Introduce a workflow engine, namely Argo.
- Make SQLFlow generate a workflow instead of a Python program.
- Have the SQLFlow server submit the workflow to Argo for execution.
- Let Argo manage the status of workflow executions.
Argo takes workflows in the form of YAML files, and it is error-prone to write such YAML files manually. So, we created Couler as an intermediate programmatic representation of workflows.
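For reference, a minimal Couler program might look like the following; this is based on Couler's run_container API at the time of writing, so module paths and signatures may differ in later versions.

import couler.argo as couler

# Each call adds one container step to the workflow; Couler compiles the
# accumulated steps into an Argo workflow YAML and submits it.
couler.run_container(
    image="docker/whalesay",
    command=["cowsay"],
    args=["hello SQLFlow"],
)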
We need to develop a new codegen, codegen_couler.go, for SQLFlow. codegen_couler.go converts the parsed SQL program, a.k.a. the intermediate representation (IR), into a Couler program.
The Integration via Couler
SQLFlow parses each SQL program into an IR, which is a list of statement IRs. codegen_couler.go converts the IR into a Couler program. We need to add a Couler function, couler.sqlflow.train, for the generated Couler program to call.
Consider the following example program.
SELECT * FROM a, b WHERE a.id = b.id INTO c;
SELECT * FROM c TO TRAIN data_scientist/xgboost:v0.5/katib.xgboost.gbtree
WITH objective=multi:softmax, eta=0.1, validation_dataset="select * from d;"
INTO my_xgb_model;
katib.xgboost.gbtree indicates to use Katib to train the model xgboost.gbtree. Then codegen_couler.go might generate the following Couler program.
couler.maxcompute.run("""SELECT * FROM a, b WHERE a.id = b.id INTO c;""")
couler.sqlflow.katib.train(
    model="xgboost",
    hyperparameters={"booster": "gbtree", "objective": "multi:softmax", "eta": 0.1},
    image="data_scientist/xgboost:v0.5",
    sql="select * from c to train data_scientist/xgboost:v0.5/xgboost.gbtree ... ",
    datasource="mysql://...")
couler.sqlflow.katib.train(...)
Considering that Katib itself supports multiple models and frameworks, and more may come in the future, we introduce the following Couler function.
def couler.sqlflow.katib.train(model=None, hyperparameters={}, image=None, sql=None, datasource=None)
The arguments of couler.sqlflow.katib.train are:
- model defines the model to train, e.g., xgboost.
- hyperparameters specifies the hyperparameters of the model given in model.
- image specifies the container image for the Katib tuning job.
- sql is the SQL statement input by the user.
- datasource is the training and validation data source.
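As a hypothetical sketch of the implementation, couler.sqlflow.katib.train could translate its arguments into a Katib experiment definition and register it as one workflow step; launch_katib_experiment below stands for whatever Couler-to-Katib glue code is eventually used.

def train(model=None, hyperparameters=None, image=None, sql=None, datasource=None):
    # Hypothetical sketch: package the arguments into an experiment definition.
    experiment = {
        "image": image,                                   # tuning job Pod image
        "command": ["python", "-m", "runtime.couler.katib.%s_train" % model],
        "fixed_hyperparameters": dict(hyperparameters or {}),
        "env": {"SQLFLOW_SQL": sql, "SQLFLOW_DATASOURCE": datasource},
    }
    launch_katib_experiment(experiment)                   # hypothetical helper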
Run Tuning Job on Katib
For each Katib tuning job, users first need to define the tuning parameters, i.e., each hyperparameter's name, type, and range. At runtime, Katib picks different values for those hyperparameters and starts a Pod for each value set. The tuning job Pods, which run a customized container image, must follow Katib's input format and take the hyperparameter values from Katib to train and evaluate the model.
For example, users may define the following command for the tuning job Pod:
python -m runtime.couler.katib.xgboost_train
The actual command during runtime will be:
python -m runtime.couler.katib.xgboost_train --max_depth 5 ...
where the hyperparameter max_depth is added by Katib.
The pipeline is as follows:
- SQLFlow parses the input SQL statement and extracts the tuning hyperparameters, the image, and the model.
- codegen_couler.go generates couler_submitter.py, which invokes couler.sqlflow.katib.train in the submitter program.
- SQLFlow executes couler_submitter.py, which invokes couler.sqlflow.katib.train to fill in a Katib step in the Argo workflow.
- Argo executes the workflow YAML and creates a Katib tuning job.
- The Katib job starts multiple tuning Pods. Each Pod executes a command such as python -m xgb_train --max_depth 5.
In the last step, xgb_train runs code like the following:
setenv("MAX_DEPTH", 5)
run_cmd("repl -m \"select * ...\" ")