XGBoost on SQLFlow


This design explains how SQLFlow calls XGBoost for training models and prediction.


To explain the benefit of integrating XGBoost with SQLFlow, let us start with an example. The following SQLFlow code snippet shows how users can train an XGBoost tree model named my_xgb_model.

SELECT * FROM train_table
TO TRAIN xgboost.gbtree
LABEL class
INTO my_xgb_model;

The following example shows how to predict using the model my_xgb_model.

SELECT * FROM test_table
TO PREDICT pred_table.result
USING my_xgb_model;

The the above examples,

  • my_xgb_model names the trained model.
  • xgboost.gbtree is the model name, to use a different model provided by XGBoost, use xgboost.gblinear or xgboost.dart, see: here for details.
  • In the WITH clause,
    • objective names an XGBoost learning task
    • keys with the prefix train. identifies parameters of XGBoost API xgboost.train, and
    • keys without any prefix identifies XGBoost Parameters except the objective parameter, which was specified by the identifier after the keyword TO TRAIN, as explained above.

The Code Generator

The code generator codegen_xgboost.go outputs an XGBoost program in Python. It contains the following features:

  1. It tells the SQL engine to run the SELECT statement and retrieve the training/test data. It saves the data into a text file, which could be loaded by XGBoost using the DMatrix interface.
  2. Parse and resolve the WITH clause to fill the xgboost.train arguments and the XGBoost Parameters.
  3. Save the trained model on disk.
  4. For the TO PREDICT clause, it loads the trained model and test data and then outputs the prediction result to a SQL engine.