XGBoost on SQLFlow Tutorial
This is a tutorial on train/predict XGBoost model in SQLFLow, you can find more SQLFlow usage from the Language Guide, in this tutorial you will learn how to:
- Train a XGBoost model to fit the boston housing dataset; and
- Predict the housing price using the trained model;
The Dataset
This tutorial would use the Boston Housing as the demonstration dataset. The database contains 506 lines and 14 columns, the meaning of each column is as follows:
Column | Explain |
---|---|
crim | per capita crime rate by town. |
zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
indus | proportion of non-retail business acres per town. |
chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
nox | nitrogen oxides concentration (parts per 10 million). |
rm | average number of rooms per dwelling. |
age | proportion of owner-occupied units built prior to 1940. |
dis | weighted mean of distances to five Boston employment centres. |
rad | index of accessibility to radial highways. |
tax | full-value property-tax rate per $10,000. |
ptratio | pupil-teacher ratio by town. |
black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
lstat | lower status of the population (percent). |
medv | median value of owner-occupied homes in $1000s. |
We separated the dataset into train/test dataset, which is used to train/predict our model. SQLFlow would automatically split the training dataset into train/validation dataset while training progress.
%%sqlflow
describe boston.train;
%%sqlflow
describe boston.test;
Fit Boston Housing Dataset
First, let’s train an XGBoost regression model to fit the boston housing dataset, we prefer to train the model for 30 rounds
,
and using squarederror
loss function that the SQLFLow extended SQL can be like:
TO TRAIN xgboost.gbtree
WITH
train.num_boost_round=30,
objective="reg:squarederror"
xgboost.gbtree
is the estimator name, gbtree
is one of the XGBoost booster, you can find more information from here.
We can specify the training data columns in COLUMN clause
, and the label by LABEL
keyword:
COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
LABEL medv
To save the trained model, we can use INTO clause
to specify a model name:
INTO sqlflow_models.my_xgb_regression_model
Second, let’s use a standard SQL to fetch the training data from table boston.train
:
SELECT * FROM boston.train
Finally, the following is the SQLFlow Train statement of this regression task, you can run it in the cell:
%%sqlflow
SELECT * FROM boston.train
TO TRAIN xgboost.gbtree
WITH
objective="reg:squarederror",
train.num_boost_round = 30
COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
LABEL medv
INTO sqlflow_models.my_xgb_regression_model;
Predict the Housing Price
After training the regression model, let’s predict the house price using the trained model.
First, we can specify the trained model by USING clause
:
USING sqlflow_models.my_xgb_regression_model
Than, we can specify the prediction result table by TO PREDICT clause
:
TO PREDICT boston.predict.medv
And using a standard SQL to fetch the prediction data:
SELECT * FROM boston.test
Finally, the following is the SQLFLow Prediction statement:
%%sqlflow
SELECT * FROM boston.test
TO PREDICT boston.predict.medv
USING sqlflow_models.my_xgb_regression_model;
Let’s have a glance at prediction results.
%%sqlflow
SELECT * FROM boston.predict;