Explain the Machine Learning Model in SQLFlow
Concept
Although machine learning models are widely used in many fields, they remain mostly black boxes. SHAP (SHapley Additive exPlanations) is widely used by data scientists to explain the output of any machine learning model.
This design doc describes how SQLFlow supports the Explain SQL statement with SHAP as the backend and displays the resulting visualization image to the user.
User Interface
Users usually train a model with a TO TRAIN SQL statement and then explain the trained model with a TO EXPLAIN SQL statement. A simple pipeline looks like:
Train SQL:
SELECT * FROM train_table
TO TRAIN xgboost.Estimator
WITH
train.objective = "reg:linear"
COLUMN x
LABEL y
INTO my_model;
Explain SQL:
SELECT * FROM train_table
TO EXPLAIN my_model
WITH
plots = force
USING TreeExplainer;
where:
- train_table is the table of training data.
- my_model is the trained model.
- force and summary are the visualization methods.
- TreeExplainer is the explainer type.
The Explain SQL displays the resulting visualization image in the Jupyter notebook.

Implementation Details
- Enhance the SQLFlow parser to support the Explain keyword.
- Implement codegen_shap.go to generate a SHAP Python program. The SQLFlow Executor module executes the Python program, which prints the visualization image in HTML format to stdout. The Go program captures this output using CombinedOutput.
- For each Explain SQL request from the SQLFlow magic command, the SQLFlow server responds with the HTML text as a single message, and the visualization image is then displayed in the Jupyter Notebook.
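The generate, execute, and capture flow in the steps above can be mimicked in a few lines. The actual code generator is written in Go; the sketch below uses Python only to illustrate the idea, with subprocess.run and stderr merged into stdout playing the role of Go's CombinedOutput. The template, the function names, and the stand-in program (which prints a placeholder HTML fragment instead of a real SHAP plot) are all hypothetical.

```python
import subprocess
import sys
from string import Template

# Hypothetical stand-in for the generated SHAP program: it only prints an
# HTML fragment, so the flow can be run without SHAP installed.
PROGRAM_TEMPLATE = Template(
    'print("<div>model=$model explainer=$explainer plot=$plot</div>")\n'
)


def generate_program(model, explainer, plot):
    """Fill the template with values parsed from the Explain SQL."""
    return PROGRAM_TEMPLATE.substitute(model=model, explainer=explainer, plot=plot)


def run_and_capture(program):
    """Run the program and return (exit code, stdout and stderr combined)."""
    result = subprocess.run(
        [sys.executable, "-c", program],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
        check=False,
    )
    return result.returncode, result.stdout


program = generate_program("my_model", "TreeExplainer", "force")
exit_code, html = run_and_capture(program)
print(exit_code, html.strip())
```

Because both streams are merged, any error message printed by the program ends up in the same text as the HTML, so a caller would likely want to check the exit code before forwarding the output to the notebook.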
Note
- For the current milestone, SQLFlow only supports DeepExplainer for Keras models and TreeExplainer for XGBoost models. More explainer and model types will be supported in the future.
- We don’t use the more relevant keyword Explain just because Explain is used throughout various SQL databases.
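The supported combinations listed in the first note can be captured in a small lookup that rejects anything else early. This is only a sketch: the keys "keras" and "xgboost" and the function resolve_explainer are hypothetical names, not part of SQLFlow.

```python
# Supported pairs taken from the note above; the dictionary keys are
# hypothetical labels for the model type.
SUPPORTED_EXPLAINERS = {
    "keras": "DeepExplainer",
    "xgboost": "TreeExplainer",
}


def resolve_explainer(model_kind, requested=None):
    """Return the explainer for a model kind, or raise if unsupported."""
    if model_kind not in SUPPORTED_EXPLAINERS:
        raise ValueError(f"no explainer is supported for model kind {model_kind!r}")
    supported = SUPPORTED_EXPLAINERS[model_kind]
    if requested is not None and requested != supported:
        raise ValueError(
            f"{requested} is not supported for {model_kind}; use {supported}"
        )
    return supported


print(resolve_explainer("xgboost", "TreeExplainer"))
```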