Although machine learning models are widely used in many fields, they remain mostly black boxes. SHAP is widely used by data scientists to explain the output of any machine learning model.
This design doc describes how to support `Analyze SQL` in SQLFlow with SHAP as the backend, and how to display the resulting visualization image to the user.
Users typically train a model with a TRAIN SQL statement and then analyze the trained model with an ANALYZE SQL statement. A simple pipeline looks like:
SELECT * FROM train_table TRAIN xgboost.Estimator WITH train.objective = "reg:linear" COLUMN x LABEL y INTO my_model;
SELECT * FROM train_table ANALYZE my_model WITH plots = force USING TreeExplainer
In these statements:
- `train_table` is the table of training data.
- `my_model` is the trained model.
- `force` is the visualization method, given as the value of `plots`.
- `TreeExplainer` is the explainer type.
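To make the parts of the ANALYZE statement concrete, the following Go sketch pulls them out of the example above. It is only an illustration under stated assumptions: the `AnalyzeClause` struct, the `parseAnalyze` function, and the regular expression are all hypothetical and are not SQLFlow's actual parser or grammar.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// AnalyzeClause is a hypothetical container for the parts of an ANALYZE statement.
type AnalyzeClause struct {
	Select    string            // standard SELECT that supplies the data
	Model     string            // trained model name, e.g. my_model
	Attrs     map[string]string // WITH attributes, e.g. plots -> force
	Explainer string            // explainer type after USING, e.g. TreeExplainer
}

// analyzeRe is an illustrative pattern only; it is not SQLFlow's grammar.
var analyzeRe = regexp.MustCompile(
	`(?is)^\s*(SELECT\s.+?)\s+ANALYZE\s+(\w+)\s+WITH\s+(.+?)\s+USING\s+(\w+)\s*;?\s*$`)

// parseAnalyze splits a statement into its SELECT, model, attributes, and explainer.
func parseAnalyze(sql string) (*AnalyzeClause, error) {
	m := analyzeRe.FindStringSubmatch(sql)
	if m == nil {
		return nil, fmt.Errorf("not an ANALYZE statement: %q", sql)
	}
	attrs := map[string]string{}
	for _, kv := range strings.Split(m[3], ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed attribute %q", kv)
		}
		attrs[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
	}
	return &AnalyzeClause{Select: m[1], Model: m[2], Attrs: attrs, Explainer: m[4]}, nil
}

func main() {
	c, err := parseAnalyze(
		"SELECT * FROM train_table ANALYZE my_model WITH plots = force USING TreeExplainer")
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Model, c.Attrs["plots"], c.Explainer)
}
```

A real implementation would extend the existing parser grammar rather than match on a regular expression; the sketch only shows which pieces of information the statement carries.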
The Analyze SQL displays the resulting visualization image in Jupyter Notebook.
- Enhance the SQLFlow parser to support the `Analyze SQL` statement.
- Implement `codegen_shap.go` to generate a SHAP Python program. The SQLFlow `Executor` module executes the Python program, which prints the visualization image in HTML format to stdout. The Go program captures this output using `CombinedOutput`.
- For each `Analyze SQL` request from the SQLFlow magic command, the SQLFlow server responds with the HTML text as a single message, and the visualization image is then displayed in Jupyter Notebook.
- For the current milestone, SQLFlow only supports `DeepExplainer` for Keras models and `TreeExplainer` for XGBoost models. More explainers and model types will be supported in the future.
- We don’t use the more relevant keyword `Explain` because it is already used throughout various SQL databases.