SQLFlow extends the SQL grammar to support data pre-processing using `COLUMN` clauses. For example, we can use `CATEGORY_HASH` to parse a string column into an integer column, which is a common data pre-processing operation in NLP tasks.
For example (with a placeholder table name `train_table`):

```sql
SELECT string_column1, int_column2, class FROM train_table
TO TRAIN xgboost.gbtree
COLUMN INDICATOR(CATEGORY_HASH(string_column1, 10)), int_column2
LABEL class
INTO sqlflow_xgboost_model.my_model;
```
`COLUMN` clauses are supported for SQLFlow TensorFlow models: they are transformed into TensorFlow feature column API calls inside the SQLFlow codegen implementation.
However, XGBoost has no feature column APIs similar to TensorFlow's. Currently, XGBoost models only support simple column names like `c1, c2, c3` in `COLUMN` clauses, and data pre-processing is not supported. As a result, we cannot use XGBoost to train models that accept string columns as input.
This design doc explains how SQLFlow supports feature columns in XGBoost models.
`COLUMN` clauses support the following feature columns, which are implemented by TensorFlow APIs:
| SQLFlow keyword | TensorFlow API | Description |
|-----------------|----------------|-------------|
| `DENSE` | `tf.feature_column.numeric_column` | Raw numeric feature column without any pre-processing |
| `BUCKET` | `tf.feature_column.bucketized_column` | Transforms an input integer into a bucket id according to the given boundaries |
| `CATEGORY_ID` | `tf.feature_column.categorical_column_with_identity` | Identity mapping of an integer feature column |
| `CATEGORY_HASH` | `tf.feature_column.categorical_column_with_hash_bucket` | Uses a hash algorithm to map a string or integer to a category id |
| `SEQ_CATEGORY_ID` | `tf.feature_column.sequence_categorical_column_with_identity` | Sequence-data version of `CATEGORY_ID` |
| `CROSS` | `tf.feature_column.crossed_column` | Combines multiple categorical features using a hash algorithm |
| `INDICATOR` | `tf.feature_column.indicator_column` | Transforms a category id into a multi-hot representation |
| `EMBEDDING` | `tf.feature_column.embedding_column` | Transforms a category id into an embedding representation |
The training process of an XGBoost model inside SQLFlow is as follows:

- Step 1: Read data from the database. Call the `db.db_generator()` method to get a Python generator that yields each row of the database table.
- Step 2: Dump an SVM file. Call the `dump_dmatrix()` method to write the raw data into an SVM file. This file is ready to be loaded as an XGBoost DMatrix.
- Step 3: Train. Load the dumped file as an XGBoost DMatrix and start training. The training process is performed by calling `xgboost.train`.
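Steps 1 and 2 can be sketched in plain Python. The names `fetch_rows` and the exact dump format below are illustrative assumptions, not SQLFlow's actual implementation; the file follows the LibSVM text convention (`<label> <index>:<value> ...`) that `xgboost.DMatrix` can load:

```python
def fetch_rows():
    # Stand-in for db.db_generator(): yields (features, label) per row.
    yield ([1.0, 2.0], 0)
    yield ([0.5, 3.5], 1)


def dump_dmatrix(path, row_gen):
    # Write rows in LibSVM text format: "<label> <index>:<value> ...".
    with open(path, "w") as f:
        for features, label in row_gen:
            fields = ["%g:%g" % (i, v) for i, v in enumerate(features)]
            f.write("%g %s\n" % (label, " ".join(fields)))


dump_dmatrix("train.svm", fetch_rows())
# Step 3 would then be:
#   dtrain = xgboost.DMatrix("train.svm")
#   booster = xgboost.train(params, dtrain, num_boost_round=...)
```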
As discussed in COLUMN clause for XGBoost, there are 3 candidate ways to support feature columns in XGBoost models:
Method 1. Perform the feature column transformation during step 1 and step 2, i.e., pre-process the data before dumping it to the SVM file. This method is suitable for offline training (both standalone and distributed), prediction, and evaluation, since the same transformation Python code can be generated for all three. But it is not suitable for online serving, because online serving usually uses other libraries or languages (like C++/Java), which cannot run the transformation code we generate in SQLFlow Python.
Method 2. Modify the training iteration of XGBoost and insert transformation code into each iteration. However, it is not easy to modify the training iteration of XGBoost. Moreover, it is also unsuitable for online serving, for the same reason as Method 1.
Method 3. Combine data pre-processing and model training into a sklearn pipeline. Since a sklearn pipeline can be saved as a PMML file by sklearn2pmml or Nyoka, this method is suitable for standalone training, offline prediction, offline evaluation, and online serving. Distributed training of a sklearn pipeline can be performed using Dask; however, a distributed training pipeline cannot be saved as a PMML file directly, because Dask mocks the native sklearn APIs rather than using them to build the pipeline, and these mocked APIs cannot be saved. Another problem is that sklearn pipelines support only a few data pre-processing transformers. For example, hashing a string to a single integer is not supported in sklearn. Of course, we could add more data pre-processing transformers, but these newly added transformers could not be saved in a PMML file.
In summary, the most critical question is how the data pre-processing transformers can be saved for online serving. After investigating the online serving platforms in the company (Arks, etc.), we found that data pre-processing steps are usually not saved in PMML or Treelite files; instead, the online serving platform provides plugins for users to choose their data pre-processing steps. Therefore, we prefer Method 1 to implement feature columns in XGBoost models.
The feature column transformers in Python can be implemented as:
```python
class BaseFeatureColumnTransformer(object):
    def __call__(self, inputs):
        raise NotImplementedError()

    def set_column_names(self, column_names):
        self.column_names = column_names


class NumericColumnTransformer(BaseFeatureColumnTransformer):
    # `key` is the column name inside the `SELECT` statement
    def __init__(self, key):
        self.key = key

    def set_column_names(self, column_names):
        BaseFeatureColumnTransformer.set_column_names(self, column_names)
        self.index = self.column_names.index(self.key)

    # `inputs` contains all raw column data;
    # NumericColumnTransformer only takes the column indicated by `index`
    def __call__(self, inputs):
        return inputs[self.index]


# CategoricalColumnTransformer is the base class of all category columns.
# This base class is designed for type checks. For example, `INDICATOR`
# only accepts a category column as its input.
class CategoricalColumnTransformer(BaseFeatureColumnTransformer):
    pass


class BucketizedColumnTransformer(CategoricalColumnTransformer):
    def __init__(self, key, boundaries, default_value=None):
        self.key = key
        self.boundaries = boundaries  # assumed sorted ascending
        self.default_value = default_value

    def set_column_names(self, column_names):
        BaseFeatureColumnTransformer.set_column_names(self, column_names)
        self.index = self.column_names.index(self.key)

    def __call__(self, inputs):
        value = inputs[self.index]
        # Bucket i holds values in [boundaries[i - 1], boundaries[i])
        for idx, boundary in enumerate(self.boundaries):
            if value < boundary:
                return idx
        return len(self.boundaries)


class CrossedColumnTransformer(BaseFeatureColumnTransformer):
    def __init__(self, keys, hash_bucket_size):
        self.keys = keys
        self.hash_bucket_size = hash_bucket_size

    def set_column_names(self, column_names):
        BaseFeatureColumnTransformer.set_column_names(self, column_names)
        self.column_indices = [
            self.column_names.index(key) for key in self.keys
        ]

    def _cross(self, transformed_inputs):
        ...

    def __call__(self, inputs):
        selected_inputs = [inputs[idx] for idx in self.column_indices]
        return self._cross(selected_inputs)


class ComposedFeatureColumnTransformer(BaseFeatureColumnTransformer):
    def __init__(self, *transformers):
        self.transformers = transformers

    def set_column_names(self, column_names):
        BaseFeatureColumnTransformer.set_column_names(self, column_names)
        for t in self.transformers:
            t.set_column_names(column_names)

    def __call__(self, inputs):
        return [t(inputs) for t in self.transformers]
```
For example, the column clause `COLUMN INDICATOR(CATEGORY_HASH(string_column1, 10)), int_column2` would finally be transformed into the Python calls:

```python
transform_fn = ComposedFeatureColumnTransformer(
    IndicatorColumnTransformer(
        CategoryColumnWithHashBucketTransformer(
            key="string_column1", hash_bucket_size=10)),
    NumericColumnTransformer(key="int_column2"))
```
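To make the example concrete, here is a minimal sketch of the two transformers it relies on. The hashing scheme (MD5 of the string modulo the bucket size) is an illustrative assumption, not necessarily SQLFlow's actual implementation:

```python
import hashlib


class CategoryColumnWithHashBucketTransformer(object):
    def __init__(self, key, hash_bucket_size):
        self.key = key
        self.hash_bucket_size = hash_bucket_size

    def set_column_names(self, column_names):
        self.index = column_names.index(self.key)

    def __call__(self, inputs):
        # Hash the string cell into [0, hash_bucket_size).
        digest = hashlib.md5(inputs[self.index].encode("utf-8")).hexdigest()
        return int(digest, 16) % self.hash_bucket_size


class IndicatorColumnTransformer(object):
    def __init__(self, category_column):
        self.category_column = category_column

    def set_column_names(self, column_names):
        self.category_column.set_column_names(column_names)
        self.size = self.category_column.hash_bucket_size

    def __call__(self, inputs):
        # One-hot representation of the category id.
        vec = [0] * self.size
        vec[self.category_column(inputs)] = 1
        return vec


indicator = IndicatorColumnTransformer(
    CategoryColumnWithHashBucketTransformer(
        key="string_column1", hash_bucket_size=10))
indicator.set_column_names(["string_column1", "int_column2"])
onehot = indicator(["hello", 42])
# `onehot` is a length-10 list with a single 1 at the hashed bucket.
```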
Then we pass `transform_fn` to the `runtime.xgboost.train` method. Inside `runtime.xgboost.train`, we transform the raw data from `db.db_generator(...)` by calling the `transform_fn.__call__` method. The `set_column_names` method is called once when the table schema is obtained at runtime, so that the index of each `key` can be inferred in the Python runtime. The transformed data is then written into an SVM file, which can be loaded in the subsequent training step.
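The runtime schema-binding step can be illustrated with a minimal restatement of `NumericColumnTransformer`: the transformer stores only the column `key` at codegen time and resolves the positional `index` once the schema arrives.

```python
class NumericColumnTransformer(object):
    # Minimal restatement for illustration.
    def __init__(self, key):
        self.key = key

    def set_column_names(self, column_names):
        # Resolve the positional index of `key` once the table schema is known.
        self.index = column_names.index(self.key)

    def __call__(self, inputs):
        return inputs[self.index]


t = NumericColumnTransformer(key="int_column2")
t.set_column_names(["string_column1", "int_column2", "class"])  # runtime schema
value = t(["hello", 42, 1])
```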
Another concern is that we must perform the same data pre-processing in the prediction/evaluation stage. Therefore, we should save the feature columns used in training so that they can be loaded in the prediction/evaluation stage. Besides, the codegen for the prediction/evaluation stage should generate the same transformation code as the training stage.
It should be noted that `EMBEDDING` is not supported in this design doc, because the `EMBEDDING` feature column may contain trainable parameters, and these parameters cannot be updated during the XGBoost training process.
XGBoost supports 2 kinds of APIs to train a model:

- `xgboost.train`. We use this API in our current implementation. The returned Booster can be saved in a format that can be loaded by Treelite APIs but not by PMML APIs.
- `xgboost.XGBClassifier/XGBRegressor/XGBRanker`, the scikit-learn style API. Models trained with this API can be saved as PMML files, but it has several drawbacks:
- We must distinguish whether the model is a classifier/regressor/ranker beforehand.
- The constructors of `xgboost.XGBClassifier/XGBRegressor/XGBRanker` mix the Booster parameters and the training parameters together. For example, `booster` is a Booster parameter and `n_estimators` is a training parameter, but both appear in the constructors. This makes it hard to distinguish model parameters from training parameters in the SQLFlow code.
- The names of some parameters in `xgboost.XGBClassifier/XGBRegressor/XGBRanker` are different from those in `xgboost.train`. For example, `n_estimators` in `xgboost.XGBClassifier/XGBRegressor/XGBRanker` is the same as `num_boost_round` in `xgboost.train`.
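A few of the known name differences can be captured in a translation table. The helper below is a sketch, not SQLFlow code, and the mapping is not exhaustive:

```python
# Known name differences between the scikit-learn style constructors
# and xgboost.train's parameter dict (not exhaustive):
SKLEARN_TO_NATIVE = {
    "n_estimators": "num_boost_round",  # a training argument, not a Booster param
    "learning_rate": "eta",
    "reg_lambda": "lambda",
    "reg_alpha": "alpha",
}


def to_native_params(sklearn_params):
    # Translate scikit-learn style names into native ones; num_boost_round
    # must be split out because xgboost.train takes it as a separate argument.
    params = {SKLEARN_TO_NATIVE.get(k, k): v for k, v in sklearn_params.items()}
    num_boost_round = params.pop("num_boost_round", 10)
    return params, num_boost_round


params, rounds = to_native_params(
    {"n_estimators": 100, "learning_rate": 0.3, "max_depth": 6})
```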
Therefore, we prefer to use the `xgboost.train` API to perform training in this design, and to export PMML/Treelite files in the following ways:
- A PMML file can be exported by:
  - Calling `Booster.load_model()` to load the trained model.
  - Checking the Booster objective to build one of `xgboost.XGBClassifier/XGBRegressor/XGBRanker`, and calling its `load_model()` to load the trained model again.
  - Building a sklearn pipeline using the pre-built `xgboost.XGBClassifier/XGBRegressor/XGBRanker` object.
  - Saving the pipeline as a PMML file using sklearn2pmml or Nyoka.
- A Treelite file can be exported via `Model.from_xgboost`, using the Booster returned by `xgboost.train`.