Extend the SPARSE Column Transformer to Load Sparse Vectors in Different Data Formats
Background
In the TO TRAIN
syntax, users can write a COLUMN
clause to specify a table column’s transformation into a model input. Among the supported column transformers, DENSE
and SPARSE
are used to transform the numeric-typed values:
- The
DENSE
transformer assumes that each table cell contains a number or a dense vector. - The
SPARSE
transformer believes that each table cell contains a sparse vector.
In the current implementation, both DENSE
and SPARSE
can parse a string of integers. DENSE
assumes that each value in the string is an element in a dense vector. SPARSE
thinks that each value is an element index, and all element values are 1. For example, DENSE
parses the string “5,6,7” into a three-dimensional dense vector [5,6,7], and SPARSE
parses it into a sparse vector {5:1.0, 6:1.0, 7:1.0}.
Recently, users reported a feature request out of the capability of the SPARSE
implementation. The string-encoded sparse vector is in the key-value form of "0:1.2 1:3.4 2:5.6"
. It represents a sparse vector x
and x[0] = 1.2
, x[1] = 3.4
, x[2] = 5.6
. In this key-value form, the whitespace is optional, and the key-value separator does not have to be a colon.
We can indeed add a new column transformer for this case. But it would make the SQLFlow APIs more complex, and disobey the principle of Occam’s Razor. Therefore, we propose to extend the SPARSE
transformer to support the new case. In the feature derivation stage, SQLFlow should infer the data format automatically, including the element separator like the comma, the key-value separator like the colon, and whether the data format is in the form of "5,6,7"
or "0:1.2 1:3.4 2:5.6"
.
Proposed Design
We would add a field named format
in FieldDesc
. It may be csv
("5,6,7"
), kv
("0:1.2 1:3.4 2:5.6"
), or other data format we would support in the future.
type FieldDesc struct {
Name string
DType int
Delimiter string
Shape []int
IsSparse bool
Format string // indicates the data format
...
}
In the feature derivation stage, we can use a regular expression to infer the data format. To avoid too much regular expression matching, this matching would be done only once when inferring the first row of the fetched samples. That is to say (in pseudo codes):
func InferFeatureColumns() {
rows := FetchSamples()
rowCount := 0
for rows.Next() {
if rowCount == 0 {
format := inferDataFormat(rows.Value()) // Use regular expression to infer the data format
if format == "kv" {
// Fill FieldDesc info when the data format is in the key-value form
// For example: FieldDesc.Format = "kv", FieldDesc.IsSparse = true, etc.
...
} else { // other supported data format
...
}
}
rowCount ++
}
}
In the Python code generation, we would use the data format information inferred in the feature derivation stage.
- For TensorFlow models: we would convert the data in the key-value form into
tensorflow.SparseTensor
for training, prediction, evaluating, and explaining. - For XGBoost models: we would dump the data in the key-value form into LibSVM format files, and then SQLFlow would load the files as
xgboost.DMatrix
for training, prediction, evaluating, and explaining.
SQL Statement Example
The SQL statement to load the sparse data from the table column is:
SELECT * FROM train_table
TO TRAIN xgboost.gbtree
WITH
objective="reg:squarederror",
train.num_boost_round = 30
COLUMN SPARSE(column_name, 10000)
LABEL label
INTO result_table;
Users should write SPARSE(column_name, length)
to indicate that the column column_name
stores the sparse data. We would detect the data format of the column_name
in the feature derivation stage automatically.
The length
parameter in SPARSE
is not required. If users do not provide the length
parameter, we will derive the dense length of the data in the feature derivation stage.