Information Necessary for Code Generators in SQLFlow

SQLFlow extends the syntax of the SELECT statement of SQL to support training a model:

SELECT * FROM kaggle_credit_fraud_training_data
LIMIT 1000
TO TRAIN DNNClassifier       /* a pre-defined TensorFlow estimator, tf.estimator.DNNClassifier */
WITH layers=[100, 200],   /* a parameter of the Estimator class constructor */
     train.batch_size = 8 /* a parameter of the Estimator.train method */
COLUMN *,                 /* all columns as raw features */
       cross(v1, v9, v28) /* plus a derived (crossed) column */
LABEL class
INTO sqlflow_models.my_model_table;      /* saves trained model parameters and features into a table */

Currently, we have the following syntax allowing users to provide necessary information for the training.

SELECT STATEMENT
TO TRAIN ...
WITH ...
COLUMN ... [FOR ...]
LABEL ...
INTO ...

SQLFlow server passes the above information to code generators like sql/codegen.go and sql/codegen_alps.go, which generates the training program may be using different and even conflicting rules.

Things will be even more difficult if we got other kinds of sql/codegen_**.go in the future.

In this document, we summarize information necessary for the code generators.

Necessary Information for Training

Model Name

Model Name is a string written after the keyword of TO TRAIN, which can be the name of a TensorFlow pre-made estimator or the full package path of a customized Estimator/KerasModel.

If the model name is the full package path of a customized model, the model should be installed according to How to install the customized model in SQLFlow.

Model Constructor Parameters

The model will be constructed by codegen according to the module name, and the constructor parameters can be set in the WITH block.

Here are some rules:

Name of a parameter must begin with model. and we take the rest of it as the real name.
To simplify implementation, the value of a parameter must be a type of numeric, string or list.

Take an example for DNNClassifier

TO TRAIN DNNClassifier
WITH
    model.hidden_units = [10, 20, 10]

Training Hyper-Parameters

The training hyper-parameters can be also set in the WITH block.

Here is the list of hyper-parameters planning to support.

	parameter name	type of value	default value	comment
batch size	train.batch_size	integer	512
drop remainder	train.drop_remainder	bool	true
cache	train.cache	bool/string	false	bool: enable cache in memory or not. string: enable cache and the name of a directory on the filesystem to caching tensors.
epoch	train.epoch	integer	1
shard	train.shard	integer	1	distributed training if greater than 1. Fixed size of PS btw.
shuffle	train.shuffle	bool/integer	false	bool: enable shuffle or not. integer: shuffle buffer size.
max_steps	train.max_steps	integer	None
eval steps	eval.steps	integer	None
start delay secs	eval.start_delay_secs	integer	120
throttle_secs	eval.throttle_secs	integer	600

COLUMN FOR

The expressions in COLUMN contains the Feature Columns information. There could be multiple COLUMN blocks in the SQL.

The value of FOR keyword represents which parameter of the constructor method the feature columns assigning to.

For example, the following SQL

...
TO TRAIN
	DNNLinearCombinedClassifier
WITH
	...
COLUMN
	DENSE(...) FOR linear_feature_columns
COLUMN
	BUCKET(...) FOR dnn_feature_columns
...

will be translated to

	estimator = tf.estimator. DNNLinearCombinedClassifier(
		linear_feature_columns = [tf.feature_column.numeric(...)],
		dnn_feature_columns = [tf.feature_column.bucket(...)],
		...
	)

They were not only translated into TensorFlow Feature Columns, but also contains the encoding information of data such as DENSE or SPARSE.

For example, the following SQL

SELECT 
...
COLUMN
	DENSE(c1, ...)
	SPARSE(c2, ...)

represents that the c1 field is the dense format and the c2 field is the sparse format.

Here is a list of supported expression:

expression	arguments of expression	example
dense	1. field name (str) 2. dense shape (list of integer) 3. separator (str)	dense(c1, [100, 200], comma)
sparse	1. field name (str) 2. sparse shape (integer) 3. separator (str)	sparse(c2, 10000, comma)
bucket	1. key (numeric) 2. bucket size (integer)	bucket(numeric(c1, 100), 20)
cat_id	1. field name (str) 2. bucket size (integer)	cat_id(c2, 10000)
embedding	1. key (cat_id) 2. dimension (integer) 3. combiner (str)	embedding(cat_id(c2, 10000), mean)

Independent Module for Resolving of Training Parameters

Since the format of training parameters has been unified, it’s better to have an independent module in SQLFlow to do the resolving according to the rules instead of doing it in each codegen_**.go.

The advantage of an independent module contains

Unifying resolving process to avoid differences in processing between codegen modules.
Fast fail during resolving before the code generation.

Here we propose a module such as resolving.go which take the trainClause as input and outputs a struct named resolvedTrainClause for all kinds of codegen_**.go to do code generation.

The struct named resolvedTrainClause looks like this

package sql

type resolvedTrainClause struct {
	IsPreMadeModel  bool
	ModelName   string
	ModelConstructorParameters map[string]interface{}
	BatchSize int
	DropRemainder bool
	EnableCache bool
	CachePath string
	Epoch int
	Shard int
	EnableShuffle bool
	ShuffleBufferSize int
	MaxSteps int
	EvalSteps int
	EvalStartDelay int
	EvalThrottle int
	FeatureColumns map[string][]featureColumn
	FeatureSpecs map[string][]featureSpec
}