Model Metadata Storage

Model metadata in SQLFlow is data that describes how a model is defined, trained, and stored. It includes the original training data selection statement, the estimator and its hyperparameters, the test set performance, and so on. This article describes the data structure of the metadata and how we save it to and load it from the various kinds of sqlfs.

Background

SQLFlow models can be saved in MySQL, Hive, OSS, and other places. A model can be trained locally, alongside the step Docker image, or remotely from the step image, as a job on a third-party platform. As a result, we currently do not have a unified way to store models. The default submitter, which may use MySQL or Hive as its data source, packs the model and some metadata into a zipped file and then stores that file in the database. In this case only one piece of metadata is saved: the TrainSelect SQL statement. The PAI submitter stores the model on OSS with more metadata, such as the Estimator and the FeatureColumnNames.

Some time ago, we implemented the SHOW TRAIN statement, which displays the metadata to the user. More recently, we have been developing the Model Zoo. When releasing a trained model to the Model Zoo, we are required to send the metadata along with the model. Both features require us to enrich the stored metadata, unify its data structure, and make it easy to use.

The Design

First, we no longer save the model metadata from the step Go code, because the actual training may run remotely from this image. Instead, we move the saving work to the Python code that does the actual training. A file named model_meta.json is dedicated to storing the metadata. Basically, we can serialize all fields of the Train IR to this file. Additionally, the evaluation result is stored if it exists.

{
  "OriginalSQL": "SELECT * from train to TRAIN ...",
  "Estimator": "DNNClassifier",
  "Attributes":{"model.hidden_units":[10,10],"model.n_classes":3,"train.batch_size":1},
  "FeatureColumns":{"feature_columns":[{"FieldDesc":{"name":"sepal_length","dtype":1,"delimiter":"","shape":[1],"is_sparse":false,"vocabulary":null,"MaxID":0}}]},
  "FieldDescs": {"sepal_length": {"name":"sepal_length","dtype":1,"delimiter":"","shape":[1],"is_sparse":false,"vocabulary":null,"MaxID":0}},
  "EvaluationResult": "{auc:0.88}"
  ...
}
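As a minimal sketch of the saving step on the Python side, the metadata can be treated as a plain dictionary and written with the standard json module. The function names save_model_metadata and load_model_metadata are hypothetical and not part of SQLFlow; only the file name model_meta.json and the field names come from the example above.

```python
import json
import os


def save_model_metadata(model_dir, metadata, filename="model_meta.json"):
    """Write the metadata dict as JSON into model_dir and return the file path.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, filename)
    with open(path, "w") as f:
        json.dump(metadata, f)
    return path


def load_model_metadata(model_dir, filename="model_meta.json"):
    """Read the metadata JSON back from model_dir as a dict."""
    with open(os.path.join(model_dir, filename)) as f:
        return json.load(f)
```

The training code would call save_model_metadata once training (and, if present, evaluation) has finished, passing fields such as "Estimator" and "Attributes" along with "EvaluationResult" when it exists.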

Then model_meta.json is zipped together with the model data and stored in the database. As an exception, we do not zip the files on OSS storage; we just put the file together with the model data in a directory. In both cases, the directory structure of the trained model looks like this:

model_dir
  |_ model_meta.json
  |_ model_data
  |_ other files
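The packing step for the zipped case could look roughly like the sketch below. It assumes the archive is a gzip-compressed tar file and that the contents of model_dir are placed at the archive root, so that model_meta.json can later be addressed by its bare name. The function name pack_model_dir is hypothetical, and the archive path is assumed to lie outside model_dir.

```python
import os
import tarfile


def pack_model_dir(model_dir, archive_path):
    """Pack every entry of model_dir into a .tar.gz at the archive root.

    Hypothetical helper; archive_path must not be inside model_dir.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            # arcname drops the model_dir prefix, so model_meta.json
            # sits at the top level of the archive.
            tar.add(os.path.join(model_dir, name), arcname=name)
    return archive_path
```

Keeping the metadata file at the archive root means a reader can pull out that single member without unpacking the model data.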

When releasing a trained model to the Model Zoo, we can dump the zipped model directory to the local file system and then extract the metadata using the command:

tar -xvpf model.tar.gz model_meta.json

Other use cases, such as the SHOW TRAIN statement, can extract the model metadata in the same way.
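For callers written in Python, the same extraction can be sketched with the standard tarfile module, reading only the metadata member instead of unpacking the whole archive. The function name read_metadata_from_archive is hypothetical, and the sketch assumes a gzip-compressed tar archive with model_meta.json at its root, as in the tar command above.

```python
import json
import tarfile


def read_metadata_from_archive(archive_path, member="model_meta.json"):
    """Return the metadata dict stored in the archive, without extracting other files.

    Hypothetical helper; tarfile raises KeyError if the member is missing.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        f = tar.extractfile(member)
        if f is None:
            # extractfile returns None for members that are not regular files.
            raise ValueError(member + " is not a regular file in the archive")
        with f:
            return json.load(f)
```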

Implementation Action

First, we implement this feature for the default submitter, which stores the model in data storage such as MySQL, Hive, or MaxCompute. Then we implement the feature for OSS storage, which is not really a database.