Compatibility with Various SQL Engines

SQLFlow interacts with SQL engines like MySQL and Hive, while different SQL engines use variants of SQL syntax, it is important for SQLFlow to have an abstraction layer that hides such differences.

SQLFlow calls Go’s standard database API. The submitter programs generated by SQLFlow call Python’s database API. Both APIs abstract the interface to various SQL engines; however, they are insufficient for SQLFlow to work. In this document, we examine all interactions between SQLFlow and the SQL engine so to identify what SQLFlow authors have to abstract in addition to calling Go’s and Python’s database APIs.

Data Operations in Go

Data Retrieval

The basic idea of SQLFlow is to extend the SELECT statement of SQL to have the TRAIN and PREDICT clauses. For more discussion, please refer to the syntax design. SQLFlow translates such “extended SQL statements” into submitter programs, which forward the part from SELECT to TO TRAIN or TO PREDICT, which we call the “standard part”, to the SQL engine. SQLFlow also accepts the SELECT statement without TO TRAIN or TO PREDICT clauses and would forward such “standard statements” to the engine. It is noticeable that the “standard part” or “standard statements” are not standardized. For example, various engines use different syntax for FULL OUTER JOIN.

Hive supports FULL OUTER JOIN directly.
MySQL doesn’t have FULL OUTER JOIN. However, a user can emulates FULL OUTER JOIN using LEFT JOIN, UNION and RIGHT JOIN.

Fortunately, as SQLFlow forwards the above parts to the engine, it doesn’t have to care much about the differences above.

Metadata Retrieval

To verify the semantics of users’ inputs, SQLFlow needs to retrieve the schema of tables. For example, the input might be

SELECT 
      name,
      age,
      income 
FROM  employee TO TRAIN DNNRegressor 
WITH  hidden_layers=[10,50,10] 
COLUMN name, agee LABEL income;

In the above example, the user misspelled the field name age in the COLUMN clause as “agee”. SQLFlow must be able to find that out.

To do that, SQLFlow needs to query the field names from the SQL engine. However, different engines use various syntax. For example:

MySQL: DESCRIBE/DESC employee;
Hive: DESCRIBE FORMATTED employee;
ODPS: DESC employee;

The returned data format varies too. Our solution to avoid such differences is not-to-use-them; instead, SQLFlow retrieves the table schema by running a query like SELECT * FROM employee LIMIT 1; and inferring field types using the mechanism called DatabaseTypeName provided by SQL engines drivers beneath the Go’s standard database API.

Prepare Prediction Table

A SQLFlow prediction job writes its prediction results into a table. It prepares the prediction table by

Dropping previous prediction table DROP TABLE IF EXISTS my_table;
Creating table with schema CREATE TABLE my_table (name1, type1, name2 type2);

Most SQL engines, including MySQL, Hive, ODPS, support both statements.

Translate Database Column Type to TensorFlow Feature Column Type

After retrieving database column type name through DatabaseTypeName, we can derive TensorFlow’s feature column type via a mapping such as {"FLOAT", "DOUBLE"} -> tf.numeric_column.

Save Model

SQLFlow saves trained ML model by dumping the serialized the model directory into a table. It first creates a table by CREATE TABLE IF NOT EXISTS %s (id INT AUTO_INCREMENT, block BLOB, PRIMARY KEY (id)) and insert blobs by INSERT INTO %s (block) VALUES(?).

Note that Hive and ODPS doesn’t have BLOB type, we need to use BINARY (docs at ODPS, Hive) instead.

Also, note that Hive and ODPS doesn’t support AUTO_INCREMENT, we need to implemented auto increment logic in sqlfs.

Load Model

SQLFlow loads trained ML model by reading rows in a table and deserializing the blob to a model directory.

It reads rows by running SELECT block FROM %s ORDER BY id, which is supported by most databases.

Data Operations in Python

Connect to SQL Engines

Thanks to the Python database API, connecting to different databases follows a similar API.

conn = mysql.connector.connect(user='scott', password='password',
                               host='127.0.0.1',
                               database='employees')
conn = pyhive.connect('localhost')

cursor = conn.cursor()
cursor.execute('select * from my_table;')

Insert Prediction Result into Prediction Table

Python database API provides execute_many(sql, value) to insert multiple values at once. So one can prepare the following insertion statement. Please be aware that MySQL use INSERT INTO to insert rows while Hive and ODPS use INSERT INTO TABLE.

-- MySQL
INSERT INTO table_name VALUES (value1, value2, value3, ...);
-- Hive, ODPS
INSERT INTO TABLE table_name VALUES (value1, value2, value3, ...);