SQLFlow interacts with SQL engines like MySQL and Hive, while different SQL engines use variants of SQL syntax, it is important for SQLFlow to have an abstraction layer that hides such differences.
SQLFlow calls Go’s standard database API. The submitter programs generated by SQLFlow call Python’s database API. Both APIs abstract the interface to various SQL engines; however, they are insufficient for SQLFlow to work. In this document, we examine all interactions between SQLFlow and the SQL engine so to identify what SQLFlow authors have to abstract in addition to calling Go’s and Python’s database APIs.
The basic idea of SQLFlow is to extend the SELECT statement of SQL to have the TRAIN and PREDICT clauses. For more discussion, please refer to the syntax design. SQLFlow translates such “extended SQL statements” into submitter programs, which forward the part from SELECT to TO TRAIN or TO PREDICT, which we call the “standard part”, to the SQL engine. SQLFlow also accepts the SELECT statement without TO TRAIN or TO PREDICT clauses and would forward such “standard statements” to the engine. It is noticeable that the “standard part” or “standard statements” are not standardized. For example, various engines use different syntax for
FULL OUTER JOIN.
- Hive supports
FULL OUTER JOINdirectly.
- MySQL doesn’t have
FULL OUTER JOIN. However, a user can emulates
FULL OUTER JOINusing
Fortunately, as SQLFlow forwards the above parts to the engine, it doesn’t have to care much about the differences above.
To verify the semantics of users’ inputs, SQLFlow needs to retrieve the schema of tables. For example, the input might be
SELECT name, age, income FROM employee TO TRAIN DNNRegressor WITH hidden_layers=[10,50,10] COLUMN name, agee LABEL income;
In the above example, the user misspelled the field name
age in the COLUMN clause as “agee”. SQLFlow must be able to find that out.
To do that, SQLFlow needs to query the field names from the SQL engine. However, different engines use various syntax. For example:
DESCRIBE FORMATTED employee;
The returned data format varies too. Our solution to avoid such differences is not-to-use-them; instead, SQLFlow retrieves the table schema by running a query like
SELECT * FROM employee LIMIT 1; and inferring field types using the mechanism called DatabaseTypeName provided by SQL engines drivers beneath the Go’s standard database API.
A SQLFlow prediction job writes its prediction results into a table. It prepares the prediction table by
- Dropping previous prediction table
DROP TABLE IF EXISTS my_table;
- Creating table with schema
CREATE TABLE my_table (name1, type1, name2 type2);
Most SQL engines, including MySQL, Hive, ODPS, support both statements.
SQLFlow saves trained ML model by dumping the serialized the model directory into a table. It first creates a table by
CREATE TABLE IF NOT EXISTS %s (id INT AUTO_INCREMENT, block BLOB, PRIMARY KEY (id)) and insert blobs by
INSERT INTO %s (block) VALUES(?).
Also, note that Hive and ODPS doesn’t support
AUTO_INCREMENT, we need to implemented auto increment logic in
SQLFlow loads trained ML model by reading rows in a table and deserializing the blob to a model directory.
It reads rows by running
SELECT block FROM %s ORDER BY id, which is supported by most databases.
Thanks to the Python database API, connecting to different databases follows a similar API.
conn = mysql.connector.connect(user='scott', password='password', host='127.0.0.1', database='employees') conn = pyhive.connect('localhost') cursor = conn.cursor() cursor.execute('select * from my_table;')
Python database API provides
execute_many(sql, value) to insert multiple values at once. So one can prepare the following insertion statement. Please be aware that MySQL use
INSERT INTO to insert rows while Hive and ODPS use
INSERT INTO TABLE.
-- MySQL INSERT INTO table_name VALUES (value1, value2, value3, ...); -- Hive, ODPS INSERT INTO TABLE table_name VALUES (value1, value2, value3, ...);