Transform the Column with Key-value Pairs String to Wide Table
This document describes the design to normalize (or transform) the table schema to be wide, aka one feature per column. The transformation is a part of data analysis and transform.
Motivation
Generally, the data developers parse logs to features and the number of features may vary. It is convenient to save features using key-value pair string as a column in a table. However, it is difficult to analyze features in the table using SQL syntax. So we need to transform the column to a wide table. Then we can analyze each features like calculate the mean and variance for the numeric feature. For example, the key-value column is
features | label |
---|---|
age:19,income:1200,education:Master | 1 |
age:23,income:4500,education:Doctor | 0 |
And we need to transform it to
age | income | education | label |
---|---|---|---|
19 | 1200 | Master | 1 |
23 | 4500 | Doctor | 0 |
Design Components
ODPS UDF to Transform Key-value Pairs String to Multiple Values
ODPS provides UDF for users to define transformation using Python. We choose the UDTF in UDF to implement the transformation because it can match any input parameters in SQL.
Here is a UDTF demo.
#coding:utf-8
from odps.udf import annotate
from odps.udf import BaseUDTF
class Explode(BaseUDTF):
def process(self, arg):
...
For the UDTF to transform key-value pairs, we can implement a UDTF with multiple arguments. The arguments in the UDTF likes
udtf_func(kv_column, append_columns, feature_names, inter_kv_separator, intra_kv_separator)
The “kv_column” is the column name with key-value pairs strings.
The “append_columns” is column names which are directly exported without any
transformation.
The “feature_names” is the a string with feature names separated by “,”.
The “inter_kv_separator” is the inter key-value pairs separator.
The “intra_kv_separator” is the intra key-value pairs separator.
For the table showed in the motivation section, The SQL with the UDTF is
SELECT udtf_func(features, label, "age, income, education", ",", ":") as (age, income, education, label) FROM input_table
The implementation of the UDF is
#coding:utf-8
from odps.udf import annotate
from odps.udf import BaseUDTF
class Explode(BaseUDTF):
def process(self, *args):
feature_names = args[-3].split(",")
inter_kv_sep = args[-2]
intra_kv_sep = args[-1]
feature_values = parse_kv_string_to_dict(
args[0],
feature_names,
inter_kv_sep,
intra_kv_sep
)
for value in args[1:-3]:
feature_values.append(str(value))
self.forward(*feature_values)
Infer the Feature Names and Generate the SQL to Transform
If the number of features in key-value pairs string is very large, it is tedious for users to write the feature names in the transformation SQL. Using Pyodps, we can infer all feature names from the records in the table and then generate the SQL.
By default, we download 100 records from the ODPS table by pyodps tunnel and then parse feature names from those records.
feature_names = set()
for record in table.head(100):
kv_dict = parse_kv_string_to_dict()
feature_names.update(kv_dict.keys())
After inferring the feature names, we can generate the SQL by the template:
TRANSFORM_SQL_TEMPLATE = "CREATE TABLE IF NOT EXISTS {output_table} LIFECYCLE 7 AS \n\
SELECT \n\
{udf} \n\
FROM {input_table}"
The template of “udf” in the SQL template is:
"""{udf_func}({input_col_str},
"{features_str}", "{inter_sep}", "{intra_sep}")
as ({output_col_str})
"""
Execute the Generated SQL by PyODPS
The steps to execute the generated SQL by PyODPS are:
- Create an ODPS resource using the UDTF python file.
- Create an ODPS function using the resource.
- Using PyODPS to submit the SQL to ODPS cluster.