Customize Data Analysis on SQLFlow Tutorial

Data analysis can help us understand what is in the dataset and the characteristics of the data.

Data binning is a commonly-used data analysis technique. It groups continous values into a small number of discretized bins. We will get the distribution of the data from the binning result.

We can use SQLFlow TO RUN statement to call the SQLFlow runnable which is released in the form of Docker image. SQLFlow provides some premade runnables in sqlflow/runnable including the binning runnable. Please use the following SQL statement to do the data binning. All the table columns specified in the --column parameters will be bucketized to 10 bins.

%%sqlflow
SELECT * FROM creditcard.creditcard
TO RUN sqlflow/runnable:v0.0.1
CMD "binning.py",
    "--dbname=creditcard",
    "--columns=time,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount",
    "--bin_method=bucket",
    "--bin_num=10"
INTO creditcard_binning_result;

The result table contains the binning boundaries, proability distribution for each bin and also some common used statistical results.

%%sqlflow
SELECT * FROM creditcard.creditcard_binning_result LIMIT 10;

What’s more, we can also use two_dim_binning runnable to calculate the 2D distribution for the combination of two variables. In the following SQL statement, v1 will bucketized to 10 bins and v2 will be bucketized to 5 bins.

%%sqlflow
SELECT * FROM creditcard.creditcard
TO RUN sqlflow/runnable:v0.0.1
CMD "two_dim_binning.py",
    "--dbname=creditcard",
    "--columns=v1,v2",
    "--bin_method=bucket,bucket",
    "--bin_num=10,5"
INTO creditcard_stats_result,creditcard_two_dim_prob,creditcard_two_dim_binning_cumsum_prob;