This tutorial describes how to train a Clustering model using the Active Power Consumption dataset.
The Clustering model is designed to support the unsupervised learning model on SQLFlow. In this tutorial, you will learn how to:
- Train a clustering model based on deep embedding neural network on active power consumption dataset;
- Predict the patterns of the unlabeled data using the trained model.
- Clustering the date pairs according to the characteristics of power consumption. We can identify the difference in different categories.
The preprocessed data contains 50 fields. The first field is the date column, and the last field is the pattern to be predicted. The other fields are power consumption data at different times. The time interval is half an hour, and each power consumption data is a numeric feature.
Here are the column descriptions of the dataset:
|m*||The amount of power consumed in different periods every day. The time interval is half an hour. For example, m1 represents the amount of power consumed from 00:00:00 to 00:30:00.|
|class||The result group after clustering.|
We can have a quick peek of the data by running the following standard SQL statements.
%%sqlflow describe activepower.train;
%%sqlflow select * from activepower.train limit 1;
Let’s train a clustering model.
To specify the data to be clustered, we use standard SQL statements like
SELECT * FROM activepower.train;
We can specify the training configurations in the WITH clause. For example, we can set the number of clustering categories to 3, pre-train epochs to 10, etc..
TO TRAIN DeepEmbeddingClusterModel WITH model.n_clusters=3, model.pretrain_epochs=10, model.train_max_iters=800, model.pretrain_lr=1, model.train_lr=0.01, train.batch_size=256
We specify the columns for clustering in the COLUMN clause and the model table name in the INTO clause.
COLUMN m1,m2,m3,m4,m5,m6,m7,m8,m9,m10... INTO sqlflow_models.my_customized_model
Putting it all together, the following is the SQLFlow TO TRAIN statement of this clustering task. You can run it in the cell:
%%sqlflow SELECT * FROM activepower.train TO TRAIN sqlflow_models.DeepEmbeddingClusterModel WITH model.n_clusters=3, model.pretrain_epochs=10, model.train_max_iters=800, model.train_lr=0.01, model.pretrain_lr=1, train.batch_size=256 COLUMN m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12,m13,m14,m15,m16,m17,m18,m19,m20,m21,m22,m23,m24,m25,m26,m27,m28,m29,m30,m31,m32,m33,m34,m35,m36,m37,m38,m39,m40,m41,m42,m43,m44,m45,m46,m47,m48 INTO sqlflow_models.my_customized_model;
After training the clustering model, let’s predict the patterns of the train data using the trained model.
Firstly, we can use a standard SQL to fetch the prediction data:
SELECT * FROM activepower.train.
Next, we can specify the prediction result table by TO PREDICT clause:
TO PREDICT activepower.predict.class
Then, we can specify the trained model by USING clause:
Putting it all together, the following is the SQLFLow Prediction statement:
%%sqlflow SELECT * FROM activepower.train TO PREDICT activepower.predict.class USING sqlflow_models.my_customized_model;
Let’s have a glance at prediction results.
%%sqlflow SELECT * FROM activepower.predict limit 5;
We can use the SQL statement to explore the number of each category after clustering.
%%sqlflow select class, count(*) as count from activepower.predict group by class;
To explore the differences in power-consuming features among clustered categories, we can use SQL statements to aggregate the average power data at different o’clock.
%%sqlflow select class ,avg(m1) as clock0 ,avg(m8) as clock4 ,avg(m16) as clock8 ,avg(m24) as clock12 ,avg(m32) as clock16 ,avg(m40) as clock20 ,avg(m46) as clock23 from activepower.predict group by class;