From 7e505b52850fd32feea0bcad8696494048b24e9a Mon Sep 17 00:00:00 2001
From: "mingliang.gml"
Date: Wed, 12 Feb 2020 07:19:32 +0800
Subject: [PATCH 1/6] For the COLUMN clause syntax part, add an example for
 wide and deep model.

---
 docs/designs/data_transform.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/docs/designs/data_transform.md b/docs/designs/data_transform.md
index 9b51ae03b..cf4b701dd 100644
--- a/docs/designs/data_transform.md
+++ b/docs/designs/data_transform.md
@@ -191,6 +191,24 @@ LABEL label
 It trains a DNN model to classify someone's income level using the [census income dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income). The transform expression is `COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age)), EMBEDDING(BUCKETIZE(hours_per_week, bucket_num=5), dim=32)`. It will normalize the column *capital_gain*, standardize the column *age*, bucketize the column *hours_per_week* into 5 buckets, and then map it to an embedding vector.
 
+Next, let's see a more complicated scenario. The following SQL statement trains a [wide and deep model](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html) using the same dataset.
+
+```SQL
+SELECT *
+FROM census_income
+TO TRAIN WideAndDeepClassifier
+COLUMN
+  EMBEDDING(CONCAT(VOCABULARIZE(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1, 8),
+  EMBEDDING(CONCAT(HASH(education), HASH(occupation), VOCABULARIZE(marital_status), VOCABULARIZE(relationship)) AS group_2, 8),
+  EMBEDDING(CONCAT(BUCKETIZE(age, bucket_num=5), HASH(native_country), VOCABULARIZE(race), VOCABULARIZE(sex)) AS group_3, 8)
+  FOR deep_embeddings
+COLUMN
+  EMBEDDING(group_1, 1),
+  EMBEDDING(group_2, 1)
+  FOR wide_embeddings
+LABEL label
+```
+
+*Please check the [discussion](https://github.com/sql-machine-learning/elasticdl/issues/1664).*
+
 SQLFlow will convert the `COLUMN` expression to Python code of data transformation. But it requires some parameters that are derived from the data, so we will do the analysis work next.
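To make the wide-and-deep `COLUMN` clauses above concrete, here is a minimal sketch of the kind of TensorFlow feature-column code they could map to. It is an illustration rather than SQLFlow's actual generated code: the `num_buckets` id-space sizes and the hidden-unit sizes are hypothetical. Note that a 1-dimensional embedding on the wide side learns exactly one weight per categorical id, which is what the linear part of a wide and deep model needs.

```python
import tensorflow as tf

# Hypothetical id-space sizes after CONCAT merges each group's categorical ids.
group_1 = tf.feature_column.categorical_column_with_identity("group_1", num_buckets=26)
group_2 = tf.feature_column.categorical_column_with_identity("group_2", num_buckets=220)
group_3 = tf.feature_column.categorical_column_with_identity("group_3", num_buckets=112)

# Deep side: EMBEDDING(group_i, 8) maps each merged id space to 8-dim dense vectors.
deep_embeddings = [
    tf.feature_column.embedding_column(g, dimension=8)
    for g in (group_1, group_2, group_3)
]

# Wide side: EMBEDDING(group_i, 1) learns a single weight per id, i.e. the
# linear ("wide") part; only group_1 and group_2 are used, mirroring the SQL.
wide_embeddings = [
    tf.feature_column.embedding_column(g, dimension=1)
    for g in (group_1, group_2)
]

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_embeddings,
    dnn_feature_columns=deep_embeddings,
    dnn_hidden_units=[64, 32],  # hypothetical sizes
)
```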
From 0d2bd1b34213bfe748c841db58244402f8a7133d Mon Sep 17 00:00:00 2001
From: "mingliang.gml"
Date: Wed, 12 Feb 2020 19:59:34 +0800
Subject: [PATCH 2/6] Add the Transform function api design

---
 docs/designs/data_transform.md | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/docs/designs/data_transform.md b/docs/designs/data_transform.md
index cf4b701dd..226b1cc31 100644
--- a/docs/designs/data_transform.md
+++ b/docs/designs/data_transform.md
@@ -168,15 +168,18 @@ After normalizing the table schema, we can do data analysis and transformation o
 We can extend the SQLFlow syntax and enrich the `COLUMN` expression. We propose to add some built-in functions to describe the transform process. We will implement commonly used functions in the first stage.
 
-| Name          | Transformation                                         | Statitical Parameter |
-|:-------------:|:------------------------------------------------------:|:--------------------:|
-| NORMALIZE     | x - x_min / (x_max - x_min)                            | x_min, x_max         |
-| STANDARDIZE   | x - x_mean / x_stddev                                  | x_mean, x_stddev     |
-| LOG_ROUND     | tf.round(tf.log(x))                                    | N/A                  |
-| BUCKETIZE     | tf.feature_column.bucketized_column                    | bucket_boundary      |
-| HASH          | tf.feature_column.categorical_column_with_hash_bucket  | hash_bucket_size     |
-| CROSS         | tf.feature_column.crossed_column                       | N/A                  |
-| EMBEDDING     | tf.feature_column.embedding_column                     | N/A                  |
+| Name | Transformation | Statistical Parameter | Input Type | Output Type |
+|:----------------:|:-----------------------------------------------------:|:--------------------:|:-------------------:|:---------------------:|
+| NORMALIZE(x) | Scale to the range [0, 1]. `out = (x - x_min) / (x_max - x_min)` | x_min, x_max | number | float64 |
+| STANDARDIZE(x) | Scale to z-scores by subtracting the mean and dividing by the standard deviation. `out = (x - x_mean) / x_stddev` | x_mean, x_stddev | number | float64 |
+| BUCKETIZE(x, num_buckets, boundaries) | Transform the numeric features into categorical ids using a set of thresholds. | boundaries | Number | int64 |
+| HASH_BUCKET(x, hash_bucket_size) | Map the inputs into a finite number of buckets by hashing. `out_id = Hash(input_feature) % bucket_size` | hash_bucket_size | string, int32, int64 | int64 |
+| VOCABULARIZE(x) | Map the input to integer ids by looking up the vocabulary. | vocabulary_list | string, int32, int64 | int64 |
+| EMBEDDING(x, dimension) | Map the input to embedding vectors. | N/A | int32, int64 | float32 |
+| CROSS(x1, x2, ..., xn, hash_bucket_size) | `Hash(cartesian product of features) % hash_bucket_size` | N/A | string, number | int64 |
+| CONCAT(x1, x2, ..., xn) | Concatenate multiple tensors representing categorical ids into one tensor. | N/A | int32, int64 | int64 |
+
+*Please check more [discussion](https://github.com/sql-machine-learning/elasticdl/issues/1723) about the `CONCAT` transform function.*
 
 Let's take the following SQLFlow statement as an example.
 
@@ -185,11 +188,11 @@ SELECT *
 FROM census_income
 TO TRAIN DNNClassifier
 WITH model.hidden_units = [10, 20]
-COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age)), EMBEDDING(BUCKETIZE(hours_per_week, bucket_num=5), dim=32)
+COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age)), EMBEDDING(BUCKETIZE(hours_per_week, num_buckets=5), dimension=32)
 LABEL label
 ```
 
-It trains a DNN model to classify someone's income level using the [census income dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income). The transform expression is `COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age)), EMBEDDING(BUCKETIZE(hours_per_week, bucket_num=5), dim=32)`. It will normalize the column *capital_gain*, standardize the column *age*, bucketize the column *hours_per_week* into 5 buckets, and then map it to an embedding vector.
+It trains a DNN model to classify someone's income level using the [census income dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income). The transform expression is `COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age)), EMBEDDING(BUCKETIZE(hours_per_week, num_buckets=5), dimension=32)`. It will normalize the column *capital_gain*, standardize the column *age*, bucketize the column *hours_per_week* into 5 buckets, and then map it to an embedding vector.
 
 Next, let's see a more complicated scenario. The following SQL statement trains a [wide and deep model](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html) using the same dataset.
@@ -198,9 +201,9 @@ SELECT *
 FROM census_income
 TO TRAIN WideAndDeepClassifier
 COLUMN
-  EMBEDDING(CONCAT(VOCABULARIZE(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1, 8),
+  EMBEDDING(CONCAT(VOCABULARIZE(workclass), BUCKETIZE(capital_gain, num_buckets=5), BUCKETIZE(capital_loss, num_buckets=5), BUCKETIZE(hours_per_week, num_buckets=6)) AS group_1, 8),
   EMBEDDING(CONCAT(HASH(education), HASH(occupation), VOCABULARIZE(marital_status), VOCABULARIZE(relationship)) AS group_2, 8),
-  EMBEDDING(CONCAT(BUCKETIZE(age, bucket_num=5), HASH(native_country), VOCABULARIZE(race), VOCABULARIZE(sex)) AS group_3, 8)
+  EMBEDDING(CONCAT(BUCKETIZE(age, num_buckets=5), HASH(native_country), VOCABULARIZE(race), VOCABULARIZE(sex)) AS group_3, 8)
   FOR deep_embeddings
 COLUMN
   EMBEDDING(group_1, 1),
@@ -209,8 +212,6 @@ COLUMN
 LABEL label
 ```
 
-*Please check the [discussion](https://github.com/sql-machine-learning/elasticdl/issues/1664).*
-
 SQLFlow will convert the `COLUMN` expression to Python code of data transformation. But it requires some parameters that are derived from the data, so we will do the analysis work next.
 
 ### Generate Analysis SQL From SQLFlow Statement
@@ -268,7 +269,7 @@ We plan to implement the following commonly used transform APIs in the first step.
 
 | STANDARDIZE(x) | numeric_column({var_name}, normalizer_fn=lambda x : (x - {mean}) / {std}) | MEAN, STDDEV |
 | NORMALIZE(x) | numeric_column({var_name}, normalizer_fn=lambda x : (x - {min}) / ({max} - {min})) | MAX, MIN |
 | LOG(x) | numeric_column({var_name}, normalizer_fn=lambda x : tf.math.log(x)) | N/A |
-| BUCKETIZE(x, bucket_num=y) | bucketized_column({var_name}, boundaries={percentiles}) | PERCENTILE |
+| BUCKETIZE(x, num_buckets=y) | bucketized_column({var_name}, boundaries={percentiles}) | PERCENTILE |
 
 ## Further Consideration
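To illustrate the table above, here is a minimal sketch, assuming hypothetical statistics already computed by the analysis step, of the feature-column code that the clause `COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age)), EMBEDDING(BUCKETIZE(hours_per_week, num_buckets=5), dimension=32)` could generate. The constants below are placeholders, not real census statistics, and the code is not SQLFlow's actual generator output.

```python
import tensorflow as tf

# Hypothetical statistics, as if returned by the analysis SQL on census_income.
CAPITAL_GAIN_MIN, CAPITAL_GAIN_MAX = 0.0, 99999.0
AGE_MEAN, AGE_STDDEV = 38.6, 13.6
HOURS_PER_WEEK_BOUNDARIES = [25.0, 35.0, 40.0, 45.0]  # 4 boundaries -> 5 buckets

# NORMALIZE(capital_gain): out = (x - x_min) / (x_max - x_min)
capital_gain = tf.feature_column.numeric_column(
    "capital_gain",
    normalizer_fn=lambda x: (x - CAPITAL_GAIN_MIN) / (CAPITAL_GAIN_MAX - CAPITAL_GAIN_MIN))

# STANDARDIZE(age): out = (x - x_mean) / x_stddev
age = tf.feature_column.numeric_column(
    "age", normalizer_fn=lambda x: (x - AGE_MEAN) / AGE_STDDEV)

# BUCKETIZE(hours_per_week, num_buckets=5) wrapped in EMBEDDING(..., dimension=32)
hours_per_week_buckets = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("hours_per_week"),
    boundaries=HOURS_PER_WEEK_BOUNDARIES)
hours_per_week_embedding = tf.feature_column.embedding_column(
    hours_per_week_buckets, dimension=32)

feature_columns = [capital_gain, age, hours_per_week_embedding]
```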
From f70601ad7523abd5f6a6a33a5935f87b917b6a69 Mon Sep 17 00:00:00 2001
From: "mingliang.gml"
Date: Wed, 12 Feb 2020 20:04:41 +0800
Subject: [PATCH 3/6] Do some rephrase

---
 docs/designs/data_transform.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/designs/data_transform.md b/docs/designs/data_transform.md
index 226b1cc31..b6822dbe0 100644
--- a/docs/designs/data_transform.md
+++ b/docs/designs/data_transform.md
@@ -170,12 +170,12 @@ We can extend the SQLFlow syntax and enrich the `COLUMN` expression. We propose
 
 | Name | Transformation | Statistical Parameter | Input Type | Output Type |
 |:----------------:|:-----------------------------------------------------:|:--------------------:|:-------------------:|:---------------------:|
-| NORMALIZE(x) | Scale to the range [0, 1]. `out = (x - x_min) / (x_max - x_min)` | x_min, x_max | number | float64 |
+| NORMALIZE(x) | Scale the inputs to the range [0, 1]. `out = (x - x_min) / (x_max - x_min)` | x_min, x_max | number | float64 |
-| STANDARDIZE(x) | Scale to z-scores by subtracting the mean and dividing by the standard deviation. `out = (x - x_mean) / x_stddev` | x_mean, x_stddev | number | float64 |
+| STANDARDIZE(x) | Scale the inputs to z-scores by subtracting the mean and dividing by the standard deviation. `out = (x - x_mean) / x_stddev` | x_mean, x_stddev | number | float64 |
 | BUCKETIZE(x, num_buckets, boundaries) | Transform the numeric features into categorical ids using a set of thresholds. | boundaries | Number | int64 |
 | HASH_BUCKET(x, hash_bucket_size) | Map the inputs into a finite number of buckets by hashing. `out_id = Hash(input_feature) % bucket_size` | hash_bucket_size | string, int32, int64 | int64 |
-| VOCABULARIZE(x) | Map the input to integer ids by looking up the vocabulary. | vocabulary_list | string, int32, int64 | int64 |
+| VOCABULARIZE(x) | Map the inputs to integer ids by looking up the vocabulary. | vocabulary_list | string, int32, int64 | int64 |
-| EMBEDDING(x, dimension) | Map the input to embedding vectors. | N/A | int32, int64 | float32 |
+| EMBEDDING(x, dimension) | Map the inputs to embedding vectors. | N/A | int32, int64 | float32 |
 | CROSS(x1, x2, ..., xn, hash_bucket_size) | `Hash(cartesian product of features) % hash_bucket_size` | N/A | string, number | int64 |
 | CONCAT(x1, x2, ..., xn) | Concatenate multiple tensors representing categorical ids into one tensor. | N/A | int32, int64 | int64 |
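For the categorical rows in this table, the following minimal sketch shows the intended behavior of `VOCABULARIZE` and `HASH_BUCKET` in terms of existing TensorFlow feature columns; the vocabulary and the bucket size here are hypothetical.

```python
import tensorflow as tf

# VOCABULARIZE(workclass): integer ids come from a vocabulary (hard-coded here;
# in SQLFlow it would be collected by the analysis step). Unknown values can
# fall into an out-of-vocabulary bucket.
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass",
    vocabulary_list=["Private", "Self-emp-not-inc", "Federal-gov", "State-gov"],
    num_oov_buckets=1)

# HASH_BUCKET(occupation, hash_bucket_size=100): out_id = Hash(input) % 100.
# No vocabulary is required, at the cost of possible hash collisions.
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=100)

# Either column can then feed EMBEDDING, e.g. a 16-dim embedding of occupation.
occupation_embedding = tf.feature_column.embedding_column(occupation, dimension=16)
```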
From 23a8dfa36b731087ecf5110c28d3e42e7d9417e2 Mon Sep 17 00:00:00 2001
From: "mingliang.gml"
Date: Wed, 12 Feb 2020 20:33:29 +0800
Subject: [PATCH 4/6] Do some rephrase

---
 docs/designs/data_transform.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/designs/data_transform.md b/docs/designs/data_transform.md
index b6822dbe0..363dd8859 100644
--- a/docs/designs/data_transform.md
+++ b/docs/designs/data_transform.md
@@ -269,7 +269,7 @@ We plan to implement the following commonly used transform APIs in the first step.
 
 | STANDARDIZE(x) | numeric_column({var_name}, normalizer_fn=lambda x : (x - {mean}) / {std}) | MEAN, STDDEV |
 | NORMALIZE(x) | numeric_column({var_name}, normalizer_fn=lambda x : (x - {min}) / ({max} - {min})) | MAX, MIN |
 | LOG(x) | numeric_column({var_name}, normalizer_fn=lambda x : tf.math.log(x)) | N/A |
-| BUCKETIZE(x, num_buckets=y) | bucketized_column({var_name}, boundaries={percentiles}) | PERCENTILE | 
+| BUCKETIZE(x, num_buckets=y) | bucketized_column({var_name}, boundaries={percentiles}) | PERCENTILE |
 
 ## Further Consideration

From 373442b8a3b41fa2773599bc59095f062da7acd9 Mon Sep 17 00:00:00 2001
From: "mingliang.gml"
Date: Thu, 13 Feb 2020 14:44:45 +0800
Subject: [PATCH 5/6] Update the syntax example according to the discussion in
 issue #1664

---
 docs/designs/data_transform.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/designs/data_transform.md b/docs/designs/data_transform.md
index 363dd8859..6bc21ec4c 100644
--- a/docs/designs/data_transform.md
+++ b/docs/designs/data_transform.md
@@ -188,11 +188,11 @@ SELECT *
 FROM census_income
 TO TRAIN DNNClassifier
 WITH model.hidden_units = [10, 20]
-COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age)), EMBEDDING(BUCKETIZE(hours_per_week, num_buckets=5), dimension=32)
+COLUMN NORMALIZE(capital_gain), STANDARDIZE(age), EMBEDDING(hours_per_week, dimension=32)
 LABEL label
 ```
 
-It trains a DNN model to classify someone's income level using the [census income dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income). The transform expression is `COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age)), EMBEDDING(BUCKETIZE(hours_per_week, num_buckets=5), dimension=32)`. It will normalize the column *capital_gain*, standardize the column *age*, bucketize the column *hours_per_week* into 5 buckets, and then map it to an embedding vector.
+It trains a DNN model to classify someone's income level using the [census income dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income). The transform expression is `COLUMN NORMALIZE(capital_gain), STANDARDIZE(age), EMBEDDING(hours_per_week, dimension=32)`. It will normalize the column *capital_gain*, standardize the column *age*, and then map *hours_per_week* to an embedding vector.
 
 Next, let's see a more complicated scenario. The following SQL statement trains a [wide and deep model](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html) using the same dataset.
@@ -216,7 +216,7 @@ SQLFlow will convert the `COLUMN` expression to Python code of data transformati
 
 ### Generate Analysis SQL From SQLFlow Statement
 
-SQLFlow will generate the analysis SQL to calculate the statistical values. For this clause `COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age))`, the corresponding analysis SQL is as follows:
+SQLFlow will generate the analysis SQL to calculate the statistical values. For this clause `COLUMN NORMALIZE(capital_gain), STANDARDIZE(age)`, the corresponding analysis SQL is as follows:
 
 ```SQL
 SELECT
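Since the analysis SQL is itself derived from the `COLUMN` clause, a small generator sketch may help. The statistic-to-aggregate mapping and the helper below are illustrative assumptions following the statistical parameters named in the design (MIN/MAX for NORMALIZE, MEAN/STDDEV for STANDARDIZE), not SQLFlow's actual implementation; aggregate spellings such as `STDDEV` vary by SQL dialect.

```python
# Aggregates per transform function, keyed by the COLUMN function name.
ANALYSIS_AGGREGATES = {
    "NORMALIZE": ("MIN({col}) AS {col}_min", "MAX({col}) AS {col}_max"),
    "STANDARDIZE": ("AVG({col}) AS {col}_mean", "STDDEV({col}) AS {col}_stddev"),
}

def generate_analysis_sql(table, transforms):
    """transforms: (function_name, column_name) pairs parsed from COLUMN."""
    items = [
        expr.format(col=col)
        for func, col in transforms
        for expr in ANALYSIS_AGGREGATES[func]
    ]
    return "SELECT\n  " + ",\n  ".join(items) + "\nFROM " + table

# For COLUMN NORMALIZE(capital_gain), STANDARDIZE(age) this prints a query
# selecting MIN/MAX of capital_gain and AVG/STDDEV of age from census_income.
print(generate_analysis_sql(
    "census_income",
    [("NORMALIZE", "capital_gain"), ("STANDARDIZE", "age")]))
```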
From 6bda052eba992309edd5c99a0411e1cab4a6c5a1 Mon Sep 17 00:00:00 2001
From: "mingliang.gml"
Date: Thu, 13 Feb 2020 14:53:35 +0800
Subject: [PATCH 6/6] Do some rephrase

---
 docs/designs/data_transform.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/designs/data_transform.md b/docs/designs/data_transform.md
index 6bc21ec4c..cc5484d51 100644
--- a/docs/designs/data_transform.md
+++ b/docs/designs/data_transform.md
@@ -170,9 +170,9 @@ We can extend the SQLFlow syntax and enrich the `COLUMN` expression. We propose
 
 | Name | Transformation | Statistical Parameter | Input Type | Output Type |
 |:----------------:|:-----------------------------------------------------:|:--------------------:|:-------------------:|:---------------------:|
-| NORMALIZE(x) | Scale the inputs to the range [0, 1]. `out = (x - x_min) / (x_max - x_min)` | x_min, x_max | number | float64 |
+| NORMALIZE(x) | Scale the inputs to the range [0, 1]. `out = (x - x_min) / (x_max - x_min)` | x_min, x_max | number (int, float) | float64 |
 | STANDARDIZE(x) | Scale the inputs to z-scores by subtracting the mean and dividing by the standard deviation. `out = (x - x_mean) / x_stddev` | x_mean, x_stddev | number | float64 |
-| BUCKETIZE(x, num_buckets, boundaries) | Transform the numeric features into categorical ids using a set of thresholds. | boundaries | Number | int64 |
+| BUCKETIZE(x, num_buckets, boundaries) | Transform the numeric features into categorical ids using a set of thresholds. | boundaries | number | int64 |
 | HASH_BUCKET(x, hash_bucket_size) | Map the inputs into a finite number of buckets by hashing. `out_id = Hash(input_feature) % bucket_size` | hash_bucket_size | string, int32, int64 | int64 |
 | VOCABULARIZE(x) | Map the inputs to integer ids by looking up the vocabulary. | vocabulary_list | string, int32, int64 | int64 |
 | EMBEDDING(x, dimension) | Map the inputs to embedding vectors. | N/A | int32, int64 | float32 |