After normalizing the table schema, we can do data analysis and transformation on it.

We can extend the SQLFlow syntax and enrich the `COLUMN` expression. We propose to add some built-in functions to describe the transform process. We will implement commonly used functions in the first stage.

| Name | Transformation | Statistical Parameter | Input Type | Output Type |
|:----------------:|:-----------------------------------------------------:|:--------------------:|:-------------------:|:---------------------:|
| NORMALIZE(x) | Scale the inputs to the range [0, 1]. `out = (x - x_min) / (x_max - x_min)` | x_min, x_max | number | float64 |
| STANDARDIZE(x) | Scale the inputs to z-scores by subtracting the mean and dividing by the standard deviation. `out = (x - x_mean) / x_stddev` | x_mean, x_stddev | number | float64 |
| BUCKETIZE(x, num_buckets, boundaries) | Transform the numeric features into categorical ids using a set of thresholds. | boundaries | number | int64 |
| HASH_BUCKET(x, hash_bucket_size) | Map the inputs into a finite number of buckets by hashing. `out_id = Hash(input_feature) % bucket_size` | hash_bucket_size | string, int32, int64 | int64 |
| VOCABULARIZE(x) | Map the inputs to integer ids by looking up the vocabulary. | vocabulary_list | string, int32, int64 | int64 |
| EMBEDDING(x, dimension) | Map the inputs to embedding vectors. | N/A | int32, int64 | float32 |
| CROSS(x1, x2, ..., xn, hash_bucket_size) | Cross the features and hash the result into a finite id space. `out_id = Hash(cartesian product of features) % hash_bucket_size` | N/A | string, number | int64 |
| CONCAT(x1, x2, ..., xn) | Concatenate multiple tensors representing categorical ids into one tensor. | N/A | int32, int64 | int64 |

*Please see the [discussion](https://github.com/sql-machine-learning/elasticdl/issues/1723) of the `CONCAT` transform function for more detail.*
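
To make the semantics above concrete, here is a minimal sketch of NORMALIZE and HASH_BUCKET expressed as plain TensorFlow ops; the statistics (`x_min`, `x_max`) and the bucket size are illustrative placeholders, not values prescribed by SQLFlow.

```python
import tensorflow as tf

# Placeholder statistics; in SQLFlow these would come from the analysis step.
x_min, x_max = 0.0, 100000.0

def normalize(x):
    # NORMALIZE: scale the inputs into [0, 1].
    return (x - x_min) / (x_max - x_min)

def hash_bucket(x, hash_bucket_size=1000):
    # HASH_BUCKET: map strings into a finite id space by hashing.
    return tf.strings.to_hash_bucket_fast(x, hash_bucket_size)

print(normalize(tf.constant([0.0, 50000.0, 100000.0])))       # [0.0, 0.5, 1.0]
print(hash_bucket(tf.constant(["Private", "Self-emp-inc"])))  # two ids in [0, 1000)
```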

Let's take the following SQLFlow statement as an example.

```SQL
SELECT *
FROM census_income
TO TRAIN DNNClassifier
WITH model.hidden_units = [10, 20]
COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age)), EMBEDDING(BUCKETIZE(hours_per_week, num_buckets=5), dimension=32)
LABEL label
```

It trains a DNN model to classify someone's income level using the [census income dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income). The transform expression is `COLUMN NUMERIC(NORMALIZE(capital_gain)), NUMERIC(STANDARDIZE(age)), EMBEDDING(BUCKETIZE(hours_per_week, num_buckets=5), dimension=32)`. It normalizes the column *capital_gain*, standardizes the column *age*, bucketizes the column *hours_per_week* into 5 buckets, and then maps the bucket ids to a 32-dimensional embedding vector.

*Please see the [discussion](https://github.com/sql-machine-learning/elasticdl/issues/1664).*
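
As a reference point, the statement above could compile to feature columns along the following lines; the statistics and bucket boundaries are placeholders that the analysis step described below would fill in.

```python
import tensorflow as tf

# Placeholder statistics standing in for the analysis step's output.
gain_min, gain_max = 0.0, 99999.0
age_mean, age_stddev = 38.6, 13.6

# NUMERIC(NORMALIZE(capital_gain))
capital_gain = tf.feature_column.numeric_column(
    "capital_gain",
    normalizer_fn=lambda x: (x - gain_min) / (gain_max - gain_min))

# NUMERIC(STANDARDIZE(age))
age = tf.feature_column.numeric_column(
    "age", normalizer_fn=lambda x: (x - age_mean) / age_stddev)

# BUCKETIZE(hours_per_week, num_buckets=5): 4 boundaries yield 5 buckets.
hours_buckets = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("hours_per_week"),
    boundaries=[25.0, 35.0, 40.0, 48.0])

# EMBEDDING(..., dimension=32)
hours_embedding = tf.feature_column.embedding_column(hours_buckets, dimension=32)
```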
Next, let's look at a more complicated scenario. The following SQL statement trains a [wide and deep model](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html) on the same dataset.

```SQL
SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMN
EMBEDDING(CONCAT(VOCABULARIZE(workclass), BUCKETIZE(capital_gain, num_buckets=5), BUCKETIZE(capital_loss, num_buckets=5), BUCKETIZE(hours_per_week, num_buckets=6)) AS group_1, 8),
EMBEDDING(CONCAT(HASH_BUCKET(education), HASH_BUCKET(occupation), VOCABULARIZE(marital_status), VOCABULARIZE(relationship)) AS group_2, 8),
EMBEDDING(CONCAT(BUCKETIZE(age, num_buckets=5), HASH_BUCKET(native_country), VOCABULARIZE(race), VOCABULARIZE(sex)) AS group_3, 8)
FOR deep_embeddings
COLUMN
EMBEDDING(group_1, 1),
EMBEDDING(group_2, 1)
FOR wide_embeddings
LABEL label
```
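
A subtle point in this statement is `CONCAT`: merging id tensors from different sources into one tensor only makes sense if each source's ids are shifted into disjoint ranges, which is one reading of the `CONCAT` discussion linked above. The sketch below illustrates that idea; the helper name and the id-space sizes are hypothetical.

```python
import tensorflow as tf

def concat_ids(id_tensors, id_space_sizes):
    """Concatenate categorical id tensors, offsetting each input so that
    the merged ids live in disjoint ranges."""
    offset, shifted = 0, []
    for ids, size in zip(id_tensors, id_space_sizes):
        shifted.append(ids + offset)
        offset += size
    return tf.concat(shifted, axis=-1)

# e.g. a workclass id in [0, 9) and a 5-bucket capital_gain id -> ids in [0, 14)
merged = concat_ids(
    [tf.constant([[3]], dtype=tf.int64), tf.constant([[2]], dtype=tf.int64)],
    id_space_sizes=[9, 5])
print(merged)  # [[3, 11]]
```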

SQLFlow will convert the `COLUMN` expression into Python data-transformation code. Some of the transform functions require statistical parameters derived from the data, so we first need an analysis step to compute them.
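
For instance, the statistics needed by the first example could be gathered with a query along these lines; the exact SQL is illustrative, since aggregate functions such as `STDDEV` and percentile computation vary across database engines.

```SQL
SELECT
    MIN(capital_gain), MAX(capital_gain),  -- for NORMALIZE
    AVG(age), STDDEV(age)                  -- for STANDARDIZE
FROM census_income;
```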

We plan to implement the following commonly used transform APIs in the first step.

| Name | Feature Column API | Statistical Parameter |
|:--------------------------:|:--------------------------------------------------------------------------------:|:---------------------:|
| STANDARDIZE(x) | numeric_column({var_name}, normalizer_fn=lambda x: (x - {mean}) / {std}) | MEAN, STDDEV |
| NORMALIZE(x) | numeric_column({var_name}, normalizer_fn=lambda x: (x - {min}) / ({max} - {min})) | MAX, MIN |
| LOG(x) | numeric_column({var_name}, normalizer_fn=lambda x: tf.math.log(x)) | N/A |
| BUCKETIZE(x, num_buckets=y) | bucketized_column({var_name}, boundaries={percentiles}) | PERCENTILE |
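
The `{var_name}`, `{mean}`, and `{std}` placeholders in the table suggest a straightforward template expansion. A minimal sketch of that idea follows; the template string and the statistic values are illustrative, not SQLFlow's actual code generator.

```python
# Render the STANDARDIZE(x) row of the table as Python source code.
TEMPLATE = ("tf.feature_column.numeric_column("
            "'{var_name}', normalizer_fn=lambda x: (x - {mean}) / {std})")

# Placeholder statistics that the analysis step would provide.
print(TEMPLATE.format(var_name="age", mean=38.6, std=13.6))
```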

## Further Consideration
