-
Notifications
You must be signed in to change notification settings - Fork 39
dcn on Criteo #335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
dcn on Criteo #335
Changes from 44 commits
Commits
Show all changes
56 commits
Select commit
Hold shift + click to select a range
07f6ed3
add dcn files.
jiangyzy 8882312
add README.md
jiangyzy e953d85
update readme.md, requirements.txt, train.sh. pretrained models cover…
jiangyzy d241dbf
deleted files
jiangyzy e2d76dd
deleted files
jiangyzy a632cb5
auto format by CI
oneflow-ci-bot 62b98d0
deleted .gitignore
jiangyzy 543a1bb
deleted .gitignore
jiangyzy 6e3a757
updated files
jiangyzy 68e8e49
modified nn.init.zeros_ and nn.init.xavier_normal_ in crossnet.
jiangyzy 06dfe97
fix change form /scripts/swin_dataloader_compare_speed_with_pytorch.py
jiangyzy c672dc4
add processing frappe from csv to parqurt format files: tools/frap…
jiangyzy 03c5763
modified frappe download link in README.md
jiangyzy d9dfd92
delete tools dir
jiangyzy 249b794
add tools dir
jiangyzy b87dbab
update dcn_graph_train_eval files
jiangyzy f14901b
Merge branch 'main' of https://github.com/Oneflow-Inc/models into dcn…
jiangyzy bc1d6d6
update fuxi dcn graph train and eval files , new dataset make tool ba…
jiangyzy 472645f
modified train.sh table_size_array
jiangyzy 91a88a4
fix some erroe in fuxi_data_util when save csv
jiangyzy 900cab4
Merge remote-tracking branch 'origin/dcn_fuxi_train_eval' into main
jiangyzy 09f6501
Criteo dcn related files
jiangyzy bb364fb
modified README.md
jiangyzy 6d477b8
modified dcn_train_eval.py some arguments name
jiangyzy 3c9201b
create graph when lr_decay
jiangyzy e987402
deleted fm_persistent
jiangyzy 2e50a5b
update dcn_train_eval.py
jiangyzy bd77d0f
formated file by
jiangyzy 329f789
new tool dir , and modified dcn_train_eval.py/sh fake path
jiangyzy a589398
add feature_map_json argment
jiangyzy 572d969
delete unnecessary and useless code
jiangyzy 24fe34d
add cast in make_criteo_parquet.py, modified dcn_train_eval.py
jiangyzy d29f5ee
delete useless
jiangyzy 6297862
add throughput
jiangyzy e14f763
Merge branch 'main' of https://github.com/Oneflow-Inc/models into cri…
jiangyzy b2dc7d6
add valid test samples arg
jiangyzy 9bae48b
fix batch_size and train_batch_size mismatched problem
jiangyzy 1960bd7
delete uesless print code
jiangyzy ed02f35
add a blank line in the bottom of dataset_config.yaml
jiangyzy 8bfb01b
add requirements.txt, update README.md
jiangyzy 5ea1f41
move loss=loss.numpy() to improve efficiency
jiangyzy 22a0477
delete fuxi code in dcn_train_eval.py, add scala related files, upda…
jiangyzy dd87ca6
update README
jiangyzy 062680b
remove RecommenderSystems/dcn/tools/make_criteo_parquet.py and Recom…
jiangyzy 3adeb3b
simplified DNN module, modified test eval process and related READEM…
jiangyzy cc9b8c4
add Crossnet fuxi quote, modified directory description in Readme an…
jiangyzy 1cef2f6
name auc loglogg in eval process as val_auc val_logloss, add pandas …
jiangyzy 314831f
simplified train.sh and related README contents
jiangyzy 2bb71e4
simplified L2,3,4 in train.sh
jiangyzy b2574fb
set size_factor default=3
jiangyzy fffafce
add dcn structure image
jiangyzy 60f8505
update Crossnet implementation in README
jiangyzy b1f9403
update Crossnet implementation in README
jiangyzy 104127f
update Crossnet implementation in README
jiangyzy 8eac6b3
update Crossnet implementation in README
jiangyzy ee21320
update README
jiangyzy File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,167 @@ | ||
| # Deep&Cross | ||
| [Deep & Cross Network](https://dl.acm.org/doi/10.1145/3124749.3124754) (DCN) can not only keep the advantages of DNN model, but also learn specific bounded feature crossover more effectively. In particular, DCN can explicitly learn cross features for each layer without the need for manual feature engineering, and the increased algorithm complexity is almost negligible compared with DNN model. | ||
|
|
||
|
|
||
| ## Directory description | ||
| ``` | ||
| . | ||
| |-- tools | ||
| |-- dataset_config.yaml # dataset config file | ||
| |-- split_criteo.py # split Criteo txt file to csv files | ||
| |-- make_criteo_parquet.py # make Criteo parquet data from csv files | ||
| |-- dcn_train_eval.py # OneFlow DCN training and evaluation scripts with OneEmbedding module | ||
| |-- train.sh # command to train DCN | ||
| |-- requirements.txt # python package configuration file | ||
| └── README.md # Documentation | ||
| ``` | ||
|
|
||
|
|
||
| ## Arguments description | ||
| |Argument Name|Argument Explanation|Default Value| | ||
| |-----|---|------| | ||
| |data_dir|the data file directory|*Required Argument*| | ||
| |num_train_samples|the number of training samples|36672493| | ||
| |num_valid_samples|the number of validation samples|4584062| | ||
| |num_test_samples|the number of test samples|4584062| | ||
| |shard_seed|seed for shuffling parquet data|2022| | ||
| |model_load_dir|model loading directory|None| | ||
| |model_save_dir|model saving directory|None| | ||
| |save_initial_model|save initial model parameters or not|False| | ||
| |save_model_after_each_eval|save model or not after each evaluation|False| | ||
| |embedding_vec_size|embedding vector dimention size|128| | ||
| |batch_norm|batch norm used in DNN|False| | ||
| |dnn_hidden_units|hidden units list of DNN|"1000,1000,1000,1000,1000"| | ||
| |crossing_layers|layer number of Crossnet|3| | ||
| |net_dropout|dropout rate of DNN|0.2| | ||
| |embedding_regularizer|rate of embedding layer regularizer|None| | ||
| |net_regularizer|rate of Crossnet and DNN layer regularizer|None| | ||
| |patience|waiting epoch of ealy stopping|2| | ||
| |min_delta|minimal delta of metric Monitor|1.0e-6| | ||
| |lr_factor|learning rate decay factor|0.1| | ||
| |min_lr|minimal learning rate|1.0e-6| | ||
| |learning_rate|learning rate|0.001| | ||
| |size_factor|size factor of OneEmbedding|3| | ||
| |valid_batch_size|valid batch size|10000| | ||
| |valid_batches|number of valid batches|1000| | ||
| |test_batch_size|test batch size|10000| | ||
| |test_batches|number of test batches|1000| | ||
| |train_batch_size|train batch size|10000| | ||
| |train_batches|number of train batches|15000| | ||
| |loss_print_interval|training loss print interval|1000| | ||
| |train_batch_size|training batch size|55296| | ||
| |train_batches|number of minibatch training interations|75000| | ||
| |table_size_array|table size array for sparse fields|*Required Argument*| | ||
| |persistent_path|path for OneEmbedding persistent kv store|*Required Argument*| | ||
| |store_type|OneEmbeddig persistent kv store type: `device_mem`, `cached_host_mem` or `cached_ssd` |cached_ssd| | ||
| |cache_memory_budget_mb|size of cache memory budget on each device in megabytes when `store_type` is `cached_host_mem` or `cached_ssd`|8192| | ||
| |amp|enable Automatic Mixed Precision(AMP) training|False| | ||
| |loss_scale_policy|loss scale policy for AMP training: `static` or `dynamic`|static| | ||
|
|
||
| ## Getting started | ||
| If you'd like to quickly train a OneFlow DCN model, please follow steps below: | ||
| ### Installing OneFlow and Dependencies | ||
| To install nightly release of OneFlow with CUDA 11.5 support: | ||
| ``` | ||
| python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/cu115 | ||
| ``` | ||
| For more information how to install Oneflow, please refer to [Oneflow Installation Tutorial]( | ||
| https://github.com/Oneflow-Inc/oneflow#install-oneflow). | ||
|
|
||
| Please check `requirements.txt` to install dependencies manually or execute: | ||
| ```bash | ||
| python3 -m pip install -r requirements.txt | ||
| ``` | ||
| ### Preparing dataset | ||
| The Criteo dataset is from [2014-kaggle-display-advertising-challenge-dataset](https://www.kaggle.com/competitions/criteo-display-ad-challenge/overview), considered the original download link is invalid, click [here](https://www.kaggle.com/datasets/mrkmakr/criteo-dataset) to donwload if you would. | ||
|
|
||
| Each sample contains: | ||
| - Label - Target variable that indicates if an ad was clicked (1) or not (0). | ||
| - I1-I13 - A total of 13 columns of integer features (mostly count features). | ||
| - C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes. | ||
|
|
||
|
|
||
| ### Data preprocess: | ||
| 1. Download the [Criteo Kaggle dataset](https://www.kaggle.com/c/criteo-display-ad-challenge) and then split it using split_criteo_kaggle.py. | ||
|
|
||
| 2. launch a spark shell using [launch_spark.sh](https://github.com/Oneflow-Inc/models/blob/criteo_dcn/RecommenderSystems/dcn/tools/launch_spark.sh). | ||
|
|
||
| - Modify the SPARK_LOCAL_DIRS as needed | ||
|
|
||
| ```shell | ||
| export SPARK_LOCAL_DIRS=/path/to/your/spark/ | ||
| ``` | ||
|
|
||
| - Run `bash launch_spark.sh` | ||
|
|
||
| 3. load [ddn_parquet.scala](https://github.com/Oneflow-Inc/models/blob/criteo_dcn/RecommenderSystems/dcn/tools/dcn_parquet.scala) to your spark shell by `:load ddn_parquet.scala`. | ||
ShawnXuan marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| 4. call the `makeDCNDataset(srcDir: String, dstDir:String)` function to generate the dataset. | ||
|
|
||
| ```shell | ||
| makeDCNDataset("/path/to/your/src_dir", "/path/to/your/dst_dir") | ||
| ``` | ||
|
|
||
| After generating parquet dataset, dataset information will also be printed. It contains the information about the number of samples and table size array, which is needed when training. | ||
|
|
||
| ```txt | ||
| train samples = 36672493 | ||
| validation samples = 4584062 | ||
| test samples = 4584062 | ||
| table size array: | ||
| 649,9364,14746,490,476707,11618,4142,1373,7275,13,169,407,1376 | ||
| 1460,583,10131227,2202608,305,24,12517,633,3,93145,5683,8351593,3194,27,14992,5461306,10,5652,2173,4,7046547,18,15,286181,105,142572 | ||
| ``` | ||
|
|
||
|
|
||
| ## Start training by Oneflow | ||
| Following command will launch 8 oneflow DCN training and evaluation processes on a node with 8 GPU devices, by specify `data_dir` for data input and `persistent_path` for OneEmbedding persistent store path. | ||
|
|
||
| `table_size_array` is close related to sparse features of data input. each sparse field such as `C1` or other `C*` field in criteo dataset corresponds to a embedding table and has its own capacity of unique feature ids, this capacity is also called `number of rows` or `size of embedding table`, the embedding table will be initialized by this value. `table_size_array` holds all sparse fields' `size of embedding table`. `table_size_array` is also used to estimate capacity for OneEmbedding. | ||
|
|
||
| ```python | ||
| DEVICE_NUM_PER_NODE=1 | ||
| NUM_NODES=1 | ||
| NODE_RANK=0 | ||
| MASTER_ADDR=127.0.0.1 | ||
| DATA_DIR=your_path/criteo_parquet | ||
| PERSISTENT_PATH=your_path/persistent1 | ||
| MODEL_SAVE_DIR=your_path/model_save_dir | ||
|
|
||
| python3 -m oneflow.distributed.launch \ | ||
| --nproc_per_node $DEVICE_NUM_PER_NODE \ | ||
| --nnodes $NUM_NODES \ | ||
| --node_rank $NODE_RANK \ | ||
| --master_addr $MASTER_ADDR \ | ||
| dcn_train_eval.py \ | ||
| --data_dir $DATA_DIR \ | ||
| --model_save_dir $MODEL_SAVE_DIR \ | ||
| --persistent_path $PERSISTENT_PATH \ | ||
| --save_initial_model \ | ||
| --save_model_after_each_eval \ | ||
| --table_size_array "43, 98, 121, 41, 219, 112, 79, 68, 91, 5, 26, 36, 70, 1447, 554, 157461, 117683, 305, 17, 11878, 629, 4, 39504, 5128, 156729, 3175, 27, 11070, 149083, 11, 4542, 1996, 4, 154737, 17, 16, 52989, 81, 40882" \ | ||
| --store_type 'cached_host_mem' \ | ||
| --dnn_hidden_units "1000, 1000, 1000, 1000, 1000" \ | ||
| --crossing_layers 4\ | ||
| --net_dropout 0.2 \ | ||
| --learning_rate 0.001 \ | ||
| --embedding_vec_size 16 \ | ||
| --cache_memory_budget_mb 2048 \ | ||
| --num_train_samples 36672493 \ | ||
| --num_valid_samples 4584062 \ | ||
| --num_test_samples 4584062 \ | ||
| --train_batch_size 10000 \ | ||
| --train_batches 70000 \ | ||
| --loss_print_interval 100 \ | ||
| --valid_batch_size 10000 \ | ||
| --valid_batches 1000 \ | ||
| --test_batch_size 10000 \ | ||
| --test_batches 1000 \ | ||
| --loss_print_interval 100 \ | ||
| --size_factor 3 | ||
|
|
||
| ``` | ||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.