-
Notifications
You must be signed in to change notification settings - Fork 3k
Closed
Description
Hi,
When I use datasets with 600GB data, the dataloading speed increases significantly.
I am experimenting with two datasets, and one is about 60GB and the other 600GB.
Simply speaking, my code uses datasets.set_format("torch") function and let pytorch-lightning handle ddp training.
When looking at the pytorch-lightning supported profile of two different runs, I see that fetching a batch(get_train_batch) consumes an unreasonable amount of time when data is large. What could be the cause?
- 60GB data
Action | Mean duration (s) |Num calls | Total time (s) | Percentage % |
------------------------------------------------------------------------------------------------------------------------------------
Total | - |_ | 200.33 | 100 % |
------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch | 71.994 |1 | 71.994 | 35.937 |
run_training_batch | 0.64373 |100 | 64.373 | 32.133 |
optimizer_step_and_closure_0 | 0.64322 |100 | 64.322 | 32.108 |
training_step_and_backward | 0.61004 |100 | 61.004 | 30.452 |
model_backward | 0.37552 |100 | 37.552 | 18.745 |
model_forward | 0.22813 |100 | 22.813 | 11.387 |
training_step | 0.22759 |100 | 22.759 | 11.361 |
get_train_batch | 0.066385 |100 | 6.6385 | 3.3138 |
- 600GB data
Action | Mean duration (s) |Num calls | Total time (s) | Percentage % |
------------------------------------------------------------------------------------------------------------------------------------
Total | - |_ | 3285.6 | 100 % |
------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch | 1397.9 |1 | 1397.9 | 42.546 |
run_training_batch | 7.2596 |100 | 725.96 | 22.095 |
optimizer_step_and_closure_0 | 7.2589 |100 | 725.89 | 22.093 |
training_step_and_backward | 7.223 |100 | 722.3 | 21.984 |
model_backward | 6.9662 |100 | 696.62 | 21.202 |
get_train_batch | 6.322 |100 | 632.2 | 19.241 |
model_forward | 0.24902 |100 | 24.902 | 0.75789 |
training_step | 0.2485 |100 | 24.85 | 0.75633 |
Metadata
Metadata
Assignees
Labels
No labels