Skip to content

dataloading slow when using HUGE dataset #2210

@hwijeen

Description

@hwijeen

Hi,

When I use datasets with 600GB data, the dataloading speed increases significantly.
I am experimenting with two datasets, and one is about 60GB and the other 600GB.
Simply speaking, my code uses datasets.set_format("torch") function and let pytorch-lightning handle ddp training.
When looking at the pytorch-lightning supported profile of two different runs, I see that fetching a batch(get_train_batch) consumes an unreasonable amount of time when data is large. What could be the cause?

  • 60GB data
Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  200.33         	|  100 %          	|
------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  71.994         	|1              	|  71.994         	|  35.937         	|
run_training_batch                 	|  0.64373        	|100            	|  64.373         	|  32.133         	|
optimizer_step_and_closure_0       	|  0.64322        	|100            	|  64.322         	|  32.108         	|
training_step_and_backward         	|  0.61004        	|100            	|  61.004         	|  30.452         	|
model_backward                     	|  0.37552        	|100            	|  37.552         	|  18.745         	|
model_forward                      	|  0.22813        	|100            	|  22.813         	|  11.387         	|
training_step                      	|  0.22759        	|100            	|  22.759         	|  11.361         	|
get_train_batch                    	|  0.066385       	|100            	|  6.6385         	|  3.3138         	|
  • 600GB data
Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  3285.6         	|  100 %          	|
------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  1397.9         	|1              	|  1397.9         	|  42.546         	|
run_training_batch                 	|  7.2596         	|100            	|  725.96         	|  22.095         	|
optimizer_step_and_closure_0       	|  7.2589         	|100            	|  725.89         	|  22.093         	|
training_step_and_backward         	|  7.223          	|100            	|  722.3          	|  21.984         	|
model_backward                     	|  6.9662         	|100            	|  696.62         	|  21.202         	|
get_train_batch                    	|  6.322          	|100            	|  632.2          	|  19.241         	|
model_forward                      	|  0.24902        	|100            	|  24.902         	|  0.75789        	|
training_step                      	|  0.2485         	|100            	|  24.85          	|  0.75633        	|

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions