dataloading slow when using HUGE dataset #2210

Hi,

When I use a dataset of about 600GB, data loading becomes significantly slower.
I am experimenting with two datasets: one is about 60GB and the other about 600GB.
Simply put, my code calls `datasets.set_format("torch")` and lets pytorch-lightning handle DDP training.
Comparing the pytorch-lightning profiler output of the two runs, I see that fetching a batch (`get_train_batch`) takes an unreasonable amount of time when the data is large: about 6.3 s per call on average with 600GB versus about 0.066 s with 60GB. What could be the cause?

* 60GB data
```
Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  200.33         	|  100 %          	|
------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  71.994         	|1              	|  71.994         	|  35.937         	|
run_training_batch                 	|  0.64373        	|100            	|  64.373         	|  32.133         	|
optimizer_step_and_closure_0       	|  0.64322        	|100            	|  64.322         	|  32.108         	|
training_step_and_backward         	|  0.61004        	|100            	|  61.004         	|  30.452         	|
model_backward                     	|  0.37552        	|100            	|  37.552         	|  18.745         	|
model_forward                      	|  0.22813        	|100            	|  22.813         	|  11.387         	|
training_step                      	|  0.22759        	|100            	|  22.759         	|  11.361         	|
get_train_batch                    	|  0.066385       	|100            	|  6.6385         	|  3.3138         	|
```

* 600GB data
```
Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  3285.6         	|  100 %          	|
------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  1397.9         	|1              	|  1397.9         	|  42.546         	|
run_training_batch                 	|  7.2596         	|100            	|  725.96         	|  22.095         	|
optimizer_step_and_closure_0       	|  7.2589         	|100            	|  725.89         	|  22.093         	|
training_step_and_backward         	|  7.223          	|100            	|  722.3          	|  21.984         	|
model_backward                     	|  6.9662         	|100            	|  696.62         	|  21.202         	|
get_train_batch                    	|  6.322          	|100            	|  632.2          	|  19.241         	|
model_forward                      	|  0.24902        	|100            	|  24.902         	|  0.75789        	|
training_step                      	|  0.2485         	|100            	|  24.85          	|  0.75633        	|
```
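
To narrow down whether the time is spent in the data pipeline itself rather than in pytorch-lightning or DDP, batch fetching could be timed on its own, outside the trainer. Below is a minimal sketch, assuming the loader is any Python iterable (for example a `torch.utils.data.DataLoader` built on the formatted dataset). `time_batch_fetches` is a hypothetical helper name for illustration, not part of `datasets` or pytorch-lightning.

```python
import time
from typing import Iterable, List


def time_batch_fetches(loader: Iterable, num_batches: int = 100) -> List[float]:
    """Return the wall-clock seconds spent waiting for each fetched batch.

    Stops after `num_batches` batches or when the loader is exhausted,
    whichever comes first.
    """
    durations: List[float] = []
    iterator = iter(loader)
    for _ in range(num_batches):
        start = time.perf_counter()
        try:
            next(iterator)  # only the fetch is timed, no model work
        except StopIteration:
            break
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in for a real DataLoader (assumption: any iterable works here).
    fake_loader = ([i] * 4 for i in range(10))
    durations = time_batch_fetches(fake_loader, num_batches=5)
    mean = sum(durations) / len(durations)
    print(f"mean fetch time: {mean:.6f} s over {len(durations)} batches")
```

Running this against both the 60GB and the 600GB dataset with the same batch size would show whether the gap in `get_train_batch` is reproducible without the training loop.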

