Describe the bug
This bug is somewhat unclear to me, and I can't figure out what the problem is. The code works as expected on a small subset of my dataset (2,000 samples) on my local machine, but when I run the same code on a larger dataset (1.4 million samples), the problem occurs. That's why I can't give exact steps to reproduce; I'm sorry.
I process a large dataset in two steps. First I call map on a dataset I load from disk and create a new dataset from it. This works as expected, and map uses all the workers I started it with. Then I process the dataset created in the first step, again with map, which is really slow and starts only one or two processes at a time. The number of processes is the same for both steps.
Pseudo code:
import datasets

ds = datasets.load_from_disk("path")
new_dataset = ds.map(work, batched=True, ...)  # fast: uses all processes
final_dataset = new_dataset.map(work2, batched=True, ...)  # slow: starts one process after another

Expected results
The second stage should be as fast as the first stage.
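For reference, a fully self-contained sketch of the pattern (the work functions, num_proc value, and dataset contents are placeholders, not my real pipeline, and at this small size it won't show the slowdown):

import datasets

def work(batch):
    # stand-in for the real first-stage processing
    return batch

def work2(batch):
    # stand-in for the real second-stage processing
    return batch

ds = datasets.Dataset.from_dict({"x": list(range(100_000))})
new_dataset = ds.map(work, batched=True, num_proc=8)  # all 8 workers start immediately
final_dataset = new_dataset.map(work2, batched=True, num_proc=8)  # workers start one after another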
Versions
Output of the version-check code (a sketch of the snippet follows the list):
- Datasets: 1.5.0
- Python: 3.8.8 (default, Feb 24 2021, 21:46:12)
- Platform: Linux-5.4.0-60-generic-x86_64-with-glibc2.10
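I didn't paste the template's snippet itself; something like this prints the same fields, assuming only datasets and the standard library:

import sys
import platform
import datasets

print("- Datasets:", datasets.__version__)
print("- Python:", sys.version.split("\n")[0])  # first line only, without compiler info
print("- Platform:", platform.platform())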
Do you guys have any idea? Thanks a lot!