Map is slow and processes batches one after another #2243

@villmow

Describe the bug

I've run into a bug that I can't pin down. The code works as expected on a small subset of my dataset (2000 samples) on my local machine, but when I run the same code on the larger dataset (1.4 million samples), the problem described below occurs. That's why I can't give exact steps to reproduce, sorry.

I process a large dataset in two steps. First, I call map on a dataset that I load from disk and create a new dataset from it. This works as expected, and map uses all the workers I started it with. Then I process the dataset created by the first step, again with map. This second call is really slow and starts only one or two processes at a time. The number of processes is the same for both steps.

Pseudocode:

import datasets

ds = datasets.load_from_disk("path")
new_dataset = ds.map(work, batched=True, ...)  # fast: uses all processes
final_dataset = new_dataset.map(work2, batched=True, ...)  # slow: starts one process after another
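
To narrow down whether the second stage really runs on only one or two workers at a time, it might help to log which process handles each batch and how long it takes. The following is only a rough sketch, not something taken from my actual code: `log_worker` is a hypothetical helper, the `work2` body and the `"text"` column are made-up stand-ins, and it assumes that a batched function receives a dict mapping column names to lists.

```python
import functools
import os
import time


def log_worker(batch_fn):
    """Wrap a batched map function so each call reports its PID, batch size and duration."""

    @functools.wraps(batch_fn)
    def wrapper(batch, *args, **kwargs):
        start = time.perf_counter()
        result = batch_fn(batch, *args, **kwargs)
        elapsed = time.perf_counter() - start
        # Assumption: the batch is a dict of column name -> list of values,
        # so the length of any column is the number of rows in the batch.
        n_rows = len(next(iter(batch.values()))) if batch else 0
        print(f"pid={os.getpid()} rows={n_rows} took={elapsed:.3f}s", flush=True)
        return result

    return wrapper


def work2(batch):
    # Hypothetical stand-in for the real second-stage function.
    return {"length": [len(text) for text in batch["text"]]}


# Possible usage with the pipeline above (not executed here):
# final_dataset = new_dataset.map(log_worker(work2), batched=True, ...)
```

If the log shows many distinct PIDs but long gaps between lines, the workers are probably starting late or waiting on something; if only one or two PIDs ever appear, the work is not being spread across processes. Note that wrapping the function may change how its cached results are identified, so this is meant for debugging runs only.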

Expected results

Second stage should be as fast as the first stage.

Versions

  • Datasets: 1.5.0
  • Python: 3.8.8 (default, Feb 24 2021, 21:46:12)
  • Platform: Linux-5.4.0-60-generic-x86_64-with-glibc2.10

Do you guys have any idea? Thanks a lot!

Labels: bug
