
Conversation

@rjzamora (Member)

Description

Now that rapidsai/rapidsmpf#685 is in, we can use ShufflerAsync in cudf-polars for the rapidsmpf runtime. This PR also makes some improvements to the ShuffleContext in preparation for multi-GPU support.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora self-assigned this Nov 21, 2025
@rjzamora rjzamora requested a review from a team as a code owner November 21, 2025 18:18
@rjzamora rjzamora added the 3 - Ready for Review, improvement, and non-breaking labels Nov 21, 2025
@github-actions github-actions bot added the Python and cudf-polars labels Nov 21, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Nov 21, 2025
ir_context=ir_context,
)
# Reserve shuffle IDs for the entire pipeline execution
with ReserveOpIDs(ir) as shuffle_id_map:
@rjzamora (Member, Author):

Most of the changes in this file correspond to the extra indentation needed for this ReserveOpIDs context. We also pass the shuffle_id_map into generate_network (and it is added to GenState for pipeline construction). For multi-GPU execution, we need to reserve the shuffle IDs ahead of time, so we might as well make that change now.
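
For reference, a minimal sketch of what a ReserveOpIDs-style context manager could look like (the traverse helper, the Shuffle node type, and _reserve_shuffle_id are illustrative assumptions, not the actual implementation; _release_shuffle_id appears in the diff below):

```python
# Hypothetical sketch only -- names other than ReserveOpIDs and
# _release_shuffle_id are illustrative assumptions.
class ReserveOpIDs:
    """Reserve shuffle IDs for every Shuffle node in an IR graph up front."""

    def __init__(self, ir: IR) -> None:
        self.ir = ir
        self.shuffle_id_map: dict[IR, int] = {}

    def __enter__(self) -> dict[IR, int]:
        # Reserve an ID for each Shuffle node *before* pipeline
        # construction, so multi-GPU ranks can agree on the IDs.
        for node in traverse(self.ir):  # assumed graph-traversal helper
            if isinstance(node, Shuffle):
                self.shuffle_id_map[node] = _reserve_shuffle_id()
        return self.shuffle_id_map

    def __exit__(self, *exc: object) -> None:
        # Release every reserved ID once the pipeline has executed.
        for op_id in self.shuffle_id_map.values():
            _release_shuffle_id(op_id)
```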

# Collect all Shuffle nodes.
# NOTE: We will also need to collect Repartition,
# and Join nodes to support multi-GPU execution.
self.shuffle_nodes: list[IR] = [
Contributor:

When we do support Repartition, will we store them in the same list (shuffle_nodes) / dictionary (shuffle_id_map) as the shuffle nodes? I'm wondering if we can easily future-proof the interface at all, so that we don't need a ton of noisy changes when we go to support other types.

@rjzamora (Member, Author):

Yeah, it's the same list. We can just call it collective_nodes to make that clearer.

@rjzamora (Member, Author):

I renamed everything to collective_id/collective_nodes and refactored some of the common logic into a separate file.
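
Roughly, the refactored collection could look like the following sketch (IR, Shuffle, and traverse stand in for the real cudf-polars types and helpers; Repartition/Join handling is shown only as a comment since multi-GPU support is still pending):

```python
# Hypothetical sketch only -- illustrates the collective_nodes rename.
def collect_collective_nodes(ir: IR) -> list[IR]:
    """Collect all collective (shuffle-like) nodes in an IR graph."""
    # Shuffle is the only collective node type today; Repartition and
    # Join would be added here to support multi-GPU execution.
    collective_types: tuple[type, ...] = (Shuffle,)
    return [node for node in traverse(ir) if isinstance(node, collective_types)]
```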

-    self.shuffler.shutdown()
-    _release_shuffle_id(self.op_id)
+    """Exit the ShuffleContext manager."""
+    del self.shuffler
Contributor:

Is shutting down the shuffler not sufficient anymore? I'm not a huge fan of changing the attributes available on an object (you need to know what context you're in to know whether it's safe to call some method).

And can you remind me what the actual state is here that needs to be set up / torn down? Do we just need reliable shutdown of the ShufflerAsync?

@rjzamora (Member, Author):

We probably don't need this anymore. The original shuffler context was dynamically releasing the shuffle id at runtime. This was working fine for single-GPU execution, but is no longer reliable.

@rjzamora (Member, Author):

Ok, one clarification: The shutdown method does not exist for ShufflerAsync (only for the synchronous Shuffler). It's not clear if we really gain much by explicitly cleaning up in an __exit__ definition, but I also don't think it "hurts". Let me know if I should change anything here.
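
For context, a sketch of the simplified exit path (the constructor signature is an assumption; only the __exit__ body mirrors the diff above):

```python
class ShuffleContext:
    """Hypothetical sketch -- only __exit__ reflects the actual change."""

    def __init__(self, shuffler: object) -> None:
        # `shuffler` is assumed to be a rapidsmpf ShufflerAsync instance.
        self.shuffler = shuffler

    def __enter__(self) -> "ShuffleContext":
        return self

    def __exit__(self, *exc: object) -> None:
        """Exit the ShuffleContext manager."""
        # ShufflerAsync has no shutdown() (only the synchronous Shuffler
        # does), so we just drop the reference and let it clean itself up.
        # Shuffle IDs are now reserved/released up front by ReserveOpIDs.
        del self.shuffler
```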
