
Conversation

@rjzamora (Member)

Description

Now that rapidsai/rapidsmpf#685 is in, we can use ShufflerAsync in cudf-polars for the rapidsmpf runtime. This PR also makes some improvements to the ShuffleContext in preparation for multi-GPU support.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora self-assigned this Nov 21, 2025
@rjzamora rjzamora requested a review from a team as a code owner November 21, 2025 18:18
@rjzamora rjzamora added the 3 - Ready for Review, improvement, and non-breaking labels Nov 21, 2025
@github-actions github-actions bot added the Python and cudf-polars labels Nov 21, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Nov 21, 2025
ir_context=ir_context,
)
# Reserve shuffle IDs for the entire pipeline execution
with ReserveOpIDs(ir) as shuffle_id_map:
@rjzamora (Member, Author):

Most of the changes in this file correspond to the extra indentation needed for this ReserveOpIDs context. We also pass the shuffle_id_map into generate_network (and it is added to GenState for pipeline construction). For multi-GPU execution, we need to reserve the shuffle IDs ahead of time, so we might as well make that change now.
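
For reference, a minimal sketch of what a ReserveOpIDs-style context manager could look like (the traverse helper, the Shuffle node type, and _reserve_shuffle_id are illustrative assumptions, not the actual implementation; _release_shuffle_id appears in the diff below):

```python
# Hypothetical sketch only -- names other than ReserveOpIDs and
# _release_shuffle_id are illustrative assumptions.
class ReserveOpIDs:
    """Reserve shuffle IDs for every Shuffle node in an IR graph up front."""

    def __init__(self, ir: IR) -> None:
        self.ir = ir
        self.shuffle_id_map: dict[IR, int] = {}

    def __enter__(self) -> dict[IR, int]:
        # Reserve an ID for each Shuffle node *before* pipeline
        # construction, so multi-GPU ranks can agree on the IDs.
        for node in traverse(self.ir):  # assumed graph-traversal helper
            if isinstance(node, Shuffle):
                self.shuffle_id_map[node] = _reserve_shuffle_id()
        return self.shuffle_id_map

    def __exit__(self, *exc: object) -> None:
        # Release every reserved ID once the pipeline has executed.
        for op_id in self.shuffle_id_map.values():
            _release_shuffle_id(op_id)
```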

# Collect all Shuffle nodes.
# NOTE: We will also need to collect Repartition,
# and Join nodes to support multi-GPU execution.
self.shuffle_nodes: list[IR] = [
Contributor:

When we do support Repartition, will we store them in the same list (shuffle_nodes) / dictionary (shuffle_id_map) as the shuffle nodes? I'm wondering if we can easily future-proof the interface at all, so that we don't need a ton of noisy changes when we go to support other types.

@rjzamora (Member, Author):

Yeah, it's the same list. We can just call it collective_nodes to make that clearer.

@rjzamora (Member, Author):

I renamed everything to collective_id/collective_nodes and refactored some of the common logic into a separate file.
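
Roughly, the refactored collection could look like the following sketch (IR, Shuffle, and traverse stand in for the real cudf-polars types and helpers; Repartition/Join handling is shown only as a comment since multi-GPU support is still pending):

```python
# Hypothetical sketch only -- illustrates the collective_nodes rename.
def collect_collective_nodes(ir: IR) -> list[IR]:
    """Collect all collective (shuffle-like) nodes in an IR graph."""
    # Shuffle is the only collective node type today; Repartition and
    # Join would be added here to support multi-GPU execution.
    collective_types: tuple[type, ...] = (Shuffle,)
    return [node for node in traverse(ir) if isinstance(node, collective_types)]
```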

-    self.shuffler.shutdown()
-    _release_shuffle_id(self.op_id)
+    """Exit the ShuffleContext manager."""
+    del self.shuffler
Contributor:

Is shutting down the shuffler not sufficient anymore? I'm not a huge fan of changing the attributes available on an object (you need to know what context you're in to know whether it's safe to call some method).

And can you remind me what the actual state is here that needs to be set up / torn down? Do we just need reliable shutdown of the ShufflerAsync?

@rjzamora (Member, Author):

We probably don't need this anymore. The original shuffler context was dynamically releasing the shuffle id at runtime. This was working fine for single-GPU execution, but is no longer reliable.

@rjzamora (Member, Author):

Ok, one clarification: The shutdown method does not exist for ShufflerAsync (only for the synchronous Shuffler). It's not clear if we really gain much by explicitly cleaning up in an __exit__ definition, but I also don't think it "hurts". Let me know if I should change anything here.
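
For context, a sketch of the simplified exit path (the constructor signature is an assumption; only the __exit__ body mirrors the diff above):

```python
class ShuffleContext:
    """Hypothetical sketch -- only __exit__ reflects the actual change."""

    def __init__(self, shuffler: object) -> None:
        # `shuffler` is assumed to be a rapidsmpf ShufflerAsync instance.
        self.shuffler = shuffler

    def __enter__(self) -> "ShuffleContext":
        return self

    def __exit__(self, *exc: object) -> None:
        """Exit the ShuffleContext manager."""
        # ShufflerAsync has no shutdown() (only the synchronous Shuffler
        # does), so we just drop the reference and let it clean itself up.
        # Shuffle IDs are now reserved/released up front by ReserveOpIDs.
        del self.shuffler
```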
