
Conversation

@naromero77amd (Contributor) commented Feb 20, 2025:

This PR fixes a bug for users who run TunableOp on multi-GPU vLLM workloads; these are mostly ROCm users.

Currently, when worker processes are enabled, TunableOp only writes the results for GPU 0 from the main process. Normally, TunableOp results are flushed to the file system when the TunableOp C++ destructor is called, but the worker processes appear to be destroyed before that destructor runs. As a result, the main process ends up with TunableOp results, while the worker processes driving the other GPUs end up with empty TunableOp results files. This PR forces a flush of the TunableOp results before the worker processes are terminated.
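A minimal sketch of the idea, assuming the flush is added to the worker's shutdown path (`shutdown_worker` is an illustrative name, not vLLM's actual hook; the `torch.cuda.tunable` calls are the PyTorch TunableOp API):

```python
import torch.cuda.tunable as tunable

def shutdown_worker():
    # Flush tuned results to disk before this worker process is torn
    # down; otherwise only the main process (GPU 0) ever writes its
    # TunableOp results file.
    if (tunable.is_enabled() and tunable.tuning_is_enabled()
            and not tunable.record_untuned_is_enabled()):
        # Note: record_untuned_is_enabled() requires PyTorch >= 2.6.
        tunable.write_file()
```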

cc: @hongxiayang

@github-actions commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@hongxiayang added the rocm (Related to AMD ROCm) label on Feb 20, 2025.
Collaborator commented on lines 256 to 260:

Suggested change:

```diff
-if ((tunable.is_enabled() is True) and
-        (tunable.tuning_is_enabled() is True) and
-        (tunable.record_untuned_is_enabled() is False)):
-    tunable.write_file()
+if tunable.is_enabled() and tunable.tuning_is_enabled() and
+        not tunable.record_untuned_is_enabled():
+    tunable.write_file()
```

Collaborator commented on lines 256 to 257:
Suggested change:

```diff
-if (tunable.is_enabled() and tunable.tuning_is_enabled() and
-        not tunable.record_untuned_is_enabled()):
+if tunable.is_enabled() and tunable.tuning_is_enabled() and
+        not tunable.record_untuned_is_enabled():
```

@naromero77amd (Contributor, Author) replied:

Actually, I tried it without the outermost parentheses and got a syntax error. Maybe that's because it is split across two lines?

Collaborator replied:

Yes, a long if statement needs a \ to break it across multiple lines, or brackets.
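For reference, both continuation styles parse cleanly in Python (shown with the same condition as the suggestion, where tunable is torch.cuda.tunable):

```python
import torch.cuda.tunable as tunable

# Parentheses give implicit line continuation (preferred by PEP 8):
if (tunable.is_enabled() and tunable.tuning_is_enabled()
        and not tunable.record_untuned_is_enabled()):
    tunable.write_file()

# A trailing backslash also works:
if tunable.is_enabled() and tunable.tuning_is_enabled() and \
        not tunable.record_untuned_is_enabled():
    tunable.write_file()
```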

@naromero77amd (Contributor, Author) commented:

One more thing, in case it matters: tunable.record_untuned_is_enabled() is only available in PyTorch 2.6 or later. I'm not sure what the vLLM practice is for backwards compatibility.

@DarkLight1337 (Member) commented:

Please merge from main and fix the pre-commit errors.

@hongxiayang (Collaborator) commented:

> One more thing, in case it matters: tunable.record_untuned_is_enabled() is only available in PyTorch 2.6 or later. I'm not sure what the vLLM practice is for backwards compatibility.

Can you please put a comment in the code about this?
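A hedged sketch of one way to guard the call on older PyTorch (illustrative only; the merged change may simply document the version requirement in a code comment):

```python
import torch.cuda.tunable as tunable

# record_untuned_is_enabled() was added in PyTorch 2.6; fall back to
# False on older builds instead of raising AttributeError.
_record_untuned_is_enabled = getattr(tunable, "record_untuned_is_enabled",
                                     lambda: False)

if (tunable.is_enabled() and tunable.tuning_is_enabled()
        and not _record_untuned_is_enabled()):
    tunable.write_file()
```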

@naromero77amd force-pushed the fix_vllm_tunableop_tuning branch from 78882d0 to 86869af on February 24, 2025 at 21:19
Signed-off-by: Nichols A. Romero <[email protected]>
@naromero77amd force-pushed the fix_vllm_tunableop_tuning branch from 86869af to 96be44d on February 24, 2025 at 21:37
@mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Feb 24, 2025
@mgoin enabled auto-merge (squash) on February 24, 2025 at 23:27
@mgoin merged commit fa82074 into vllm-project:main on Feb 25, 2025
57 of 59 checks passed
@naromero77amd deleted the fix_vllm_tunableop_tuning branch on February 25, 2025 at 15:14
Akshat-Tripathi pushed a commit to krai/vllm that referenced this pull request Mar 3, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025