
Conversation

@naromero77amd (Contributor) commented Feb 20, 2025:

This PR fixes a bug for users who run TunableOp on multi-GPU vLLM workloads; these are mostly ROCm users.

Currently, when worker processes are enabled, TunableOp only writes the results for GPU 0 from the main process. Normally, TunableOp results are flushed to the file system when the TunableOp C++ destructor is called, but the worker processes appear to be destroyed before that destructor runs. As a result, the main process ends up with TunableOp results, while the worker processes driving the other GPUs end up with empty TunableOp results files. This PR forces a flush of the TunableOp results before the worker processes are terminated.
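A minimal sketch of the idea, assuming the flush is added to the worker's shutdown path (`shutdown_worker` is an illustrative name, not vLLM's actual hook; the `torch.cuda.tunable` calls are the PyTorch TunableOp API):

```python
import torch.cuda.tunable as tunable

def shutdown_worker():
    # Flush tuned results to disk before this worker process is torn
    # down; otherwise only the main process (GPU 0) ever writes its
    # TunableOp results file.
    if (tunable.is_enabled() and tunable.tuning_is_enabled()
            and not tunable.record_untuned_is_enabled()):
        # Note: record_untuned_is_enabled() requires PyTorch >= 2.6.
        tunable.write_file()
```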

cc: @hongxiayang

@github-actions commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@hongxiayang added the rocm (Related to AMD ROCm) label on Feb 20, 2025.
Collaborator commented on lines 256 to 260:

Suggested change:

```diff
-if ((tunable.is_enabled() is True) and
-        (tunable.tuning_is_enabled() is True) and
-        (tunable.record_untuned_is_enabled() is False)):
-    tunable.write_file()
+if tunable.is_enabled() and tunable.tuning_is_enabled() and
+        not tunable.record_untuned_is_enabled():
+    tunable.write_file()
```

Collaborator commented on lines 256 to 257:
Suggested change:

```diff
-if (tunable.is_enabled() and tunable.tuning_is_enabled() and
-        not tunable.record_untuned_is_enabled()):
+if tunable.is_enabled() and tunable.tuning_is_enabled() and
+        not tunable.record_untuned_is_enabled():
```

@naromero77amd (Contributor, Author) replied:

Actually, I tried it without the outermost parentheses and got a syntax error. Maybe that's because it is split across two lines?

Collaborator replied:

Yes, a long if statement needs a \ to break it across multiple lines, or brackets.
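For reference, both continuation styles parse cleanly in Python (shown with the same condition as the suggestion, where tunable is torch.cuda.tunable):

```python
import torch.cuda.tunable as tunable

# Parentheses give implicit line continuation (preferred by PEP 8):
if (tunable.is_enabled() and tunable.tuning_is_enabled()
        and not tunable.record_untuned_is_enabled()):
    tunable.write_file()

# A trailing backslash also works:
if tunable.is_enabled() and tunable.tuning_is_enabled() and \
        not tunable.record_untuned_is_enabled():
    tunable.write_file()
```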

@naromero77amd (Contributor, Author) commented:

One more thing, in case it matters: tunable.record_untuned_is_enabled() is only available in PyTorch 2.6 or later. I'm not sure what the vLLM practice is for backwards compatibility.

@DarkLight1337 (Member) commented:

Please merge from main and fix the pre-commit errors.

@hongxiayang (Collaborator) commented:

> One more thing, in case it matters: tunable.record_untuned_is_enabled() is only available in PyTorch 2.6 or later. I'm not sure what the vLLM practice is for backwards compatibility.

Can you please put a comment in the code about this?
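A hedged sketch of one way to guard the call on older PyTorch (illustrative only; the merged change may simply document the version requirement in a code comment):

```python
import torch.cuda.tunable as tunable

# record_untuned_is_enabled() was added in PyTorch 2.6; fall back to
# False on older builds instead of raising AttributeError.
_record_untuned_is_enabled = getattr(tunable, "record_untuned_is_enabled",
                                     lambda: False)

if (tunable.is_enabled() and tunable.tuning_is_enabled()
        and not _record_untuned_is_enabled()):
    tunable.write_file()
```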

@naromero77amd force-pushed the fix_vllm_tunableop_tuning branch from 78882d0 to 86869af on February 24, 2025 at 21:19
Signed-off-by: Nichols A. Romero <[email protected]>
@naromero77amd force-pushed the fix_vllm_tunableop_tuning branch from 86869af to 96be44d on February 24, 2025 at 21:37
@mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Feb 24, 2025
@mgoin enabled auto-merge (squash) on February 24, 2025 at 23:27
@mgoin merged commit fa82074 into vllm-project:main on Feb 25, 2025
57 of 59 checks passed
@naromero77amd deleted the fix_vllm_tunableop_tuning branch on February 25, 2025 at 15:14
Akshat-Tripathi pushed a commit to krai/vllm that referenced this pull request Mar 3, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025