[Core] Bookkeeping optimization: Batchify updates to 1D numpy arrays (e.g. num_tokens, num_tokens_no_spec) #25801
Conversation
CC @njhill @WoosukKwon for awareness. I'm wondering, once the async scheduler lands, whether the bookkeeping costs would also be hidden in a separate process.
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/v1/worker/gpu_model_runner.py (outdated)
```python
# - self.input_batch.token_ids_cpu[req_idx,
#       start_idx:end_idx] = sampled_ids
base_idx = req_idx * token_ids_cpu_column_cnt
token_ids_cpu_flatten_indices.extend(
```
I am a bit concerned about the case where `end_idx - start_idx` is a large number.
Ah, I think it's a valid concern. IIUC, we only assign output tokens here, so most of the time the range length is 1, and at most `num_spec_tokens`.
But to be honest, I also need to confirm that prompt tokens are not appended here (if they are, this could be huge).
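For readers skimming the thread, here is a minimal, self-contained sketch of the flatten-index pattern from the diff above. The array shape, the update values, and the `flatten_values` list are illustrative; only `token_ids_cpu`, `token_ids_cpu_column_cnt`, `base_idx`, and `token_ids_cpu_flatten_indices` are names from the PR:

```python
import numpy as np

# Illustrative shapes: (max_num_reqs, max_model_len).
token_ids_cpu = np.zeros((4, 8), dtype=np.int64)
token_ids_cpu_column_cnt = token_ids_cpu.shape[1]

# (req_idx, start_idx, sampled_ids) per request: usually 1 new token,
# at most num_spec_tokens with speculative decoding.
updates = [(0, 3, [11]), (2, 5, [21, 22, 23])]

token_ids_cpu_flatten_indices: list[int] = []
flatten_values: list[int] = []
for req_idx, start_idx, sampled_ids in updates:
    base_idx = req_idx * token_ids_cpu_column_cnt
    # extend() is O(end_idx - start_idx) per request -- the concern raised
    # above: this stays cheap only while each per-request range is small.
    token_ids_cpu_flatten_indices.extend(
        range(base_idx + start_idx, base_idx + start_idx + len(sampled_ids)))
    flatten_values.extend(sampled_ids)

# One scatter into the flattened view replaces N per-request slice writes.
token_ids_cpu.ravel()[token_ids_cpu_flatten_indices] = flatten_values
```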
houseroad left a comment:
We could write a custom CPU op to take care of this case, e.g. `batch_assign_2d(indices)`, where `indices` is a 1D tensor whose entries come in groups of three: the dim-0 index, and the dim-1 start and end.
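For concreteness, a plain-numpy sketch of what such an op's semantics might be. `batch_assign_2d` does not exist in vLLM today; this signature (with an added `values` payload) is hypothetical, and a real implementation would presumably live in C++ to avoid the Python loop:

```python
import numpy as np

def batch_assign_2d(dst: np.ndarray, indices: np.ndarray,
                    values: np.ndarray) -> None:
    """Hypothetical semantics: `indices` is a flat 1D array in groups of
    3 (row, start, end); `values` holds the concatenated payloads in the
    same order."""
    offset = 0
    for row, start, end in indices.reshape(-1, 3):
        n = end - start
        dst[row, start:end] = values[offset:offset + n]
        offset += n

dst = np.zeros((4, 8), dtype=np.int64)
indices = np.array([0, 3, 4,    # row 0, cols [3, 4)
                    2, 5, 8],   # row 2, cols [5, 8)
                   dtype=np.int64)
values = np.array([11, 21, 22, 23], dtype=np.int64)
batch_assign_2d(dst, indices, values)
```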
Per offline discussion, we will split the PR a bit:
After updating the benchmark script (so that it better reflects the actual bookkeeping usage), the change is shown to be a clear win. PTAL @houseroad
Gentle nudge @houseroad for the review :P
```python
# Collect updates in the for loop and apply a batch update at the end
# to vectorize updates to tensors and numpy arrays.
start_indices = self.input_batch.num_tokens_no_spec.tolist()
```
No need to convert to a list for `start_indices`?
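For context on why the conversion can be deliberate: reading elements one at a time from an ndarray boxes each value into a numpy scalar, so when the values are consumed in a Python loop, a single `.tolist()` up front is typically cheaper. A toy illustration (sizes made up):

```python
import numpy as np

num_tokens_no_spec = np.arange(1024, dtype=np.int32)

# Per-element ndarray access boxes each value into a numpy scalar:
total = 0
for i in range(1024):
    total += int(num_tokens_no_spec[i])

# One bulk conversion, then cheap plain-int access inside the loop:
start_indices = num_tokens_no_spec.tolist()
total = 0
for i in range(1024):
    total += start_indices[i]
```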
```python
# to vectorize updates to tensors and numpy arrays.
start_indices = self.input_batch.num_tokens_no_spec.tolist()
# Indices and values to update for num_tokens and num_tokens_no_spec
num_tokens_indices_to_update: list[int] = []
```
Why not use a numpy array here? It may be lighter.
We don't know the total number of elements to update ahead of time. And even if we did, updating numpy array elements one at a time within a for loop might not be efficient.
Please let me know if I missed anything.
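A sketch of the trade-off being described (the update values are made up, and `indices`/`values`/`arr` are illustrative names, not the PR's):

```python
import numpy as np

# Made-up updates: in the real loop the count is unknown until the end.
some_updates = [(3, 17), (9, 5), (512, 42)]

# list.append is amortized O(1); growing an ndarray instead would mean
# reallocating and copying on each step.
indices: list[int] = []
values: list[int] = []
for req_idx, val in some_updates:
    indices.append(req_idx)
    values.append(val)

# Per-element writes like arr[i] = v inside the loop each pay numpy
# dispatch overhead; a single fancy-indexed assignment pays it once.
arr = np.zeros(1024, dtype=np.int32)
arr[indices] = values
```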
Signed-off-by: Jialin Ouyang <[email protected]>

Purpose
Currently, GPUModelRunner._bookkeeping_sync interleaves numpy updates with Python logic, which is inefficient: scattered tensor and numpy array updates consume a significant amount of time.
In this change, we vectorize the tensor and numpy updates.
Update
We only batchify the 1D array updates in this PR; the 2D array/tensor updates will be handled in a follow-up.
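Schematically, the 1D change looks like this (a simplified stand-in, not the actual `_bookkeeping_sync` code; sizes and updates are made up):

```python
import numpy as np

num_tokens = np.zeros(1024, dtype=np.int32)
num_tokens_no_spec = np.zeros(1024, dtype=np.int32)
updates = [(3, 17), (9, 5), (512, 42)]  # (req_idx, new_len), made up

# Before: scattered single-element writes interleaved with Python logic.
for req_idx, n in updates:
    num_tokens[req_idx] = n
    num_tokens_no_spec[req_idx] = n

# After: collect once, then one fancy-indexed assignment per array.
idxs = [req_idx for req_idx, _ in updates]
vals = [n for _, n in updates]
num_tokens[idxs] = vals
num_tokens_no_spec[idxs] = vals
```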
Test Plan & Test Result
Benchmark
We introduced a benchmark script to measure the win. It shows a clear performance boost when all rows are updated (5x throughput), but a regression when fewer than 10% of rows are updated. In real-world scenarios, we believe most of the rows are updated on each step, so this should still be a consistent improvement to the system.
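The benchmark script itself lives in the PR; a minimal stand-in that reproduces the shape of the measurement might look like the following (array size, row counts, and iteration count are assumptions):

```python
import timeit
import numpy as np

N = 1024
UPDATED = N  # set to N // 16 to probe the "<10% of rows" regression case

arr = np.zeros(N, dtype=np.int32)
rows = np.random.default_rng(0).choice(N, size=UPDATED, replace=False).tolist()
vals = list(range(UPDATED))

def scattered() -> None:
    # Baseline: one numpy write per updated row.
    for i, v in zip(rows, vals):
        arr[i] = v

def batched() -> None:
    # PR-style: accumulate into lists (cost included), then one assignment.
    idx_list, val_list = [], []
    for i, v in zip(rows, vals):
        idx_list.append(i)
        val_list.append(v)
    arr[idx_list] = val_list

print("scattered:", timeit.timeit(scattered, number=1000))
print("batched:  ", timeit.timeit(batched, number=1000))
```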
Correctness
Optimization

~3x speedup with the change per trace
Per gptoss AIME 2025 eval runs