Description
The speculative sampling kernel introduced via pull request #3373 (sgl-kernel/src/sgl-kernel/csrc/speculative_sampling.cuh) appears to be incomplete. Comments such as // FIXME: leverage draft probs indicate that the implementation does not yet sample from the drafter's probability vector and instead defaults to zeros as placeholder values.
This discrepancy directly undermines speculative decoding’s effectiveness. If not fixed, it significantly reduces the acceptance rate:
- Desired behavior (sampling):
  Acceptance rate = ∑ₓ min(p(x), q(x))
  where p(x) is the target model’s distribution and q(x) is the drafter’s.
- Current behavior (greedy/zeros):
  Acceptance rate = p(x)
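
To make the gap between the two formulas concrete, here is a minimal Python sketch, not taken from the kernel. It assumes that x in the greedy case is the drafter's deterministic pick (argmax of q), that the sampling case uses the standard accept test min(1, p(x)/q(x)), and the distributions p and q are made-up toy values. All function names are hypothetical.

```python
import random

def acceptance_with_sampling(p, q):
    """Expected acceptance rate when the draft token is sampled from q:
    sum over x of min(p(x), q(x))."""
    return sum(min(px, qx) for px, qx in zip(p, q))

def acceptance_with_greedy(p, q):
    """Acceptance rate when the drafter always proposes argmax q
    (assumption: this is what 'greedy/zeros' amounts to): p(x)."""
    x = max(range(len(q)), key=lambda i: q[i])
    return p[x]

def simulate_sampling(p, q, trials=100_000, seed=0):
    """Monte Carlo check of the sampling formula: draw x ~ q and
    accept with probability min(1, p(x) / q(x))."""
    rng = random.Random(seed)
    tokens = range(len(q))
    accepted = 0
    for _ in range(trials):
        x = rng.choices(tokens, weights=q)[0]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted += 1
    return accepted / trials

# Toy distributions over a 3-token vocabulary (illustrative values only).
p = [0.4, 0.4, 0.2]  # target model
q = [0.6, 0.3, 0.1]  # drafter

sampling_rate = acceptance_with_sampling(p, q)  # 0.4 + 0.3 + 0.1
greedy_rate = acceptance_with_greedy(p, q)      # p[0]
simulated_rate = simulate_sampling(p, q)
print(sampling_rate, greedy_rate, simulated_rate)
```

With these toy values the sampling formula gives about 0.8 while the greedy case gives 0.4, so ignoring the draft probabilities halves the per-token acceptance rate in this example. The size of the gap on real models depends on how closely q tracks p.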
References
- The kernel file with the FIXME comment: sgl-kernel/src/sgl-kernel/csrc/speculative_sampling.cuh, as introduced in pull request "support speculative decoding kernel in sgl-kernel" #3373.
  PR link
- Slack discussion highlighting that draft_probs may be set to zeros:
  eagle_utils.py#L474
Questions & Action Items
- Is the drafter's probability vector intended to be sampled from in the current kernel implementation?
- If not, is there a timeline or PR planned to update the kernel to properly incorporate sampling from draft probabilities?
This issue blocks #9539, and may also block #8581 and #8391.