
[Question] Speculative Decoding: Sampling from the draft probs vector #9877

@keyboardAnt

Description


The speculative sampling kernel introduced in pull request #3373 (`sgl-kernel/src/sgl-kernel/csrc/speculative_sampling.cuh`) appears to be incomplete. Comments such as `// FIXME: leverage draft probs` indicate that the implementation does not yet sample from the drafter's probability vector and instead uses zeros as placeholder values.
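For context, here is a minimal Python sketch of the standard speculative sampling accept/reject step that uses the draft probabilities. It is not the CUDA kernel from #3373; the function name `speculative_accept` and its signature are hypothetical, and `p` and `q` are assumed to be plain lists holding the target and draft distributions over the same vocabulary.

```python
import random


def speculative_accept(p, q, draft_token, rng=random):
    """One accept/reject step of standard speculative sampling (illustrative only).

    p: target distribution, q: draft distribution, draft_token: index drawn from q.
    Returns (token, accepted).
    """
    p_x, q_x = p[draft_token], q[draft_token]
    if q_x <= 0.0:
        # A token with q(x) = 0 cannot have been sampled from q.
        raise ValueError("draft_token has zero probability under q")

    # Accept the draft token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_x / q_x):
        return draft_token, True

    # On rejection, resample from the residual max(p - q, 0), renormalized
    # (rng.choices normalizes the weights itself).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    weights = residual if sum(residual) > 0.0 else p  # fallback if p == q
    token = rng.choices(range(len(p)), weights=weights, k=1)[0]
    return token, False
```

Under this rule the emitted token is distributed according to `p`, which is the property that is lost when the draft probabilities are replaced by placeholders.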

This gap directly undermines the effectiveness of speculative decoding. Unless it is fixed, the acceptance rate is significantly reduced:

  • Desired behavior (sampling):
    Acceptance rate = ∑ₓ min(p(x), q(x))
    where p(x) is the target model’s distribution and q(x) is the drafter’s.

  • Current behavior (greedy/zeros):
    Acceptance rate = p(x)
    where x is the token proposed by the drafter.
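The two acceptance rates above can be compared numerically with a small Python sketch. The distributions `p` and `q` below are made-up toy values, and the helper names are hypothetical; the "current behavior" case assumes the drafter proposes its most likely token.

```python
def acceptance_with_sampling(p, q):
    """Expected acceptance rate when draft probs are used: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def acceptance_of_token(p, draft_token):
    """Acceptance rate when the draft token is treated as certain: p(x)."""
    return p[draft_token]


# Toy distributions over a 3-token vocabulary (illustrative values only).
p = [0.5, 0.3, 0.2]  # target model
q = [0.6, 0.3, 0.1]  # drafter

# Assumption: a greedy drafter proposes argmax q.
greedy_token = max(range(len(q)), key=q.__getitem__)

print("sampling:", acceptance_with_sampling(p, q))
print("greedy/zeros:", acceptance_of_token(p, greedy_token))
```

For these toy values the sampling-based rate is about 0.9, while the greedy/zeros rate is 0.5. The size of the gap depends on how closely the drafter matches the target.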


Questions & Action Items

  1. Is the current kernel implementation intended to sample from the drafter's probability vector?
  2. If not, is there a timeline or PR planned to update the kernel to properly incorporate sampling from draft probabilities?

This issue blocks #9539, and may also block #8581 and #8391.
