[WIP] Large scale rejection sampling with VLLM #18
Not intended to merge.

This is a POC using DeepSeek R1 (70B distillation); for some reason I'm running into connectivity issues with Qwen for large-scale rejection sampling.
If you squint, it's a form of RL: you have a policy (the generated kernels) with a tradeoff between explore (generate completely random samples) and exploit (generate using the best existing samples). It's not gradient-based RL, but the infrastructure for that tends to be more complex, whereas with this approach you can rely purely on inference engines. A sketch of the loop is below.
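A minimal sketch of that explore/exploit policy, assuming hypothetical `generate` and `benchmark` helpers that stand in for the vLLM call and the kernel harness (neither is pinned down in this PR), and assumed values for the explore fraction and pool size:

```python
import random

EPSILON = 0.3    # explore fraction (assumed value, not from this PR)
POOL_SIZE = 8    # how many best kernels to keep as exemplars (also assumed)

def step(operation, best, generate, benchmark):
    """One rejection-sampling step. `generate` and `benchmark` are
    hypothetical callables: generate(prompt) -> kernel source, and
    benchmark(src) -> latency in ms, or None if the candidate fails."""
    if not best or random.random() < EPSILON:
        # Explore: a bare prompt, no conditioning on previous samples.
        prompt = f"Write a fast CUDA kernel for {operation}."
    else:
        # Exploit: seed the prompt with the best kernels found so far.
        exemplars = "\n\n".join(src for _, src in best[:3])
        prompt = (f"The fastest {operation} kernels so far:\n\n"
                  f"{exemplars}\n\nWrite an even faster variant.")
    src = generate(prompt)
    latency = benchmark(src)
    if latency is not None:              # rejection: drop failing candidates
        best.append((latency, src))
        best.sort(key=lambda t: t[0])    # lower latency is better
        del best[POOL_SIZE:]
    return best
```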
Run with:

```
python scripts/run_vllm_prototype.py --operations relu
```

Model: `deepseek-ai/DeepSeek-R1-Distill-Llama-70B`
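For reference, a sketch of how batched sampling against vLLM's OpenAI-compatible server might look, assuming the server was launched with `vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B` on the default port (the actual client code in this PR may differ):

```python
from openai import OpenAI

# Assumes `vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B` is
# running on the default port; the api_key is ignored by vLLM.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate(prompt: str, n: int = 8) -> list[str]:
    """Sample n candidate kernels in a single batched request."""
    resp = client.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
        prompt=prompt,
        n=n,
        max_tokens=2048,
        temperature=1.0,   # keep sampling hot for diversity
    )
    return [c.text for c in resp.choices]
```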
The core idea is that we keep a pool of the best kernels found so far: each new sample either explores (a fresh generation from scratch) or exploits (a generation seeded with kernels from the pool), and executors benchmark every candidate to decide whether it enters the pool.
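On the executor side, a candidate only enters the pool if it compiles, produces correct output, and gets timed. A plausible version of that filter for the relu case, with a hypothetical `compile_kernel` helper since the actual harness isn't shown here:

```python
import torch

def evaluate(candidate_src, compile_kernel):
    """Return mean latency in ms, or None to reject the candidate.
    `compile_kernel` is a hypothetical helper that turns CUDA source
    into a Python callable and raises on compilation failure."""
    try:
        kernel = compile_kernel(candidate_src)
    except Exception:
        return None                                   # reject: doesn't compile
    x = torch.randn(1 << 20, device="cuda")
    if not torch.allclose(kernel(x), torch.relu(x)):  # reference check for relu
        return None                                   # reject: wrong output
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        kernel(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 100              # mean latency, ms
```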
An obvious problem is that utilization is insanely low for the executors, but that might be unavoidable given noisy-neighbor problems.
Prototyping on 8 GPUs, but the goal is for this to run on 1,000.
