Pass At K Math #647
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
If you're looking for a good model / benchmark to test with, I suggest trying AIME24 with a few of the DeepSeek distilled models: [...] Since AIME24 is just 30 prompts, seeing the effect of [...] Note: the reason I mention [...]
Perfect, ty! Do you have an order of magnitude for the time it would normally take? (I think my run is currently stuck: I'm using the 7B distill Qwen and it's been running for 30 min on 8×H100 - using pass at 64 though, and the GPUs are at full usage ^^")
I think O(hour) is about right for these long CoT models, especially since you're generating 30 x 64 = 1,920 completions, typically with 32k tokens per completion. Are you running with DDP? That will significantly speed up the throughput.
I was using accelerate, moved to vLLM ^^ - there were some issues with vLLM and DP (Ray workers were not gathering properly), so I'm using TP now.
I'm getting [...] You do want the pass@ from this paper, right? https://arxiv.org/pdf/2107.03374 Do you know what the cons@64 is? Is it a maj@64?
And for the DeepSeek-R1-Distill-Qwen-14B: [...]
Yes that's right!
Yes, it's [...]
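(i.e. majority voting: extract the final answer from each of the 64 samples, take the most frequent one, and score it against the reference. A rough sketch with a hypothetical helper, just to illustrate the idea rather than the actual implementation:)

```python
from collections import Counter

def maj_at_k(extracted_answers: list[str], gold: str) -> float:
    """Majority voting over k samples (maj@k / cons@k): the most frequent
    extracted answer is scored against the gold answer."""
    majority_answer, _ = Counter(extracted_answers).most_common(1)[0]
    return float(majority_answer == gold)
```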
Ok, then I'll do a couple more tests but I think we're good :)
Great results with the DeepSeek models! Just to check: are you using [...]? Also, do you know what the variance for the different [...] is?
Yep I used the parameters you linked in the command! :)
Hi, I want to ask a simple question: what does "math_pass@1:4_samples" mean? Does it mean that for a question, we sample 4 times and then look at the average accuracy of these four answers?
Not exactly, we're following eq. 1 of this paper (https://arxiv.org/pdf/2107.03374), i.e. what's usually called "pass@k": "Kulal et al. (2019) evaluate functional correctness using the pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported."
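For concreteness, the estimator can be sketched like this (following the numerically stable form given in that paper; not necessarily the exact code in this PR):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (eq. 1 of https://arxiv.org/pdf/2107.03374).

    n: total number of samples generated for the problem
    c: number of correct samples among them
    k: the k in pass@k
    """
    if n - c < k:
        # every possible subset of k samples contains at least one correct one
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a product to avoid huge binomials
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

The per-problem values are then averaged over all problems in the benchmark.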
Thanks for your answer! I want to make sure my understanding is correct: so for pass@1 with n samples, assuming there are c correct answers among the n samples, using eq. 1 from the paper, pass@1:n = c/n?
How do you get to c/n?
@CurryxIaoHu is correct: in the special case where k = 1, eq. 1 reduces to c/n, where c is the number of correct samples and n is the total number of samples per problem.
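Writing it out:

$$
\text{pass@}k=\mathbb{E}_{\text{problems}}\!\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right],
\qquad
\text{pass@}1=1-\frac{\binom{n-c}{1}}{\binom{n}{1}}=1-\frac{n-c}{n}=\frac{c}{n}.
$$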
Thank you very much for your quick response!!
Tysm for checking @lewtun, I missed one of the lower terms when decomposing the binomial coefficients into factorials ^^''
* test 1
* change task
* change
* fix names