
Conversation

@clefourrier (Member)

No description provided.

@HuggingFaceDocBuilderDev (Collaborator)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lewtun (Member) commented Mar 31, 2025

If you're looking for a good model / benchmark to test with, I suggest trying AIME24 with a few of the DeepSeek distilled models:

[Screenshot: AIME24 results for the DeepSeek distilled models]

Since AIME24 is just 30 prompts, seeing the effect of math_pass_at_1 with n=4,16,32,64 would be quite interesting. @NathanHB probably has the lighteval commands at hand to run this natively, but it should be something close to what we do in Open R1: https://github.com/huggingface/open-r1/tree/main?tab=readme-ov-file#evaluating-models

Note: the reason I mention math_pass_at_1 is that pass@1 is one of the most popular metrics used to evaluate reasoning models; math_pass_at_k for k > 1 is used more frequently for base models, AFAIK.

@clefourrier (Member Author)

Perfect, ty! Do you have an order of magnitude for the time it would normally take? (I think my run is currently stuck: I'm using the 7B distill Qwen and it's been running for 30 min on 8×H100, using pass@64 though, and the GPUs are at full usage ^^")

@lewtun (Member) commented Mar 31, 2025

> Perfect, ty! Do you have an order of magnitude for the time it would normally take? (I think my run is currently stuck: I'm using the 7B distill Qwen and it's been running for 30 min on 8×H100, using pass@64 though, and the GPUs are at full usage ^^")

I think O(hour) is about right for these long CoT models, especially since you're generating 30 × 64 = 1,920 completions, typically with 32k tokens per completion. Are you running with DDP? That would significantly speed up the throughput.

@clefourrier (Member Author)

I was using accelerate, moved to vLLM ^^ - I hit some issues with vLLM and DP (the Ray workers were not gathering properly), so I'm using TP now.

@clefourrier (Member Author) commented Mar 31, 2025

I'm getting:

|lighteval:aime24:0|      1|math_pass@1:4_samples |0.5250|±  |0.0733|
|                  |       |math_pass@1:8_samples |0.5583|±  |0.0708|
|                  |       |math_pass@1:16_samples|0.5354|±  |0.0706|
|                  |       |math_pass@1:32_samples|0.5115|±  |0.0699|
|                  |       |math_pass@1:64_samples|0.5167|±  |0.0680|
|                  |       |extractive_match      |0.5333|±  |0.0926|

You do want the pass@ from this paper, right? https://arxiv.org/pdf/2107.03374
Not an "any pass at n" like in the original Kulal et al. paper?

Do you know what cons@64 is? Is it a maj@64?

@clefourrier (Member Author)

And for DeepSeek-R1-Distill-Qwen-14B:

|lighteval:aime24:0|      1|math_pass@1:4_samples |0.6833|±  |0.0748|
|                  |       |math_pass@1:8_samples |0.6792|±  |0.0701|
|                  |       |math_pass@1:16_samples|0.6771|±  |0.0693|
|                  |       |math_pass@1:32_samples|0.6844|±  |0.0670|
|                  |       |math_pass@1:64_samples|0.6708|±  |0.0666|
|                  |       |extractive_match      |0.8000|±  |0.0743|

@lewtun (Member) commented Apr 1, 2025

> You do want the pass@ from this paper, right? https://arxiv.org/pdf/2107.03374

Yes that's right!

> Do you know what cons@64 is? Is it a maj@64?

Yes, it's maj@64 because OpenAI decided to rebrand it to align with the "self-consistency decoding" jargon. I suggest we stick with maj@k since it's simpler to understand.
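For context, a minimal sketch of what maj@k computes per prompt (illustrative names, not lighteval's actual API): take the most frequent extracted answer among the k samples and score it against the reference; averaging this over prompts gives maj@k.

```python
from collections import Counter

def maj_at_k(extracted_answers: list[str], reference: str) -> float:
    """Majority voting (maj@k, a.k.a. cons@k) for a single prompt:
    the prediction is the most frequent answer among the k samples."""
    prediction, _ = Counter(extracted_answers).most_common(1)[0]
    return float(prediction == reference)
```

Note that ties are broken by insertion order here; a real implementation would want a deterministic tie-break.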

@clefourrier (Member Author)

Ok, then I'll do a couple more tests but I think we're good :)

@lewtun (Member) commented Apr 1, 2025

Great results with the DeepSeek models! Just to check: are you using temperature=0.6 and top_p=0.95? That's what DeepSeek used in their report.

Also, do you know what the variance of the different math_pass@1 values is across different seeds? I'm mostly wondering whether it's safe to use math_pass@1:4_samples (cheapest) or whether more samples are needed to avoid large variance across repeated runs.

@clefourrier (Member Author)

Yep, I used the parameters you linked in the command! :)
For the variance, I could run it from the details, I think; let me check. (@NathanHB we're good to review)
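A minimal sketch of how this could be estimated from the stored details (hypothetical helper, not lighteval's code; assumes a (num_prompts, 64) 0/1 correctness matrix has been loaded from the details files): repeatedly subsample n of the 64 completions per prompt and look at the spread of the resulting pass@1:n scores.

```python
import numpy as np

def pass1_subsample_std(correct: np.ndarray, n_sub: int,
                        n_boot: int = 1000, seed: int = 0) -> float:
    """Approximate run-to-run std of math_pass@1:{n_sub}_samples by
    repeatedly drawing n_sub of the 64 stored completions per prompt.
    `correct` is a (num_prompts, 64) array of 0/1 correctness."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.permutation(correct.shape[1])[:n_sub]
        # pass@1:n_sub is the mean per-prompt fraction correct, c/n
        scores.append(correct[:, idx].mean())
    return float(np.std(scores))
```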

@clefourrier clefourrier requested a review from NathanHB April 3, 2025 09:56
@NathanHB NathanHB merged commit fcb784d into main Apr 4, 2025
4 checks passed
@CurryxIaoHu

Hi, I want to ask a simple question: what does "math_pass@1:4_samples" mean? Does it mean that for a question, we sample 4 times and then take the average accuracy of those four answers?

@clefourrier (Member Author)

Not exactly, we're following Eq. (1) of this paper: https://arxiv.org/pdf/2107.03374

- What's usually called "pass@k": "Kulal et al. (2019) evaluate functional correctness using the pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported."
- What we use for pass@k:n_samples: "However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator (1)"
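For the record, the estimator is easy to compute in a numerically stable way; this is essentially the reference snippet from the Codex paper's appendix (a minimal sketch, not lighteval's exact code):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021, Eq. 1):
    1 - C(n-c, k) / C(n, k), computed as a running product
    to avoid huge binomial coefficients."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```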

@CurryxIaoHu

> Not exactly, we're following Eq. (1) of this paper: https://arxiv.org/pdf/2107.03374
>
> - What's usually called "pass@k": "Kulal et al. (2019) evaluate functional correctness using the pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported."
> - What we use for pass@k:n_samples: "However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator (1)"

Thanks for your answer! I want to check whether my understanding is correct: so for pass@1:n_samples, assuming there are c correct answers among the n samples, using Eq. (1) from the paper, pass@1:n = c/n?

@clefourrier (Member Author)

How do you get to c/n?

@lewtun (Member) commented May 7, 2025

@CurryxIaoHu is correct: in the special case where $k=1$, Eq. (1) from the Codex paper reduces to:

$$ \text{pass@1} = \mathbb{E} \left[ 1 - \frac{n-c}{n} \right] = \mathbb{E}\left[\frac{c}{n}\right] = \frac{1}{n} \sum_{i=1}^{n} p_i$$

where $p_i$ is the correctness of the $i$-th response.
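As a quick numeric check with the pass_at_k sketch above (illustrative numbers):

```python
>>> pass_at_k(n=64, c=33, k=1)  # = 1 - C(31, 1)/C(64, 1) = 33/64
0.515625
>>> 33 / 64
0.515625
```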

@CurryxIaoHu

Thank you very much for your quick response!!

@clefourrier (Member Author)

Tysm for checking @lewtun, I missed one of the lower terms when decomposing the binomial coefficients into factorials ^^''

hynky1999 pushed a commit that referenced this pull request May 22, 2025
NathanHB pushed a commit that referenced this pull request Sep 19, 2025