
Conversation

@clefourrier (Member)

No description provided.

@HuggingFaceDocBuilderDev (Collaborator)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lewtun (Member) commented Mar 31, 2025

If you're looking for a good model / benchmark to test with, I suggest trying AIME24 with a few of the DeepSeek distilled models:

[Screenshot: AIME24 results for the DeepSeek distilled models]

Since AIME24 is just 30 prompts, seeing the effect of math_pass_at_1 with n=4,16,32,64 would be quite interesting. @NathanHB probably has the lighteval commands at hand to run this natively, but it should be something close to what we do in Open R1: https://github.com/huggingface/open-r1/tree/main?tab=readme-ov-file#evaluating-models

Note: the reason I mention math_pass_at_1 is that pass@1 is one of the most popular metrics used to evaluate reasoning models; math_pass_at_k for k > 1 is used more frequently for base models, AFAIK.

@clefourrier (Member Author)

Perfect, ty! Do you have an order of magnitude for the time it would normally take? (I think my run is currently stuck: I'm using the 7B distill Qwen and it's been running for 30 min on 8×H100, using pass@64 though, and the GPUs are at full usage ^^")

@lewtun (Member) commented Mar 31, 2025

> Perfect, ty! Do you have an order of magnitude for the time it would normally take? (I think my run is currently stuck: I'm using the 7B distill Qwen and it's been running for 30 min on 8×H100, using pass@64 though, and the GPUs are at full usage ^^")

I think O(hour) is about right for these long CoT models, especially since you're generating 30 × 64 = 1,920 completions, typically with 32k tokens per completion. Are you running with DDP? That would significantly speed up the throughput.

@clefourrier (Member Author)

I was using accelerate, moved to vLLM ^^ - I hit some issues with vLLM and DP (the Ray workers were not gathering properly), so I'm using TP now.

@clefourrier (Member Author) commented Mar 31, 2025

I'm getting:

|lighteval:aime24:0|      1|math_pass@1:4_samples |0.5250|±  |0.0733|
|                  |       |math_pass@1:8_samples |0.5583|±  |0.0708|
|                  |       |math_pass@1:16_samples|0.5354|±  |0.0706|
|                  |       |math_pass@1:32_samples|0.5115|±  |0.0699|
|                  |       |math_pass@1:64_samples|0.5167|±  |0.0680|
|                  |       |extractive_match      |0.5333|±  |0.0926|

You do want the pass@ from this paper, right? https://arxiv.org/pdf/2107.03374
Not an "any pass at n" like in the original Kulal et al. paper?

Do you know what cons@64 is? Is it a maj@64?

@clefourrier (Member Author)

And for DeepSeek-R1-Distill-Qwen-14B:

|lighteval:aime24:0|      1|math_pass@1:4_samples |0.6833|±  |0.0748|
|                  |       |math_pass@1:8_samples |0.6792|±  |0.0701|
|                  |       |math_pass@1:16_samples|0.6771|±  |0.0693|
|                  |       |math_pass@1:32_samples|0.6844|±  |0.0670|
|                  |       |math_pass@1:64_samples|0.6708|±  |0.0666|
|                  |       |extractive_match      |0.8000|±  |0.0743|

@lewtun (Member) commented Apr 1, 2025

> You do want the pass@ from this paper, right? https://arxiv.org/pdf/2107.03374

Yes that's right!

> Do you know what cons@64 is? Is it a maj@64?

Yes, it's maj@64 because OpenAI decided to rebrand it to align with the "self-consistency decoding" jargon. I suggest we stick with maj@k since it's simpler to understand.
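For context, a minimal sketch of what maj@k computes per prompt (illustrative names, not lighteval's actual API): take the most frequent extracted answer among the k samples and score it against the reference; averaging this over prompts gives maj@k.

```python
from collections import Counter

def maj_at_k(extracted_answers: list[str], reference: str) -> float:
    """Majority voting (maj@k, a.k.a. cons@k) for a single prompt:
    the prediction is the most frequent answer among the k samples."""
    prediction, _ = Counter(extracted_answers).most_common(1)[0]
    return float(prediction == reference)
```

Note that ties are broken by insertion order here; a real implementation would want a deterministic tie-break.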

@clefourrier (Member Author)

Ok, then I'll do a couple more tests but I think we're good :)

@lewtun (Member) commented Apr 1, 2025

Great results with the DeepSeek models! Just to check: are you using temperature=0.6 and top_p=0.95? That's what DeepSeek used in their report.

Also, do you know what the variance of the different math_pass@1 values is across different seeds? I'm mostly wondering whether it's safe to use math_pass@1:4_samples (cheapest) or whether more samples are needed to avoid large variance across repeated runs.

@clefourrier (Member Author)

Yep, I used the parameters you linked in the command! :)
For the variance, I could run it from the details, I think; let me check. (@NathanHB we're good to review)
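A minimal sketch of how this could be estimated from the stored details (hypothetical helper, not lighteval's code; assumes a (num_prompts, 64) 0/1 correctness matrix has been loaded from the details files): repeatedly subsample n of the 64 completions per prompt and look at the spread of the resulting pass@1:n scores.

```python
import numpy as np

def pass1_subsample_std(correct: np.ndarray, n_sub: int,
                        n_boot: int = 1000, seed: int = 0) -> float:
    """Approximate run-to-run std of math_pass@1:{n_sub}_samples by
    repeatedly drawing n_sub of the 64 stored completions per prompt.
    `correct` is a (num_prompts, 64) array of 0/1 correctness."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.permutation(correct.shape[1])[:n_sub]
        # pass@1:n_sub is the mean per-prompt fraction correct, c/n
        scores.append(correct[:, idx].mean())
    return float(np.std(scores))
```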

@clefourrier clefourrier requested a review from NathanHB April 3, 2025 09:56
@NathanHB NathanHB merged commit fcb784d into main Apr 4, 2025
4 checks passed
@CurryxIaoHu

Hi, I want to ask a simple question: what does "math_pass@1:4_samples" mean? Does it mean that for a question, we sample 4 times and then take the average accuracy of those four answers?

@clefourrier (Member Author)

Not exactly, we're following Eq. (1) of this paper: https://arxiv.org/pdf/2107.03374

- What's usually called "pass@k": "Kulal et al. (2019) evaluate functional correctness using the pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported."
- What we use for pass@k:n_samples: "However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator (1)"
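For the record, the estimator is easy to compute in a numerically stable way; this is essentially the reference snippet from the Codex paper's appendix (a minimal sketch, not lighteval's exact code):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021, Eq. 1):
    1 - C(n-c, k) / C(n, k), computed as a running product
    to avoid huge binomial coefficients."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```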

@CurryxIaoHu

> Not exactly, we're following Eq. (1) of this paper: https://arxiv.org/pdf/2107.03374
>
> - What's usually called "pass@k": "Kulal et al. (2019) evaluate functional correctness using the pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported."
> - What we use for pass@k:n_samples: "However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator (1)"

Thanks for your answer! I want to check whether my understanding is correct: so for pass@1:n_samples, assuming there are c correct answers among the n samples, using Eq. (1) from the paper, pass@1:n = c/n?

@clefourrier (Member Author)

How do you get to c/n?

@lewtun (Member) commented May 7, 2025

@CurryxIaoHu is correct: in the special case where $k=1$, Eq. (1) from the Codex paper reduces to:

$$ \text{pass@1} = \mathbb{E} \left[ 1 - \frac{n-c}{n} \right] = \mathbb{E}\left[\frac{c}{n}\right] = \frac{1}{n} \sum_{i=1}^{n} p_i$$

where $p_i$ is the correctness of the $i$-th response.
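As a quick numeric check with the pass_at_k sketch above (illustrative numbers):

```python
>>> pass_at_k(n=64, c=33, k=1)  # = 1 - C(31, 1)/C(64, 1) = 33/64
0.515625
>>> 33 / 64
0.515625
```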

@CurryxIaoHu

Thank you very much for your quick response!!

@clefourrier (Member Author)

Tysm for checking @lewtun, I missed one of the lower terms when decomposing the binomial coefficients into factorials ^^''

hynky1999 pushed a commit that referenced this pull request May 22, 2025
NathanHB pushed a commit that referenced this pull request Sep 19, 2025