[EVAL] Long Horizon Execution #1074

akshathmangudi · 2025-11-21T10:57:49Z

I screwed up my previous git clone, so I had to redo the changes 😅

Description:
Approach described within #1056.

Tasks:

Initial scaffolding of /tasks/tasks/long_horizon_execution.py
Implement a custom scorer to parse <answer> tags.
Complete implementation of /tasks/tasks/long_horizon_execution.py
Evaluation and Testing
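
For the custom scorer item above, a minimal sketch of what parsing the <answer> tags could look like. This is not the code in this PR: the names `parse_answer` and `exact_match` are hypothetical, and treating a missing or malformed answer as a score of 0 is an assumption.

```python
import re
from typing import List, Optional

# Non-greedy match so only the first <answer>...</answer> pair is captured.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def parse_answer(response: str) -> Optional[List[int]]:
    """Return the comma-separated integers inside <answer> tags, or None."""
    match = ANSWER_RE.search(response)
    if match is None:
        return None  # the model did not produce answer tags
    parts = [p.strip() for p in match.group(1).split(",")]
    try:
        return [int(p) for p in parts if p]
    except ValueError:
        return None  # a non-integer value appeared inside the tags


def exact_match(response: str, expected: List[int]) -> float:
    """Hypothetical scoring rule: 1.0 only if every cumulative sum matches."""
    return 1.0 if parse_answer(response) == expected else 0.0
```

Returning None for unparseable output keeps formatting failures distinguishable from wrong sums, which may help when inspecting poor results.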

STATUS: ready for review.

Current behavior:

Running lighteval tasks inspect long_horizon_execution produces the following output:

... more lines
           "'basic', 'alive', 'cream', 'dress', 'black', 'brown', 'drama', "
           "'black', 'audio', 'brown', 'album', 'cover', 'avoid', 'aware', "
           "'event', 'dream', 'clean', 'clock', 'apple', 'above', 'close', "
           "'begin', 'allow', 'album', 'draft', 'brain', 'civil', 'faith', "
           "'death', 'coach', 'below', 'doubt', 'aware', 'cover', 'final', "
           "'allow', 'avoid', 'ahead', 'cross', 'child', 'cream', 'error', "
           "'break', 'brief', 'clock', 'final', 'dance', 'award', 'every', "
           "'chief', 'could', 'dream', 'begin', 'burst', 'audio', 'album', "
           "'cross', 'doubt', 'blood', 'child', 'brand', 'brand', 'extra', "
           "'broad', 'cloud', 'check', 'after', 'chart', 'basic', 'child', "
           "'coach', 'chair', 'faith', 'earth', 'audio', 'basic', 'field', "
           "'cloud', 'draft', 'apply', 'court', 'black', 'ahead', 'burst', "
           "'crowd', 'depth', 'enemy', 'drink', 'first', 'could', 'false', "
           "'could', 'blame', 'first', 'album', 'crowd', 'first', 'broad', "
           "'extra', 'clock', 'chart', 'fiber', 'board', 'earth', 'being', "
           "'alive', 'chart', 'avoid', 'dress', 'cloud', 'clean', 'avoid', "
           "'crash', 'clean', 'arise', 'death', 'brand', 'error']\n"
           '\n'
           'Your task: Calculate the cumulative sum after each key. The first '
           'sum is just the value of the first key. The second sum is the '
           'first value plus the second value, and so on.\n'
           '\n'
           'IMPORTANT:\n'
           '- Output your answer as a single line with comma-separated values '
           'inside <answer></answer> tags\n'
           '- Do not include any other text outside the answer tags\n'
           '- Format: <answer>value1,value2,value3,...</answer>\n'
           '- Example: If the cumulative sums are [5, 8, 12], output: '
           '<answer>5,8,12</answer>\n'
           '\n'
           'Your answer:',
  'sampling_methods': [],
  'specific': None,
  'stop_sequences': (),
  'task_name': 'long_horizon_execution',
  'unconditioned_query': None,
  'use_logits': False}
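
To make the expected target for a prompt like this concrete, here is a small sketch of how the reference cumulative sums could be computed. The word-to-value dictionary is part of the elided output, so the values below are made up for illustration, and the helper names `cumulative_sums` and `format_answer` are hypothetical.

```python
from itertools import accumulate
from typing import Dict, List


def cumulative_sums(dictionary: Dict[str, int], keys: List[str]) -> List[int]:
    """Running total of dictionary[key] over the key sequence, in order."""
    return list(accumulate(dictionary[key] for key in keys))


def format_answer(sums: List[int]) -> str:
    """Render sums in the <answer>value1,value2,...</answer> format."""
    return "<answer>" + ",".join(str(s) for s in sums) + "</answer>"


# Illustrative values only; the real dictionary comes from the dataset.
example_dictionary = {"apple": 5, "black": 3, "cream": 4}
example_keys = ["apple", "black", "cream"]

sums = cumulative_sums(example_dictionary, example_keys)
print(format_answer(sums))  # <answer>5,8,12</answer>
```

Because each sum depends on all previous ones, a single early mistake propagates to every later position, so an all-or-nothing score over a long key list can look very low even when most individual lookups are right.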

akshathmangudi · 2025-11-21T10:59:50Z

cc: @NathanHB

NathanHB · 2025-11-21T12:42:28Z

Looking good! I will run it locally and review it today or at the start of next week :)
Can you share a Hugging Face Space with the samples, as described here, to make it easier to verify? 🤗

akshathmangudi · 2025-11-22T12:17:27Z

I ran the benchmark on HF Inference's gpt-4o, but a lot of the results I am seeing are quite poor. Is this expected, or is something wrong with the prompting that I haven't looked at yet?

https://huggingface.co/spaces/akshathmangudi/lhe-gpt4o-single

NathanHB

Hey! Thanks for the hard work on this. I'm testing it locally right now. I have some small nits, but it's looking almost ready!

src/lighteval/tasks/tasks/long_horizon_execution/__init__.py

src/lighteval/tasks/tasks/long_horizon_execution/single_turn.py

src/lighteval/tasks/tasks/long_horizon_execution/constants.py

src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py

NathanHB

Tested on single turn; it works great with the few nits I added above. However, I cannot seem to get multi-turn working. Can you ping me when it's ready?

akshathmangudi · 2025-11-25T12:50:59Z

@NathanHB it should be working now. I've added a link below to a Space that tests both single-turn and multi-turn.

https://huggingface.co/spaces/akshathmangudi/lhe-gpt

ready for review

cef0b0f

akshathmangudi mentioned this pull request Nov 21, 2025

[EVAL] Long Horizon Execution #1072

Closed

4 tasks

akshathmangudi marked this pull request as ready for review November 21, 2025 10:59

Merge branch 'main' into akshath/issue-1056-v2

fdc9288

akshathmangudi added 2 commits November 22, 2025 17:47

some fixes

2c0ceae

Merge branch 'akshath/issue-1056-v2' of github.com:akshathmangudi/lig…

3d8ac1b

…hteval into akshath/issue-1056-v2