
Conversation

@akshathmangudi
Contributor

@akshathmangudi akshathmangudi commented Nov 20, 2025

Approach described in #1056.

Tasks:

  • Initial scaffolding of /tasks/tasks/long_horizon_execution.py
  • Implement a custom scorer to parse <answer> tags.
  • Complete implementation of /tasks/tasks/long_horizon_execution.py
  • Evaluation and Testing

STATUS: ready for review.

Current behavior:

When we run lighteval tasks inspect long_horizon_execution, the output is shown below:

... more lines
           "'basic', 'alive', 'cream', 'dress', 'black', 'brown', 'drama', "
           "'black', 'audio', 'brown', 'album', 'cover', 'avoid', 'aware', "
           "'event', 'dream', 'clean', 'clock', 'apple', 'above', 'close', "
           "'begin', 'allow', 'album', 'draft', 'brain', 'civil', 'faith', "
           "'death', 'coach', 'below', 'doubt', 'aware', 'cover', 'final', "
           "'allow', 'avoid', 'ahead', 'cross', 'child', 'cream', 'error', "
           "'break', 'brief', 'clock', 'final', 'dance', 'award', 'every', "
           "'chief', 'could', 'dream', 'begin', 'burst', 'audio', 'album', "
           "'cross', 'doubt', 'blood', 'child', 'brand', 'brand', 'extra', "
           "'broad', 'cloud', 'check', 'after', 'chart', 'basic', 'child', "
           "'coach', 'chair', 'faith', 'earth', 'audio', 'basic', 'field', "
           "'cloud', 'draft', 'apply', 'court', 'black', 'ahead', 'burst', "
           "'crowd', 'depth', 'enemy', 'drink', 'first', 'could', 'false', "
           "'could', 'blame', 'first', 'album', 'crowd', 'first', 'broad', "
           "'extra', 'clock', 'chart', 'fiber', 'board', 'earth', 'being', "
           "'alive', 'chart', 'avoid', 'dress', 'cloud', 'clean', 'avoid', "
           "'crash', 'clean', 'arise', 'death', 'brand', 'error']\n"
           '\n'
           'Your task: Calculate the cumulative sum after each key. The first '
           'sum is just the value of the first key. The second sum is the '
           'first value plus the second value, and so on.\n'
           '\n'
           'IMPORTANT:\n'
           '- Output your answer as a single line with comma-separated values '
           'inside <answer></answer> tags\n'
           '- Do not include any other text outside the answer tags\n'
           '- Format: <answer>value1,value2,value3,...</answer>\n'
           '- Example: If the cumulative sums are [5, 8, 12], output: '
           '<answer>5,8,12</answer>\n'
           '\n'
           'Your answer:',
  'sampling_methods': [],
  'specific': None,
  'stop_sequences': (),
  'task_name': 'long_horizon_execution',
  'unconditioned_query': None,
  'use_logits': False}

@akshathmangudi
Contributor Author

Tagging #1069 for better readability.

@akshathmangudi akshathmangudi marked this pull request as ready for review November 20, 2025 12:48
@NathanHB
Member

NathanHB commented Nov 20, 2025

great!! I pulled and tested it; I modified a few things (mainly just to have multiple prompt lengths) and pushed a log dir to check the prompt and results.

@akshathmangudi are you ok with me pushing them?

https://huggingface.co/spaces/SaylorTwift/long_horizon_execution

cc: @shash42 does it look good to you ? :)

@akshathmangudi
Contributor Author

yes ofc, is this something that i will have to keep note of in future PRs as well when i integrate benchmarks?

@NathanHB
Member

forgot i can't push to your fork ahah.

Here is what it looks like:

[Screenshot 2025-11-20 at 16 19 03]

That way we have multiple tasks with different prompt lengths and we can test the limits of the models.
You only have to modify your prompt function to generate a prompt with the given size:

def _build_prompt_and_target(record, prompt_length=32768):
    """
    Helper function to extract common logic for building prompt and target.
    Uses binary search to find the maximum number of items that fit within prompt_length.
    Processes the record and returns prompt, target, and metadata.

    Returns:
        tuple: (prompt: str, target_str: str, metadata: dict)
    """
    input_keys = record["input"]
    input_values = record["values"]
    expected_output = record["output"]

    def build_prompt_for_n(n):
        """Build a prompt with the first n items."""
        if n == 0:
            return None
        keys_n = input_keys[:n]
        values_n = input_values[:n]
        dictionary_n = dict(zip(keys_n, values_n))
        dict_str = str(dictionary_n)
        keys_str = str(keys_n)
        return PROMPT_TEMPLATE.format(dict_str=dict_str, keys_str=keys_str)

    # Binary search to find maximum n that fits within prompt_length
    left, right = 0, len(input_keys)
    max_n = 0

    while left <= right:
        mid = (left + right) // 2
        prompt = build_prompt_for_n(mid)

        if prompt is None:
            break

        if len(prompt) <= prompt_length:
            max_n = mid
            left = mid + 1
        else:
            right = mid - 1

    # Use the maximum n that fits
    input_keys = input_keys[:max_n]
    input_values = input_values[:max_n]
    expected_output = expected_output[:max_n]

    dictionary = dict(zip(input_keys, input_values))
    dict_str = str(dictionary)
    keys_str = str(input_keys)
    prompt = PROMPT_TEMPLATE.format(dict_str=dict_str, keys_str=keys_str)

    target_str = ",".join(map(str, expected_output))

    metadata = {
        "input_keys": input_keys,
        "input_values": input_values,
        "expected_output": expected_output,
        "dictionary": dictionary,
        "num_items": len(input_keys),
    }

    return prompt, target_str, metadata
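
For reference, a rough sketch of how such a helper could be wired into per-length prompt functions. The Doc fields mirror the inspect output at the top of this thread, but the exact Doc constructor signature, the import path, and the prompt_fn wrapper itself are illustrative assumptions rather than a confirmed part of this PR:

from functools import partial

from lighteval.tasks.requests import Doc  # import path assumed from lighteval's layout


def prompt_fn(record: dict, task_name: str, prompt_length: int = 32768) -> Doc:
    # Build the truncated prompt and gold string with the helper above,
    # then expose them to lighteval as a generative Doc.
    prompt, target_str, metadata = _build_prompt_and_target(record, prompt_length)
    return Doc(
        task_name=task_name,
        query=prompt,
        choices=[target_str],
        gold_index=0,
        specific=metadata,
    )


# One variant per context size, e.g. a 4k and a 32k task.
prompt_fn_4k = partial(prompt_fn, prompt_length=4096)
prompt_fn_32k = partial(prompt_fn, prompt_length=32768)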

@NathanHB
Member

and no, your PR was great! It's just that it makes sense for this eval to have multiple prompt lengths.

@viciousAegis

viciousAegis commented Nov 20, 2025

hi! @akshathmangudi @NathanHB, I was just taking a look at the implementation here, and while the implementation itself looks correct, I will put down some notes to keep in mind:

  1. This implementation is a "single-turn" implementation, where the model has to output N cumulative sums corresponding to N keys. From our experiments, we have observed that this is generally harder for models; they can do much better in a multi-turn setting, where 1 or K keys are provided per turn.
  2. Since this is a "single-turn" implementation, it would be nice if the score could be the "fraction of correct sums" rather than a binary outcome, since that is a more informative metric. Then the results can be interpreted as "task length: correct fraction on that task length" for every task length in the dataset (as @NathanHB said, it makes sense to have multiple lengths). A rough sketch of such a scorer is included after this list.
  3. To make the output length more manageable, what we decided to do in the experiments in the paper (Section 3.3) was to ask the model to output only the final sum after N keys. This makes it slightly easier for the model, as it can now use any aggregation strategy it likes to compute the final sum, which leads to higher scores. Of course, the trade-off is that "per-step" correctness cannot be measured. To be clear, both implementations are valid, but it would be nice to explicitly state that this is a single-turn implementation, as we discuss both single-turn and multi-turn settings in the paper, with varying experiments and conclusions.
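
On point 2, a minimal sketch of what a fraction-of-correct-sums scorer could look like, assuming the model wraps its answer in <answer></answer> tags as instructed in the prompt above; the standalone function signature is illustrative and not lighteval's actual metric API:

import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def fraction_correct(prediction: str, target_str: str) -> float:
    """Fraction of per-step cumulative sums the model got right.

    target_str is the comma-separated gold string built by
    _build_prompt_and_target; prediction is the raw model output.
    """
    match = ANSWER_RE.search(prediction)
    if match is None:
        return 0.0
    predicted = [value.strip() for value in match.group(1).split(",")]
    gold = target_str.split(",")
    # Compare position by position; missing or extra predictions count as wrong.
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold) if gold else 0.0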

edit (addition): the multi-turn evaluation should be much simpler to implement, as the only differences arise in the loop-calling of the LLM. We implemented an online evaluation, i.e., we updated our metrics after each turn (because the evaluations took a while and we did not want to wait for all turns to be done!), but it is easy to have an offline evaluation similar (actually, identical) to the single-turn evaluation. A rough sketch of the turn loop is below.
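
For illustration, a sketch of the multi-turn loop with online scoring, assuming K keys per turn; query_model stands in for whatever client actually calls the LLM, and the per-turn prompt wording is a placeholder, not what we used in the paper:

from typing import Callable, Dict, List


def run_multi_turn(
    query_model: Callable[[List[dict]], str],  # placeholder for the actual LLM client
    keys: List[str],
    dictionary: Dict[str, int],
    k: int = 1,
) -> float:
    """Feed k keys per turn and update the score online, after every turn."""
    messages = [{"role": "system", "content": "Keep a running sum of the values of the keys you receive."}]
    running_sum = 0
    correct_turns = 0
    total_turns = 0
    for start in range(0, len(keys), k):
        chunk = keys[start : start + k]
        running_sum += sum(dictionary[key] for key in chunk)
        messages.append({
            "role": "user",
            "content": f"Keys: {', '.join(chunk)}. Reply with the current cumulative sum inside <answer></answer> tags.",
        })
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        total_turns += 1
        # Online metric update: check this turn's running sum immediately.
        if f"<answer>{running_sum}</answer>" in reply.replace(" ", ""):
            correct_turns += 1
    return correct_turns / total_turns if total_turns else 0.0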

Thank you for your efforts! I will be happy to discuss more if needed!

@akshathmangudi
Contributor Author

akshathmangudi commented Nov 20, 2025

ahhh i see.

my initial thought was to have a hardcoded MAX_ITEMS value, since the prompt might get too long to construct if we add all 50000 examples, and to truncate our input, output and value keys like input[:MAX_ITEMS]. thanks for letting me know!

and @viciousAegis thank you for the feedback! i will make the necessary fixes and you can let me know what you think!

@akshathmangudi
Contributor Author

hi @viciousAegis,

i would like you to review the code now. previously we had our single-turn implementation in a single file, but it has now been split to support both single-turn and multi-turn approaches.

for single turn:
lighteval eval <model_name> long_horizon_execution:{context_size}

for multi-turn:
lighteval eval <model_name> long_horizon_execution:{context_size}:k{turn_complexity}

section 3.3's approach was also integrated into the implementation.

@NathanHB i would like your review on this as well.

@akshathmangudi akshathmangudi closed this by deleting the head repository Nov 21, 2025
@akshathmangudi
Contributor Author

Tagging #1074 as that's the current PR. Sorry guys
