
Conversation

@akshathmangudi
Contributor

@akshathmangudi akshathmangudi commented Nov 20, 2025

Approach described in #1056.

Tasks:

  • Initial scaffolding of /tasks/tasks/long_horizon_execution.py
  • Implement a custom scorer to parse <answer> tags.
  • Complete implementation of /tasks/tasks/long_horizon_execution.py
  • Evaluation and Testing

STATUS: ready for review.

Current behavior:

When we run lighteval tasks inspect long_horizon_execution, the output is shown below:

... more lines
           "'basic', 'alive', 'cream', 'dress', 'black', 'brown', 'drama', "
           "'black', 'audio', 'brown', 'album', 'cover', 'avoid', 'aware', "
           "'event', 'dream', 'clean', 'clock', 'apple', 'above', 'close', "
           "'begin', 'allow', 'album', 'draft', 'brain', 'civil', 'faith', "
           "'death', 'coach', 'below', 'doubt', 'aware', 'cover', 'final', "
           "'allow', 'avoid', 'ahead', 'cross', 'child', 'cream', 'error', "
           "'break', 'brief', 'clock', 'final', 'dance', 'award', 'every', "
           "'chief', 'could', 'dream', 'begin', 'burst', 'audio', 'album', "
           "'cross', 'doubt', 'blood', 'child', 'brand', 'brand', 'extra', "
           "'broad', 'cloud', 'check', 'after', 'chart', 'basic', 'child', "
           "'coach', 'chair', 'faith', 'earth', 'audio', 'basic', 'field', "
           "'cloud', 'draft', 'apply', 'court', 'black', 'ahead', 'burst', "
           "'crowd', 'depth', 'enemy', 'drink', 'first', 'could', 'false', "
           "'could', 'blame', 'first', 'album', 'crowd', 'first', 'broad', "
           "'extra', 'clock', 'chart', 'fiber', 'board', 'earth', 'being', "
           "'alive', 'chart', 'avoid', 'dress', 'cloud', 'clean', 'avoid', "
           "'crash', 'clean', 'arise', 'death', 'brand', 'error']\n"
           '\n'
           'Your task: Calculate the cumulative sum after each key. The first '
           'sum is just the value of the first key. The second sum is the '
           'first value plus the second value, and so on.\n'
           '\n'
           'IMPORTANT:\n'
           '- Output your answer as a single line with comma-separated values '
           'inside <answer></answer> tags\n'
           '- Do not include any other text outside the answer tags\n'
           '- Format: <answer>value1,value2,value3,...</answer>\n'
           '- Example: If the cumulative sums are [5, 8, 12], output: '
           '<answer>5,8,12</answer>\n'
           '\n'
           'Your answer:',
  'sampling_methods': [],
  'specific': None,
  'stop_sequences': (),
  'task_name': 'long_horizon_execution',
  'unconditioned_query': None,
  'use_logits': False}

@akshathmangudi
Contributor Author

Tagging #1069 for better readability.

@akshathmangudi akshathmangudi marked this pull request as ready for review November 20, 2025 12:48
@NathanHB
Member

NathanHB commented Nov 20, 2025

great!! I pulled and tested it; I modified a few things (mainly just to have multiple prompt lengths) and pushed a log dir to check the prompt and results.

@akshathmangudi are you ok with me pushing them?

https://huggingface.co/spaces/SaylorTwift/long_horizon_execution

cc: @shash42 does it look good to you ? :)

@akshathmangudi
Contributor Author

yes ofc, is this something that i will have to keep note of in future PRs as well when i integrate benchmarks?

@NathanHB
Member

forgot i can't push to your fork ahah.

Here is what it looks like:

[Screenshot 2025-11-20 at 16 19 03]

That way we have multiple tasks with different prompt lengths and we can test the limits of the models.
You only have to modify your prompt function to generate a prompt with the given size:

def _build_prompt_and_target(record, prompt_length=32768):
    """
    Helper function to extract common logic for building prompt and target.
    Uses binary search to find the maximum number of items that fit within prompt_length.
    Processes the record and returns prompt, target, and metadata.

    Returns:
        tuple: (prompt: str, target_str: str, metadata: dict)
    """
    input_keys = record["input"]
    input_values = record["values"]
    expected_output = record["output"]

    def build_prompt_for_n(n):
        """Build a prompt with the first n items."""
        if n == 0:
            return None
        keys_n = input_keys[:n]
        values_n = input_values[:n]
        dictionary_n = dict(zip(keys_n, values_n))
        dict_str = str(dictionary_n)
        keys_str = str(keys_n)
        return PROMPT_TEMPLATE.format(dict_str=dict_str, keys_str=keys_str)

    # Binary search to find maximum n that fits within prompt_length
    left, right = 0, len(input_keys)
    max_n = 0

    while left <= right:
        mid = (left + right) // 2
        prompt = build_prompt_for_n(mid)

        if prompt is None:
            break

        if len(prompt) <= prompt_length:
            max_n = mid
            left = mid + 1
        else:
            right = mid - 1

    # Use the maximum n that fits
    input_keys = input_keys[:max_n]
    input_values = input_values[:max_n]
    expected_output = expected_output[:max_n]

    dictionary = dict(zip(input_keys, input_values))
    dict_str = str(dictionary)
    keys_str = str(input_keys)
    prompt = PROMPT_TEMPLATE.format(dict_str=dict_str, keys_str=keys_str)

    target_str = ",".join(map(str, expected_output))

    metadata = {
        "input_keys": input_keys,
        "input_values": input_values,
        "expected_output": expected_output,
        "dictionary": dictionary,
        "num_items": len(input_keys),
    }

    return prompt, target_str, metadata
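
For reference, a rough sketch of how such a helper could be wired into per-length prompt functions. The Doc fields mirror the inspect output at the top of this thread, but the exact Doc constructor signature, the import path, and the prompt_fn wrapper itself are illustrative assumptions rather than a confirmed part of this PR:

from functools import partial

from lighteval.tasks.requests import Doc  # import path assumed from lighteval's layout


def prompt_fn(record: dict, task_name: str, prompt_length: int = 32768) -> Doc:
    # Build the truncated prompt and gold string with the helper above,
    # then expose them to lighteval as a generative Doc.
    prompt, target_str, metadata = _build_prompt_and_target(record, prompt_length)
    return Doc(
        task_name=task_name,
        query=prompt,
        choices=[target_str],
        gold_index=0,
        specific=metadata,
    )


# One variant per context size, e.g. a 4k and a 32k task.
prompt_fn_4k = partial(prompt_fn, prompt_length=4096)
prompt_fn_32k = partial(prompt_fn, prompt_length=32768)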

@NathanHB
Member

and no, your PR was great! It's just that it makes sense for this eval to have multiple prompt lengths.

@viciousAegis

viciousAegis commented Nov 20, 2025

hi! @akshathmangudi @NathanHB, I was just taking a look at the implementation here, and while the implementation itself looks correct, I will put down some notes to keep in mind:

  1. This implementation is a "single-turn" implementation, where the model has to output N cumulative sums corresponding to N keys. From our experiments, we have observed that this is generally harder for models; they can do much better in a multi-turn setting, where 1 or K keys are provided per turn.
  2. Since this is a "single-turn" implementation, it would be nice if the score could be the "fraction of correct sums" rather than a binary outcome, since that is a more informative metric. Then the results can be interpreted as "task length: correct fraction on that task length" for every task length in the dataset (as @NathanHB said, it makes sense to have multiple lengths). A rough sketch of such a scorer is included after this list.
  3. To make the output length more manageable, what we decided to do in the experiments in the paper (Section 3.3) was to ask the model to output only the final sum after N keys. This makes it slightly easier for the model, as it can now use any aggregation strategy it likes to compute the final sum, which leads to higher scores. Of course, the trade-off is that "per-step" correctness cannot be measured. To be clear, both implementations are valid, but it would be nice to explicitly state that this is a single-turn implementation, as we discuss both single-turn and multi-turn settings in the paper, with varying experiments and conclusions.
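
On point 2, a minimal sketch of what a fraction-of-correct-sums scorer could look like, assuming the model wraps its answer in <answer></answer> tags as instructed in the prompt above; the standalone function signature is illustrative and not lighteval's actual metric API:

import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def fraction_correct(prediction: str, target_str: str) -> float:
    """Fraction of per-step cumulative sums the model got right.

    target_str is the comma-separated gold string built by
    _build_prompt_and_target; prediction is the raw model output.
    """
    match = ANSWER_RE.search(prediction)
    if match is None:
        return 0.0
    predicted = [value.strip() for value in match.group(1).split(",")]
    gold = target_str.split(",")
    # Compare position by position; missing or extra predictions count as wrong.
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold) if gold else 0.0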

edit (addition): the multi-turn evaluation should be much simpler to implement, as the only differences arise in the loop-calling of the LLM. We implemented an online evaluation, i.e., we updated our metrics after each turn (because the evaluations took a while and we did not want to wait for all turns to be done!), but it is easy to have an offline evaluation similar (actually, identical) to the single-turn evaluation. A rough sketch of the turn loop is below.
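
For illustration, a sketch of the multi-turn loop with online scoring, assuming K keys per turn; query_model stands in for whatever client actually calls the LLM, and the per-turn prompt wording is a placeholder, not what we used in the paper:

from typing import Callable, Dict, List


def run_multi_turn(
    query_model: Callable[[List[dict]], str],  # placeholder for the actual LLM client
    keys: List[str],
    dictionary: Dict[str, int],
    k: int = 1,
) -> float:
    """Feed k keys per turn and update the score online, after every turn."""
    messages = [{"role": "system", "content": "Keep a running sum of the values of the keys you receive."}]
    running_sum = 0
    correct_turns = 0
    total_turns = 0
    for start in range(0, len(keys), k):
        chunk = keys[start : start + k]
        running_sum += sum(dictionary[key] for key in chunk)
        messages.append({
            "role": "user",
            "content": f"Keys: {', '.join(chunk)}. Reply with the current cumulative sum inside <answer></answer> tags.",
        })
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        total_turns += 1
        # Online metric update: check this turn's running sum immediately.
        if f"<answer>{running_sum}</answer>" in reply.replace(" ", ""):
            correct_turns += 1
    return correct_turns / total_turns if total_turns else 0.0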

Thank you for your efforts! I will be happy to discuss more if needed!

@akshathmangudi
Contributor Author

akshathmangudi commented Nov 20, 2025

ahhh i see.

my initial thought was to have a hardcoded MAX_ITEMS value, since the prompt might get too long to construct if we add all 50000 examples, and to truncate our input, output and value keys like input[:MAX_ITEMS]. thanks for letting me know!

and @viciousAegis thank you for the feedback! i will make the necessary fixes and you can let me know what you think!

@akshathmangudi
Contributor Author

hi @viciousAegis,

i would like you to review the code now. previously we had our single-turn implementation in a single file, but it has now been split to support both single-turn and multi-turn approaches.

for single turn:
lighteval eval <model_name> long_horizon_execution:{context_size}

for multi-turn:
lighteval eval <model_name> long_horizon_execution:{context_size}:k{turn_complexity}

section 3.3's approach was also integrated into the implementation.

@NathanHB i would like your review on this as well.

@akshathmangudi akshathmangudi closed this by deleting the head repository Nov 21, 2025
@akshathmangudi
Contributor Author

Tagging #1074 as that's the current PR. Sorry guys
