[EVAL] Long Horizon Execution #1072
Conversation

Tagging #1069 for better readability.

Great!! I pulled and tested it, modified a few things (mainly to add multiple prompt lengths), and pushed a log dir so you can check the prompt and results. @akshathmangudi, are you OK with me pushing them? https://huggingface.co/spaces/SaylorTwift/long_horizon_execution cc: @shash42, does it look good to you? :)

Yes, of course. Is this something I will have to keep in mind in future PRs as well, when I integrate benchmarks?

And no, your PR was great! It's just that it makes sense for this eval to have multiple prompt lengths.

Hi! @akshathmangudi @NathanHB, I was just taking a look at the implementation here, and while the implementation itself looks correct, I will put down some notes to keep in mind:

Edit (addition): the multi-turn evaluation should be much simpler to implement, as the only differences arise in calling the LLM in a loop. We implemented an online evaluation, i.e., we updated our metrics after each turn (because the evaluations took a while, and we did not want to wait for all turns to be done!), but it is easy to have an offline evaluation similar (actually, identical) to the single-turn evaluation. Thank you for your efforts! I will be happy to discuss more if needed!
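
For concreteness, a minimal sketch of that online-vs-offline distinction. Everything here is illustrative: `call_model`, the turn structure, and the exact-match scoring are assumptions made for the sketch, not lighteval's actual API:

```python
# Illustrative sketch of online vs. offline multi-turn evaluation.
# `call_model`, `run_multi_turn`, and the turn dicts are hypothetical names.

def call_model(history: list[dict]) -> str:
    """Placeholder for a single LLM call over the running conversation."""
    raise NotImplementedError  # swap in a real client here

def run_multi_turn(turns: list[dict], online: bool = True) -> float:
    history: list[dict] = []
    correct = 0
    for i, turn in enumerate(turns, start=1):
        history.append({"role": "user", "content": turn["prompt"]})
        answer = call_model(history)
        history.append({"role": "assistant", "content": answer})
        correct += int(answer.strip() == turn["target"])
        if online:
            # Online: report the running metric after every turn, so long
            # runs surface partial results without waiting for all turns.
            print(f"turn {i}: running accuracy = {correct / i:.3f}")
    # Offline: a single score once all turns are done, identical in shape
    # to the single-turn evaluation.
    return correct / len(turns)
```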

Ahhh, I see. My initial thought was to have a hardcoded MAX_ITEMS value, since the prompt might be too long to construct if we add all 50,000 examples, and to truncate our input, output, and value keys. @viciousAegis, thank you for the feedback! I will make the necessary fixes and you can let me know what you think!
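
A rough sketch of that truncation idea, assuming a hypothetical MAX_ITEMS constant and build_prompt helper (neither is the final implementation, and the cap value is just a placeholder):

```python
# Illustrative only: cap prompt size with a hardcoded item limit instead of
# serializing all 50,000 examples into a single prompt.
MAX_ITEMS = 100  # assumed cap; the real value would be a tuning choice

def build_prompt(items: list[tuple[str, int]]) -> str:
    # Keep only the first MAX_ITEMS key/value pairs and drop the rest.
    truncated = items[:MAX_ITEMS]
    lines = [f"{key}: {value}" for key, value in truncated]
    return "Dictionary:\n" + "\n".join(lines)
```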

Hi @viciousAegis, I would like you to review the code now. Previously we had our single-turn implementation in a single file, but it has now been split to support single- and multi-turn approaches.
For single turn:
For multi-turn:
Section 3.3's approach was also integrated into the implementation. @NathanHB, I would like your review on this as well.

Tagging #1074, as that's the current PR. Sorry, guys.

Approach described in #1056.
Tasks:
./tasks/tasks/long_horizon_execution.py (answers are returned within <answer> tags; see the sketch below)

STATUS: ready for review.
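
Assuming the model's answers are parsed out of those <answer> tags, extraction could look something like the following regex-based helper. This is a sketch; the actual parsing in long_horizon_execution.py may differ:

```python
import re

# Hypothetical helper; the real parsing in the task file may differ.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)

def extract_answer(completion: str) -> str | None:
    matches = ANSWER_RE.findall(completion)
    # Take the last match so earlier text that echoes the tag format
    # does not shadow the final answer.
    return matches[-1].strip() if matches else None
```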
Current behavior:
When we run `lighteval tasks inspect long_horizon_execution`, the output is shown below: