[EVAL] Long Horizon Execution #1074
cc: @NathanHB
Looking good! Will run locally and review today or at the start of next week :)
I ran the benchmark on HF Inference's https://huggingface.co/spaces/akshathmangudi/lhe-gpt4o-single
…hteval into akshath/issue-1056-v2
NathanHB
left a comment
Hey! Thanks for the hard work on this, I'm testing it locally right now. I have some small nits, but it's looking almost ready!
src/lighteval/tasks/tasks/long_horizon_execution/single_turn.py
NathanHB
left a comment
Tested on single turn, working great with the few nits I added above. However, I cannot seem to get the multi-turn working. Can you ping me when it's ready?
@NathanHB it should be working now. I've created a link below that tests both single and multi-turn.
I messed up my previous git clone, so I had to redo the changes 😅
Description:
Approach described within #1056.
Tasks:
- `/tasks/tasks/long_horizon_execution.py`
- `<answer>` tags.
- `/tasks/tasks/long_horizon_execution.py`
STATUS: ready for review.
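The task list mentions `<answer>` tags, which suggests the model's final answer is read out of its response. Here is a minimal sketch of that kind of extraction, assuming the answer is wrapped in a literal `<answer>...</answer>` pair. The helper name `extract_answer` and the "take the last match" rule are illustrative assumptions, not the code in this PR.

```python
import re
from typing import Optional

# Hypothetical pattern: non-greedy match so several tag pairs are found separately.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the last <answer>...</answer> pair, or None.

    Assumption: when a response contains several tagged answers, the last one
    is treated as final. The actual task may use a different rule.
    """
    matches = ANSWER_RE.findall(response)
    if not matches:
        return None
    return matches[-1].strip()


if __name__ == "__main__":
    print(extract_answer("Some reasoning... <answer> 42 </answer>"))
```

Returning `None` when no tags are present lets a metric count the sample as a format failure rather than guessing at an answer.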
Current behavior:
When we run `lighteval tasks inspect long_horizon_execution`, the output is as shown below: