🌐 Website • 🤗 Hugging Face • 🐋 Env Docker Image • 📃 arXiv Paper • 📓 ISSTA 2025
- 🚀 **Convenient, Standardized Evaluation Environment**: We provide pre-built Docker images that significantly simplify environment setup and guarantee the consistency and reproducibility of evaluation runs.
- 🕸 **Extensive Programming Language Coverage**: OmniGIRL supports Python, Java, JavaScript, and TypeScript, enabling effective evaluation across these four major programming language ecosystems.
- 🗂️ **Rich Multimodal Input Data**: The benchmark integrates diverse modalities (text, web content, and images), requiring evaluated models to understand and leverage information from all sources to resolve issues effectively.
- ⚒ **Automatic Environment Setup & Dataset Construction Tool**: We introduce SWE-Factory, an automatic issue-resolution benchmark construction pipeline based on a multi-agent framework. For more information and the full source code, see SWE-Factory.
To get started, run the bash script below to set up the environment:
```bash
bash setup.sh
```
After setting up the environment, follow the steps below to run an evaluation:
- Prepare the prediction file: a JSONL file of patches, where each item contains:
  - `model_name_or_path`: model name
  - `instance_id`: task instance ID
  - `model_patch`: predicted patch content

  Example:

  ```json
  { "model_name_or_path": "agentless-v1", "instance_id": "prettier__prettier-12260", "model_patch": "diff --git ...." }
  ```

  A minimal sketch for producing such a file is shown after this list.
- Move to `omnigirl/harness`, then run the evaluation with the following command:

  ```bash
  # required
  cd omnigirl/harness
  python run_evaluation.py --predictions_path <path of your prediction results> \
      --max_workers <number of workers> \
      --run_id <unique id of this evaluation>
  ```
- By default, your evaluation results will be generated in `omnigirl/harness/reports`.
- For a detailed tutorial on evaluation, please refer to the `omnigirl/harness` directory.
- We recommend running the evaluation on machines with an amd64 architecture, consistent with the evaluation environment used in the paper.
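As a starting point, here is a minimal sketch (not part of the official harness) of one way to produce a predictions file in the format described above. The example values and the output filename `predictions.jsonl` are illustrative placeholders.

```python
import json

# Hypothetical predictions; in practice these come from your own model or agent.
predictions = [
    {
        "model_name_or_path": "agentless-v1",        # name of the evaluated system
        "instance_id": "prettier__prettier-12260",   # OmniGIRL task instance ID
        "model_patch": "diff --git a/src/a.js b/src/a.js\n...",  # unified diff produced by the model
    },
]

# Write one JSON object per line (JSONL), as expected by run_evaluation.py.
with open("predictions.jsonl", "w", encoding="utf-8") as f:
    for item in predictions:
        f.write(json.dumps(item) + "\n")
```

The resulting file can then be passed to the harness, for example: `python run_evaluation.py --predictions_path predictions.jsonl --max_workers 4 --run_id my-first-run` (the worker count and run ID here are arbitrary examples).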
If you find OmniGIRL useful for your research and applications, feel free to give us a star ⭐ or cite us using:
```bibtex
@inproceedings{guo2025omnigirl,
  title={OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution},
  author={Guo, Lianghong and Tao, Wei and Jiang, Runhan and Wang, Yanlin and Chen, Jiachi and Liu, Xilin and Ma, Yuchi and Mao, Mingzhi and Zhang, Hongyu and Zheng, Zibin},
  booktitle={Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis},
  year={2025},
  publisher={{ACM}},
}
```
- We build on prior work — SWE-bench, Agentless, and AutoCodeRover — which laid the groundwork for this study.
- We thank the EvalPlus leaderboard team for releasing the elegant page template that inspired this site.
- Finally, we are grateful to the open-source developer community for their invaluable contributions.