👉🏻 OmniGIRL 👈🏻

A GitHub issue resolution benchmark with multi-aspect diversity in programming languages, repository domains, and modality of input information. (ISSTA'25)

🌐 Website • 🤗 Hugging Face • 🐋 Env Docker Image • 📃 arXiv Paper • 📓 ISSTA 2025

✨ Key Features

  • 🚀 Convenient, Standardized Evaluation Environment

    Provides pre-built Docker images that significantly simplify environment setup and guarantee consistent, reproducible evaluations.

  • 🕸 Extensive Programming Language Coverage

    Supports Python, Java, JavaScript, and TypeScript, enabling evaluation across four major programming language ecosystems.

  • 🗂️ Rich Multimodal Input Data

    Integrates diverse input modalities (text, web content, and images), requiring evaluated models to understand and combine information from all sources to resolve issues effectively.

  • 🛠️ Automatic Environment Setup & Dataset Construction Tool

    We introduce SWE-Factory, an automatic issue-resolution benchmark construction pipeline based on a multi-agent framework. For more information and the full source code, visit: SWE-Factory.


📦 Environment Setup

To get started, run the bash script below to set up the environment:

bash setup.sh
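
Because the harness runs evaluations inside the pre-built Docker images mentioned above, it can save time to confirm that Docker is installed and its daemon is reachable before running the setup script. A minimal sketch, assuming the Docker CLI is on your PATH and setup.sh sits in the repository root:

import shutil
import subprocess

# The OmniGIRL harness evaluates patches inside pre-built Docker images,
# so the Docker CLI and daemon must be available on this machine.
if shutil.which("docker") is None:
    raise SystemExit("Docker CLI not found; please install Docker first.")

# `docker info` exits non-zero if the daemon is not running.
subprocess.run(["docker", "info"], check=True, capture_output=True)

# Then run the provided setup script from the repository root.
subprocess.run(["bash", "setup.sh"], check=True)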

🚀 Running Evaluations

After setting up the environment, do the following to run an evaluation:

  1. Prepare a prediction file: predicted patches in JSONL format, with each entry containing the fields below (an end-to-end sketch follows this list):

    • model_name_or_path: model name
    • instance_id: task instance ID
    • model_patch: predicted patch content

    Example:

    {
        "model_name_or_path": "agentless-v1",
        "instance_id": "prettier__prettier-12260",
        "model_patch": "diff --git ...."
    }
  2. Change into omnigirl/harness, then run the evaluation with the following command:

    # run from the harness directory (required)
    cd omnigirl/harness
    
    python run_evaluation.py --predictions_path <path to your prediction file> \
                             --max_workers <number of workers> \
                             --run_id <unique identifier for this evaluation run>
  3. By default, your evaluation results will be generated in omnigirl/harness/reports.

  4. For a detailed tutorial on evaluation, please refer to the omnigirl/harness directory.

  5. We recommend running evaluations on machines with the amd64 architecture, consistent with the evaluation environment used in the paper.
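
Putting steps 1–3 together, here is a minimal end-to-end sketch in Python. The field names (model_name_or_path, instance_id, model_patch), the CLI flags, and the report location follow the steps above; the patch content, output path, worker count, and run id are placeholder values for illustration.

import json
import subprocess
from pathlib import Path

# Step 1: write predictions as JSONL, one JSON object per line.
# The patch below is a placeholder; your own system produces the real diffs.
predictions = [
    {
        "model_name_or_path": "agentless-v1",
        "instance_id": "prettier__prettier-12260",
        "model_patch": "diff --git a/src/example.js b/src/example.js\n...",
    },
]

predictions_path = Path("predictions.jsonl").resolve()
with predictions_path.open("w") as f:
    for item in predictions:
        f.write(json.dumps(item) + "\n")

# Step 2: run the harness from omnigirl/harness.
subprocess.run(
    [
        "python", "run_evaluation.py",
        "--predictions_path", str(predictions_path),
        "--max_workers", "4",
        "--run_id", "example-run",
    ],
    cwd="omnigirl/harness",
    check=True,
)

# Step 3: by default, reports are written to omnigirl/harness/reports.

Launching the harness through subprocess is only a convenience here; invoking run_evaluation.py directly from the shell, as shown in step 2 of the list above, works the same way.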

📖 Citation

If you find OmniGIRL useful for your research and applications, feel free to give us a star ⭐ or cite us using:

@inproceedings{guo2025omnigirl,
  title={OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution},
  author={Guo, Lianghong and Tao, Wei and Jiang, Runhan and Wang, Yanlin and Chen, Jiachi and Liu, Xilin and Ma, Yuchi and Mao, Mingzhi and Zhang, Hongyu and Zheng, Zibin},
  booktitle={Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis},
  year={2025},
  publisher={{ACM}},
}

🙏 Acknowledgements

  • We build on prior work — SWE-bench, Agentless, and AutoCodeRover — which laid the groundwork for this study.
  • We thank the EvalPlus leaderboard team for releasing the elegant page template that inspired this site.
  • Finally, we are grateful to the open-source developer community for their invaluable contributions.
