
Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements

Language: 中文 (Chinese)

Repository Overview

This repository hosts the code, data preparation scripts, and evaluation pipelines for the EMNLP 2025 paper Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements. These resources accompany the analyses reported in the camera-ready version: Link.

Repository Layout

  • option_length/ — evaluation pipelines that stress answer-option length variations on the MMLU and ARC benchmarks.
  • question_type/ — tools for rewriting multiple-choice questions into alternative formats (e.g., boolean questions).
  • irrelvant_nouns/ — GSM8K noun-replacement stress tests with ready-to-use evaluation scripts and datasets.

Getting Started

  1. Create a dedicated environment (example using Conda).
  2. Install the bundled evaluation harness (lm-evaluation-harness) in editable mode.
  3. Install any experiment-specific extras noted in the subdirectory READMEs, then follow the instructions below.
conda create -n lm_eval python=3.12 -y
conda activate lm_eval
pip install -e ./lm-evaluation-harness
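
As an optional sanity check, the editable install should expose the lm_eval command line (assuming lm-evaluation-harness v0.4.x installed cleanly); listing the registered tasks is a quick way to confirm it:

# should print the tasks bundled with the harness if the install succeeded
lm_eval --tasks list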

Experiments

Option Length Stress Tests

Located in option_length/. Use paraphrase/mmlu.py and paraphrase/arc.py to generate rewritten datasets, or download the packaged datasets.tar.gz from Google Drive (Link). Evaluate models via the scripts in option_length/scripts/ (eval_mmlu.sh, eval_arc.sh, eval_mmlu_vary.sh). Results are stored under option_length/results/.
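
A typical session is sketched below; it assumes the commands are run from the repository root and that the evaluation scripts read their model configuration internally (check option_length/scripts/ for the exact arguments and expected dataset paths):

# Option A: regenerate the length-perturbed datasets locally
python option_length/paraphrase/mmlu.py
python option_length/paraphrase/arc.py
# Option B: unpack the packaged archive from Google Drive instead
# (destination directory is an assumption; check the scripts' expected paths)
tar -xzf datasets.tar.gz -C option_length/

# run the evaluations; logs and scores are written to option_length/results/
bash option_length/scripts/eval_mmlu.sh
bash option_length/scripts/eval_arc.sh
bash option_length/scripts/eval_mmlu_vary.sh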

Problem Type Conversions

The question_type/ directory provides utilities for rewriting MMLU-style multiple-choice questions into boolean formats. Execute python make_bq.py inside the dataset directory to regenerate the boolean questions when needed.
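
A minimal invocation, assuming you run it from inside the dataset directory shipped in question_type/ (its exact name is a placeholder here):

cd question_type/<dataset_dir>   # placeholder: the dataset directory under question_type/
# regenerate the boolean-question variants in place
python make_bq.py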

Irrelevant Noun Replacement

The irrelvant_nouns/ folder contains preprocessed GSM8K datasets featuring noun substitutions with varying semantic drift. Launch the full evaluation with bash run_evaluate.sh. Pre-generation scripts are available under irrelvant_nouns/preprocess_data/, but rerunning them is optional.
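
For reference, a minimal run might look like this, assuming run_evaluate.sh is launched from inside irrelvant_nouns/ and the preprocessed datasets are used as shipped:

# evaluate on the preprocessed GSM8K noun-replacement datasets
cd irrelvant_nouns
bash run_evaluate.sh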

Result Snapshot

Representative outcomes on option-length perturbations (MMLU / ARC-C). Detailed logs live in option_length/results/.

Benchmark   Model          Origin   RL     WL
MMLU        Qwen2.5 1.5B   60.3     89.0   36.3
MMLU        Qwen2.5 7B     73.7     90.1   55.6
MMLU        Qwen2.5 72B    85.4     94.1   75.6
MMLU        LLaMa3.1 8B    65.5     85.6   53.6
MMLU        LLaMa3.1 70B   78.8     93.6   70.6
MMLU        GPT4o mini     76.5     87.2   70.6
MMLU        GPT4o          85.2     89.7   83.3
ARC-C       Qwen2.5 1.5B   77.3     88.9   68.1
ARC-C       Qwen2.5 7B     90.0     94.3   84.0
ARC-C       Qwen2.5 72B    95.8     97.2   94.4
ARC-C       LLaMa3.1 8B    78.1     85.2   74.7
ARC-C       LLaMa3.1 70B   91.8     96.3   90.8
ARC-C       GPT4o mini     91.8     95.1   91.4
ARC-C       GPT4o          96.5     97.1   95.5

Origin: accuracy on the original MMLU and ARC-C benchmarks. RL: accuracy after lengthening the right (correct) option. WL: accuracy after lengthening a wrong option.

Citation

If you use this repository, please cite the paper below.

@inproceedings{paperpitfall2025,
  title     = {Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements},
  author    = {Guangxiang Zhao and Saier Hu and Xiaoqi Jian and Jinzhu Wu and Yuhan Wu and Change Jia and Lin Sun and Xiangzheng Zhang},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  url       = {https://arxiv.org/abs/2502.12459}
}

Contact

For questions or collaboration requests, feel free to open an issue or reach out via the email listed in the paper.

Acknowledgments

This repository builds on EleutherAI's lm-evaluation-harness (v0.4.3).
