Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements
This repository hosts the code, data preparation scripts, and evaluation pipelines for the paper Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements (EMNLP 2025). The resources here accompany the analyses reported in the camera-ready paper: Link.
- `option_length/` — evaluation pipelines stressing answer-option length variations for the MMLU and ARC benchmarks.
- `question_type/` — tools for rewriting multiple-choice questions into alternative formats (e.g., boolean questions).
- `irrelvant_nouns/` — GSM8K noun-replacement stress tests with ready-to-use evaluation scripts and datasets.
- Create a dedicated environment (example using Conda).
- Install the bundled evaluation harness (`lm-evaluation-harness`) in editable mode.
- Install any experiment-specific extras noted in the subdirectory READMEs, then follow the instructions below.
conda create -n lm_eval python=3.12 -y
conda activate lm_eval
pip install -e ./lm-evaluation-harness
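After the editable install, a quick import check can confirm the harness is available. This is a minimal sanity check and assumes the harness exposes its usual `lm_eval` Python package:

```bash
# Sanity check: the editable install should make the lm_eval package importable.
python -c "import lm_eval; print('lm-evaluation-harness is importable')"
```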
Located in `option_length/`. Use `paraphrase/mmlu.py` and `paraphrase/arc.py` to generate the rewritten datasets, or download the packaged `datasets.tar.gz` from Google Drive (Link). Evaluate models via the scripts in `option_length/scripts/` (`eval_mmlu.sh`, `eval_arc.sh`, `eval_mmlu_vary.sh`); a sketch of a typical run is shown below. Results are stored under `option_length/results/`.
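The following is only a sketch of the invocation order; model names, data paths, and any script arguments are defined inside the scripts themselves, so check them before running:

```bash
# Sketch of the option-length workflow; arguments may differ in your checkout.
cd option_length
python paraphrase/mmlu.py        # regenerate length-perturbed MMLU options
python paraphrase/arc.py         # regenerate length-perturbed ARC options
bash scripts/eval_mmlu.sh        # evaluate on the rewritten MMLU sets
bash scripts/eval_arc.sh         # evaluate on the rewritten ARC sets
bash scripts/eval_mmlu_vary.sh   # sweep option-length variations on MMLU
# Results are written under option_length/results/
```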
The `question_type/` directory provides utilities for rewriting MMLU-style multiple-choice questions into boolean formats. Execute `python make_bq.py` inside the dataset directory to regenerate the boolean questions when needed (see the sketch below).
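A minimal sketch of that regeneration step; the dataset directory name below is a placeholder, so substitute the actual path from the `question_type/` README:

```bash
# Sketch: regenerate the boolean-question datasets.
cd question_type/DATASET_DIR     # placeholder; use the actual dataset directory
python make_bq.py                # rewrites the multiple-choice items as boolean questions
```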
The `irrelvant_nouns/` folder contains preprocessed GSM8K datasets featuring noun substitutions with varying degrees of semantic drift. Launch the full evaluation with `bash run_evaluate.sh` (see the sketch below). Pre-generation scripts are available under `irrelvant_nouns/preprocess_data/`, but rerunning them is optional.
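For example, a run over the pre-built noun-substituted sets might look like this, assuming `run_evaluate.sh` picks up its model configuration internally:

```bash
# Sketch: evaluate on the noun-substituted GSM8K datasets.
cd irrelvant_nouns
bash run_evaluate.sh             # runs the full noun-replacement evaluation
# Optional: rebuild the datasets with the scripts under preprocess_data/
```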
Representative results on option-length perturbations (MMLU / ARC). Detailed logs live in `option_length/results/`.
| Benchmark | Model | Origin | RL | WL |
|---|---|---|---|---|
| MMLU | Qwen2.5 1.5B | 60.3 | 89.0 | 36.3 |
| | Qwen2.5 7B | 73.7 | 90.1 | 55.6 |
| | Qwen2.5 72B | 85.4 | 94.1 | 75.6 |
| | LLaMa3.1 8B | 65.5 | 85.6 | 53.6 |
| | LLaMa3.1 70B | 78.8 | 93.6 | 70.6 |
| | GPT4o mini | 76.5 | 87.2 | 70.6 |
| | GPT4o | 85.2 | 89.7 | 83.3 |
| ARC-C | Qwen2.5 1.5B | 77.3 | 88.9 | 68.1 |
| | Qwen2.5 7B | 90.0 | 94.3 | 84.0 |
| | Qwen2.5 72B | 95.8 | 97.2 | 94.4 |
| | LLaMa3.1 8B | 78.1 | 85.2 | 74.7 |
| | LLaMa3.1 70B | 91.8 | 96.3 | 90.8 |
| | GPT4o mini | 91.8 | 95.1 | 91.4 |
| | GPT4o | 96.5 | 97.1 | 95.5 |
Origin: the original MMLU and ARC-C benchmarks. RL: the right (correct) option is lengthened. WL: a wrong option is lengthened.
If you use this repository, please cite the paper below.
@inproceedings{paperpitfall2025,
title = {Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements},
author = {Guangxiang Zhao and Saier Hu and Xiaoqi Jian and Jinzhu Wu and Yuhan Wu and Change Jia and Lin Sun and Xiangzheng Zhang},
booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
year = {2025},
url = {https://arxiv.org/abs/2502.12459}
}
For questions or collaboration requests, feel free to open an issue or reach out via the email listed in the paper.
This repository builds on EleutherAI's lm-evaluation-harness (v0.4.3).