Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements
This repository hosts the code, data preparation scripts, and evaluation pipelines for the paper Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements (EMNLP 2025). The resources here accompany the analyses reported in the camera-ready paper: Link.
- `option_length/` — evaluation pipelines stressing answer-option length variations for the MMLU and ARC benchmarks.
- `question_type/` — tools for rewriting multiple-choice questions into alternative formats (e.g., boolean questions).
- `irrelvant_nouns/` — GSM8K noun-replacement stress tests with ready-to-use evaluation scripts and datasets.
- Create a dedicated environment (example using Conda).
- Install the bundled evaluation harness (`lm-evaluation-harness`) in editable mode.
- Install any experiment-specific extras noted in the subdirectory READMEs, then follow the instructions below.
conda create -n lm_eval python=3.12 -y
conda activate lm_eval
pip install -e ./lm-evaluation-harness
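After the editable install, a quick import check can confirm the harness is available. This is a minimal sanity check and assumes the harness exposes its usual `lm_eval` Python package:

```bash
# Sanity check: the editable install should make the lm_eval package importable.
python -c "import lm_eval; print('lm-evaluation-harness is importable')"
```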
Located in `option_length/`. Use `paraphrase/mmlu.py` and `paraphrase/arc.py` to generate the rewritten datasets, or download the packaged `datasets.tar.gz` from Google Drive (Link). Evaluate models via the scripts in `option_length/scripts/` (`eval_mmlu.sh`, `eval_arc.sh`, `eval_mmlu_vary.sh`); a sketch of a typical run is shown below. Results are stored under `option_length/results/`.
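The following is only a sketch of the invocation order; model names, data paths, and any script arguments are defined inside the scripts themselves, so check them before running:

```bash
# Sketch of the option-length workflow; arguments may differ in your checkout.
cd option_length
python paraphrase/mmlu.py        # regenerate length-perturbed MMLU options
python paraphrase/arc.py         # regenerate length-perturbed ARC options
bash scripts/eval_mmlu.sh        # evaluate on the rewritten MMLU sets
bash scripts/eval_arc.sh         # evaluate on the rewritten ARC sets
bash scripts/eval_mmlu_vary.sh   # sweep option-length variations on MMLU
# Results are written under option_length/results/
```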
The `question_type/` directory provides utilities for rewriting MMLU-style multiple-choice questions into boolean formats. Execute `python make_bq.py` inside the dataset directory to regenerate the boolean questions when needed (see the sketch below).
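A minimal sketch of that regeneration step; the dataset directory name below is a placeholder, so substitute the actual path from the `question_type/` README:

```bash
# Sketch: regenerate the boolean-question datasets.
cd question_type/DATASET_DIR     # placeholder; use the actual dataset directory
python make_bq.py                # rewrites the multiple-choice items as boolean questions
```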
The `irrelvant_nouns/` folder contains preprocessed GSM8K datasets featuring noun substitutions with varying degrees of semantic drift. Launch the full evaluation with `bash run_evaluate.sh` (see the sketch below). Pre-generation scripts are available under `irrelvant_nouns/preprocess_data/`, but rerunning them is optional.
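For example, a run over the pre-built noun-substituted sets might look like this, assuming `run_evaluate.sh` picks up its model configuration internally:

```bash
# Sketch: evaluate on the noun-substituted GSM8K datasets.
cd irrelvant_nouns
bash run_evaluate.sh             # runs the full noun-replacement evaluation
# Optional: rebuild the datasets with the scripts under preprocess_data/
```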
Representative results on option-length perturbations (MMLU / ARC). Detailed logs live in `option_length/results/`.
| Benchmark | Model | Origin | RL | WL |
|---|---|---|---|---|
| MMLU | Qwen2.5 1.5B | 60.3 | 89.0 | 36.3 |
| | Qwen2.5 7B | 73.7 | 90.1 | 55.6 |
| | Qwen2.5 72B | 85.4 | 94.1 | 75.6 |
| | LLaMa3.1 8B | 65.5 | 85.6 | 53.6 |
| | LLaMa3.1 70B | 78.8 | 93.6 | 70.6 |
| | GPT4o mini | 76.5 | 87.2 | 70.6 |
| | GPT4o | 85.2 | 89.7 | 83.3 |
| ARC-C | Qwen2.5 1.5B | 77.3 | 88.9 | 68.1 |
| | Qwen2.5 7B | 90.0 | 94.3 | 84.0 |
| | Qwen2.5 72B | 95.8 | 97.2 | 94.4 |
| | LLaMa3.1 8B | 78.1 | 85.2 | 74.7 |
| | LLaMa3.1 70B | 91.8 | 96.3 | 90.8 |
| | GPT4o mini | 91.8 | 95.1 | 91.4 |
| | GPT4o | 96.5 | 97.1 | 95.5 |
Origin: the original MMLU and ARC-C benchmarks. RL: the right (correct) option is lengthened. WL: a wrong option is lengthened.
If you use this repository, please cite the paper below.
@inproceedings{paperpitfall2025,
title = {Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements},
author = {Guangxiang Zhao and Saier Hu and Xiaoqi Jian and Jinzhu Wu and Yuhan Wu and Change Jia and Lin Sun and Xiangzheng Zhang},
booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
year = {2025},
url = {https://arxiv.org/abs/2502.12459}
}
For questions or collaboration requests, feel free to open an issue or reach out via the email listed in the paper.
This repository builds on EleutherAI's lm-evaluation-harness (v0.4.3).