This repository contains the official implementation of "Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders", accepted at ACL 2025.
```bash
git clone https://github.com/username/multilingual-llm-features.git
cd multilingual-llm-features
pip install -r requirements.txt
```
If you plan to work with the Llama-3.1-8B SAE, you will also need to install additional dependencies; follow the guidelines in the OpenMOSS/Language-Model-SAEs repository.
To download the pre-trained Sparse Autoencoders (SAEs), simply run:
```bash
python download.py
```
This will automatically download the following SAE models into the SAE directory:
- Llama3_1-8B-Base-LXR-8x
- gemma-scope-2b-pt-res
- gemma-scope-9b-pt-res
The directory structure after downloading will be:
```
SAE/
├── Llama3_1-8B-Base-LXR-8x/
├── gemma-scope-2b-pt-res/
└── gemma-scope-9b-pt-res/
```
Note: Make sure you have sufficient disk space and a stable internet connection before running the download script.
If you encounter network issues, you can uncomment the alternative download URLs in download.py to use mirror sites for downloading the models.
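As a quick sanity check after the download finishes, the sketch below verifies that the three expected SAE folders exist (the SAE directory name and subfolder names follow the structure shown above):

```python
from pathlib import Path

# Expected subdirectories inside the SAE download folder (see structure above).
EXPECTED = [
    "Llama3_1-8B-Base-LXR-8x",
    "gemma-scope-2b-pt-res",
    "gemma-scope-9b-pt-res",
]

sae_root = Path("SAE")
for name in EXPECTED:
    path = sae_root / name
    print(("ok      " if path.is_dir() else "MISSING ") + str(path))
```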
To identify language-specific features in LLMs, follow these steps (an example for gemma-2-2b):
```bash
# Step 1: Compute latent representations of language features
python latent_computation.py --model_name "gemma-2-2b" --model_path YOUR_MODEL_PATH
# For additional arguments, please refer to utils.py

# Step 2: Analyze and extract language-specific features
python latent_analysis.py --model_name "gemma-2-2b" --model_path YOUR_MODEL_PATH
```
The results will be saved in the sae_acts directory with the following structure:
```
sae_acts/
└── gemma-2b/
    └── layer_3/
        ├── top_index_per_lan_magnitude.pth  # Language-specific features ranked by magnitude
        ├── sae_acts.pth                     # Intermediate results
        └── ...
```
Note: top_index_per_lan_magnitude.pth has shape 10 × feature_num, where the first dimension indexes the 10 languages in the following order: ['en', 'es', 'fr', 'ja', 'ko', 'pt', 'th', 'vi', 'zh', 'ar'] (English, Spanish, French, Japanese, Korean, Portuguese, Thai, Vietnamese, Chinese, Arabic).
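For example, you can inspect the top-ranked features for one language by loading this file directly. This is a minimal sketch: the layer_3 path is just the example layer from the structure above, and the file is assumed to contain a single 10 × feature_num tensor as described in the note.

```python
import torch

# Language order of the first dimension, as documented above.
LANGS = ["en", "es", "fr", "ja", "ko", "pt", "th", "vi", "zh", "ar"]

# Example path from the directory structure above; assumed to hold one tensor
# of shape (10, feature_num).
ranked = torch.load(
    "sae_acts/gemma-2b/layer_3/top_index_per_lan_magnitude.pth",
    map_location="cpu",
)
print(ranked.shape)  # expected: (10, feature_num)

# Top-5 feature indices for Chinese, ranked by activation magnitude.
zh_row = ranked[LANGS.index("zh")]
print(zh_row[:5])
```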
To reproduce the code-switching analysis results:
- Execute the code_switch_analysis() function in inference.py
- Run the code_switch_analysis2() function in inference.py (a calling sketch is shown below)
The visualization results will be automatically saved in the plot directory.
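If you prefer to drive both steps from a single script, here is a minimal sketch; the argument-free calls are an assumption, so check the actual signatures in inference.py first:

```python
# Hypothetical driver script: the argument-free calls below are an assumption,
# so consult the function definitions in inference.py before running.
from inference import code_switch_analysis, code_switch_analysis2

code_switch_analysis()   # first part of the code-switching analysis
code_switch_analysis2()  # second part of the code-switching analysis
```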
To reproduce the analysis of language-specific feature ablation:
- Run the change_activation_print_ce_corpus_gen() function in inference.py with the following parameter combinations (see the sketch below):
  - --start_idx 0 --topk_feature_num 1
  - --start_idx 0 --topk_feature_num 2
  - --start_idx 1 --topk_feature_num 1
- Generate visualizations by running the change_activation_print_ce_corpus_different_same_feature_diff_lan_all() and change_activation_print_ce_corpus() functions in plot.py
All visualization outputs will be stored in the plot directory for further analysis.
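The sketch below shows one way to script the three ablation runs and the follow-up plotting. The keyword arguments mirror the parameter combinations listed above, but the exact signatures in inference.py and plot.py are assumptions and should be verified before use:

```python
# Hypothetical driver for the ablation experiments; keyword arguments are
# inferred from the parameter combinations above -- verify against inference.py.
from inference import change_activation_print_ce_corpus_gen
from plot import (
    change_activation_print_ce_corpus,
    change_activation_print_ce_corpus_different_same_feature_diff_lan_all,
)

for start_idx, topk_feature_num in [(0, 1), (0, 2), (1, 1)]:
    change_activation_print_ce_corpus_gen(
        start_idx=start_idx,
        topk_feature_num=topk_feature_num,
    )

# Generate the visualizations (also assumed to take no required arguments);
# outputs are written to the plot directory.
change_activation_print_ce_corpus_different_same_feature_diff_lan_all()
change_activation_print_ce_corpus()
```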
If you find this work helpful, please cite our paper:
```bibtex
@article{deng2025unveiling,
  title={Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders},
  author={Deng, Boyi and Wan, Yu and Zhang, Yidan and Yang, Baosong and Feng, Fuli},
  journal={arXiv preprint arXiv:2505.05111},
  year={2025}
}
```
For any questions or feedback, feel free to reach out to Boyi Deng at [email protected].