This repository contains code for analyzing the representations produced by contextual language models from a topological perspective. In particular, we study changes in the local intrinsic dimension (LID) of the model's hidden states in different scenarios.
Details can be found in our paper Less is More: Local Intrinsic Dimensions of Contextual Language Models.
Abstract:
Understanding the internal mechanisms of large language models (LLMs) remains a challenging and complex endeavor. Even fundamental questions, such as how fine-tuning affects model behavior, often require extensive empirical evaluation. In this paper, we introduce a novel perspective based on the geometric properties of contextual latent embeddings to study the effects of training and fine-tuning. To that end, we measure the local dimensions of a contextual language model's latent space and analyze their shifts during training and fine-tuning. We show that the local dimensions provide insights into the model's training dynamics and generalization ability. Specifically, the mean of the local dimensions predicts when the model's training capabilities are exhausted, as exemplified in a dialogue state tracking task, overfitting, as demonstrated in an emotion recognition task, and grokking, as illustrated with an arithmetic task. Furthermore, our experiments suggest a practical heuristic: reductions in the mean local dimension tend to accompany and predict subsequent performance gains. Through this exploration, we aim to provide practitioners with a deeper understanding of the implications of fine-tuning on embedding spaces, facilitating informed decisions when configuring models for specific applications. The results of this work contribute to the ongoing discourse on the interpretability, adaptability, and generalizability of LLMs by bridging the gap between intrinsic model mechanisms and geometric properties in the respective embeddings.
# Clone the repository and navigate to the directory
git clone [REPOSITORY_URL] Topo_LLM
cd Topo_LLM
# Call the setup script to set the environment variables
./topollm/setup/setup_environment.sh
# Run the pipeline to compute local estimates with default parameters
uv run pipeline_local_estimates
- Python 3.12
- `uv` package manager

On macOS, you can install `uv` with Homebrew: `brew install uv`.
On Linux, install `pipx` via `apt` and then install `uv` via `pipx`:
sudo apt update
sudo apt install pipx
pipx ensurepath
pipx install uv
- Install the Python version with `uv`:
uv python install 3.12
- You can check the installed python versions with:
uv python list
- Lock the dependencies with `uv` and sync the environment:
uv lock
uv sync
- Start a python interpreter with the local environment:
uv run python3
On some HPC clusters, you might need to pin a torch version in the `pyproject.toml` file to make the installation of torch together with a compatible CUDA version work.
For example, on our HPC cluster, the installation currently works when the torch version is pinned to `2.3.*`:
torch = "2.3.*"
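After syncing, you can quickly verify that the installed torch build matches a usable CUDA setup. The following is an illustrative check (run it inside the environment, e.g. via `uv run python3`); it is not part of the repository:

```python
import torch

# Illustrative sanity check for the pinned torch version and CUDA support.
print(f"torch version: {torch.__version__}")
print(f"built against CUDA: {torch.version.cuda}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU 0: {torch.cuda.get_device_name(0)}")
```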
- Set the environment variables used in the project config.
  This can be done by running the setup script in the `topollm/setup/` directory once:
./topollm/setup/setup_environment.sh
- If required, e.g. when running jobs on an HPC cluster, set the correct environment variables in the `.env` file in the project root directory.
- To set up the repository for job submissions to an HPC cluster, follow the instructions for our Hydra HPC Launcher. Additional submission scripts are located in the `topollm/scripts/submission_scripts` directory.
- Download the files necessary for `nltk`: start a Python interpreter and run the following:
>>> import nltk
>>> nltk.download('punkt_tab')
>>> nltk.download('averaged_perceptron_tagger_eng')
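To confirm that the `nltk` resources are available, a quick check like the following should run without errors (an illustrative snippet, not part of the pipeline; `word_tokenize` relies on the punkt resources and `pos_tag` on the averaged perceptron tagger):

```python
import nltk

# Tokenization requires the punkt resources, POS tagging the averaged perceptron tagger.
tokens = nltk.word_tokenize("Local intrinsic dimensions of contextual language models.")
print(nltk.pos_tag(tokens))
```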
We use Hydra for the config management.
Please see the documentation and the experiments below for examples of how to use the config files in `configs` and command-line overrides.
The data directory is set in most of the Python scripts via the Hydra config (see the script `topollm/config_classes/get_data_dir.py` for a common function to access the data directory path).
We additionally set the path to the local data directory in the `.env` file in the project root directory, in the variable `LOCAL_TOPO_LLM_DATA_DIR`.
Most of the shell scripts use this variable to set the data directory path.
For consistency, please make sure that both paths are set correctly and point to the same directory.
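As a quick way to inspect the `.env` setting, you can read the variable and verify that it points to an existing directory. This sketch assumes the `python-dotenv` package is available; it is only an illustration and not part of the repository:

```python
import os
from pathlib import Path

from dotenv import load_dotenv  # assumption: python-dotenv is installed

# Read the `.env` file from the project root (the current working directory here).
load_dotenv()

data_dir = os.environ.get("LOCAL_TOPO_LLM_DATA_DIR")
print(f"LOCAL_TOPO_LLM_DATA_DIR = {data_dir}")

if data_dir is None or not Path(data_dir).is_dir():
    print("Warning: the data directory from .env is unset or does not exist.")
```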
The following datasets can be used via their config file in `configs/data`:

- `multiwoz21.yaml`: MultiWOZ2.1; HuggingFace
- `ertod_emowoz.yaml`: EmoWOZ; HuggingFace
- `trippy_r_dataloaders_processed.yaml`: TripPy-R training, validation, and test data; GitLab
- `sgd.yaml`: SGD; GitHub, HuggingFace
- `wikitext-103-v1.yaml`: Wikipedia; HuggingFace
- `one-year-of-tsla-on-reddit.yaml`: Reddit; HuggingFace
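For the datasets hosted on HuggingFace, you can inspect the underlying data independently of this repository. The following sketch loads the WikiText-103 validation split directly with the `datasets` library; this is only an illustration, since the pipeline itself accesses the data through the Hydra configs above:

```python
from datasets import load_dataset  # assumption: the `datasets` library is installed

# Load the validation split of WikiText-103 directly from HuggingFace.
wikitext = load_dataset("wikitext", "wikitext-103-v1", split="validation")
print(wikitext)
print(wikitext[0]["text"][:200])
```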
To prepare the dialogue datasets, which will be saved as `.jsonl` files in the directory `data/datasets/dialogue_datasets`, use the script `topollm/data_processing/prepare_dialogue_data.py`.
The dialogue dataset preparation requires an installation of ConvLab-3, which is not included in the `uv` environment to avoid version conflicts.
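Each line of the prepared `.jsonl` files is a single JSON record, so the output can be inspected with a few lines of Python. The file name below is a placeholder; use whatever `prepare_dialogue_data.py` produced in your setup:

```python
import json
from pathlib import Path

# Placeholder path; point this at one of the files produced by prepare_dialogue_data.py.
jsonl_path = Path("data/datasets/dialogue_datasets") / "example.jsonl"

with jsonl_path.open(encoding="utf-8") as file:
    records = [json.loads(line) for line in file if line.strip()]

print(f"Loaded {len(records)} records")
if records:
    print(sorted(records[0].keys()))
```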
We provide `uv run` commands in the `pyproject.toml` file for the most important entry points of the module.
Most importantly, the following command runs the full pipeline:

- from computing the embeddings,
- to the embedding data preparation (collecting token embeddings, removing padding token embeddings),
- and finally computing the local estimates (for example, the TwoNN-based local dimension).
uv run pipeline_local_estimates
Usually, you will want to select a specific dataset and model to run the pipeline on, and change additional hyperparameters.
You can do this by passing the parameters as command-line arguments to the `uv run` command:
uv run pipeline_local_estimates \
data="wikitext-103-v1" \
data.data_subsampling.number_of_samples=512 \
data.data_subsampling.sampling_mode="random" \
data.data_subsampling.split="validation" \
data.data_subsampling.sampling_seed=778 \
language_model="roberta-base" \
embeddings.embedding_data_handler.mode="regular" \
embeddings_data_prep.sampling.num_samples=3000 \
local_estimates=twonn \
local_estimates.filtering.num_samples=500 \
local_estimates.pointwise.absolute_n_neighbors=128
The parameters in the command line override the default parameters in the config file. Here, we explain the parameters in the example command:

- `data="wikitext-103-v1"`: Compute local estimates for the Wikipedia dataset.
- `data.data_subsampling.number_of_samples=512`: Sample 512 sequences from the dataset (i.e., set `M=512` as the size of the text corpus sequence sub-sample).
- `data.data_subsampling.sampling_mode="random"`: Randomly sample the sequences from the dataset (the other option is `take_first` for taking the first `M` sequences).
- `data.data_subsampling.split="validation"`: Use the validation split of the dataset (other options are `train` and `test` or `dev`, depending on the dataset).
- `data.data_subsampling.sampling_seed=778`: Set the random seed for the random sequence sampling.
- `language_model="roberta-base"`: Use the RoBERTa base model for the embeddings.
- `embeddings.embedding_data_handler.mode="regular"`: Use the regular mode for the embeddings (the other option is `masked_token` for masked language models).
- `embeddings_data_prep.sampling.num_samples=3000`: This many non-padding tokens are sampled from the sequences in the embeddings data preparation step.
- `local_estimates=twonn`: Compute the TwoNN-based local estimates (another option is `lpca` for local-PCA-based dimension estimates); a minimal sketch of the pointwise TwoNN computation follows after this list.
- `local_estimates.filtering.num_samples=500`: This many tokens are sub-sampled for the local estimates computation (i.e., set `N=500` as the size of the token sub-sample).
- `local_estimates.pointwise.absolute_n_neighbors=128`: Use 128 neighbors for the pointwise local estimates (i.e., set `L=128` as the local neighborhood size).
The results will be saved in the `data_dir` specified in the config file, and the file path will contain information about the model and the dataset used, together with additional hyperparameter choices.
For example, the results for the command above will be saved in the following directory:
data/analysis/local_estimates/data=wikitext-103-v1_strip-True_rm-empty=True_spl-mode=proportions_spl-shuf=True_spl-seed=0_tr=0.8_va=0.1_te=0.1_ctxt=dataset_entry_feat-col=ner_tags/split=validation_samples=512_sampling=random_sampling-seed=778/edh-mode=regular_lvl=token/add-prefix-space=False_max-len=512/model=roberta-base_task=masked_lm_dr=defaults/layer=-1_agg=mean/norm=None/sampling=random_seed=42_samples=3000/desc=twonn_samples=500_zerovec=keep_dedup=array_deduplicator_noise=do_nothing/
After the computation, this directory contains the following files:
.
├── additional_distance_computations_results.json
├── array_for_estimator.npy # <-- The array of vectors used for the local estimates computation (optional)
├── global_estimate.npy # <-- The global estimate (optional)
└── n-neighbors-mode=absolute_size_n-neighbors=128
├── additional_pointwise_results_statistics.json # <-- The statistics of the pointwise results
├── local_estimates_pointwise_array.npy # <-- The vector of pointwise local estimate results
└── local_estimates_pointwise_meta.pkl # <-- The metadata of the vector of the pointwise computation
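Once the run has finished, the pointwise results can be loaded directly with `numpy`. The directory below is a placeholder for the run-specific path shown above; this snippet is only an illustration:

```python
import json
from pathlib import Path

import numpy as np

# Placeholder: replace with the run-specific results directory shown above.
results_dir = Path("data/analysis/local_estimates/<run-specific-subdirectories>")
neighbors_dir = results_dir / "n-neighbors-mode=absolute_size_n-neighbors=128"

# The vector of pointwise local estimates; its mean is the quantity tracked in the paper.
pointwise = np.load(neighbors_dir / "local_estimates_pointwise_array.npy")
print(f"number of points: {pointwise.shape[0]}, mean local dimension: {pointwise.mean():.3f}")

# Summary statistics of the pointwise results.
with (neighbors_dir / "additional_pointwise_results_statistics.json").open() as file:
    print(json.load(file))
```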
As a reference, you can also take a look at the VS Code launch configuration, which contains various configurations for running the pipeline in different ways. In the following sections, we will explain how to set up the experiments that we present in the paper.
For fine-tuning models on a language modeling task, we provide another `uv run` command that can be used with different configurations:
uv run finetune_language_model
To run the fine-tunings with the same parameters as in the paper, use the following script:
./topollm/experiments/fine_tuning_induces_dataset_specific_shifts_in_heterogeneous_local_dimensions/run_multiple_finetunings.sh
By default, the fine-tuned models are saved in the `data/models/finetuned_models` directory, with paths that describe the model and the dataset used for fine-tuning.
If successful, the fine-tuning script will also save a config file for the fine-tuned model into the `configs/language_model` directory, using the short name of the model.
For example, the config file of a RoBERTa base model fine-tuned on the first 10000 sequences of the MultiWOZ2.1 train dataset will be saved as:

configs/language_model/roberta-base-masked_lm-defaults_multiwoz21-rm-empty-True-do_nothing-ner_tags_train-10000-take_first-111_standard-None_5e-05-linear-0.01-5.yaml

The short model name `roberta-base-masked_lm-defaults_multiwoz21-rm-empty-True-do_nothing-ner_tags_train-10000-take_first-111_standard-None_5e-05-linear-0.01-5` can then be used to select a model in the local estimates computation pipeline.
To compute the local estimates for the base and fine-tuned models, you can use the pipeline and select the model via the `language_model` parameter.
We provide an example script to produce the data shown in Section 4.1 of the paper:
./topollm/experiments/fine_tuning_induces_dataset_specific_shifts_in_heterogeneous_local_dimensions/run_compute_local_estimates_for_base_and_finetuned_models.sh
The violin plots in the paper, which compare the local estimate distributions of the base and fine-tuned models, are created with the script:
uv run python3 topollm/plotting/plot_scripts_for_paper_draft/violin_plots_from_local_estimates.py
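For a quick look without the full plotting script, a minimal violin plot comparing two pointwise result arrays can be produced with `matplotlib`. The paths below are placeholders, and this sketch is not the script used to create the paper figures:

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder paths: pointwise local estimate arrays for a base and a fine-tuned model.
base = np.load("path/to/base/local_estimates_pointwise_array.npy")
finetuned = np.load("path/to/finetuned/local_estimates_pointwise_array.npy")

fig, ax = plt.subplots(figsize=(6, 4))
ax.violinplot([base, finetuned], showmeans=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["base", "fine-tuned"])
ax.set_ylabel("local intrinsic dimension")
fig.tight_layout()
fig.savefig("violin_comparison_base_vs_finetuned.png", dpi=300)
```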
Refer to our separate `grokking-via-lid` repository for instructions on how to run these experiments.
To train the dialogue state tracking models for which we compute the local estimates, use the official TripPy-R codebase. The TripPy-R repository contains information on how to set up the environment, obtain the data, and run the training scripts.
For reproducibility, we provide here the exact training script that we used for training the TripPy-R models, as well as a script to convert the data from the TripPy-R output format into a format compatible with the local estimates computation pipeline.
# Train the TripPy-R models
./topollm/experiments/local_dimensions_detect_exhaustion_of_training_capabilities/run_train_and_run_eval_of_trippy_r_models.sh
# Convert the TripPy-R output format to a format compatible with the local estimates computation pipeline
uv run topollm/experiments/local_dimensions_detect_exhaustion_of_training_capabilities/data_post_processing/load_cached_features_and_save_into_format_for_topo_llm.py --data_mode "trippy_r"
- You will likely need to update the values of the environment variables `CONVLAB3_REPOSITORY_BASE_PATH` and `TOOLS_DIR` to the locations of your local ConvLab-3 and TripPy-R repositories.
- To proceed with the local estimates computation, you also need to update the model file paths in the config file `configs/language_model/roberta-base-trippy_r_multiwoz21_short_runs.yaml` to the location where you place the model files.
- Similarly, you need to update the data paths in the config file `configs/data/trippy_r_dataloaders_processed.yaml`.
Use this script to compute the local estimates for the TripPy-R data and models:
./topollm/experiments/local_dimensions_detect_exhaustion_of_training_capabilities/run_compute_local_estimates_for_trippy_r_checkpoints.sh
Once all data is available, you can run the script `topollm/experiments/local_dimensions_detect_exhaustion_of_training_capabilities/create_plots.sh`.
For training the emotion recognition (ERC) models, use the official ConvLab-3 setup, which is explained in that repository.
For reproducibility, we provide here the exact training script that we used for training the ERC models:
./topollm/experiments/local_dimensions_detect_overfitting/run_train_erc_models.sh
- You need to update the value of the environment variable `CONVLAB3_REPOSITORY_BASE_PATH`.
- To proceed with the local estimates computation, check the file paths in the model config file `configs/language_model/bert-base-uncased-ContextBERT-ERToD_emowoz_basic_setup.yaml`.
- For the EmoWOZ data in a format compatible with this repository, check the config file `configs/data/ertod_emowoz.yaml`.
Run the following script to compute the local estimates for the ERC models:
./topollm/experiments/local_dimensions_detect_overfitting/run_compute_local_estimates_for_erc_model_checkpoints.sh
- You first need to parse the model task performance results from the log files using the script `topollm/task_performance_analysis/erc_models/parse_EmoLoop_ContextBERT_ERToD_logfile.py`.
- Once all data is available, you can run the script `topollm/experiments/local_dimensions_detect_overfitting/create_plots.sh`.
We provide a Python script that can be called via a `uv run` command to run the tests:
uv run tests
@misc{ruppik2025morelocalintrinsicdimensions,
title={Less is More: Local Intrinsic Dimensions of Contextual Language Models},
author={Benjamin Matthias Ruppik and Julius von Rohrscheidt and Carel van Niekerk and Michael Heck and Renato Vukovic and Shutong Feng and Hsien-chin Lin and Nurul Lubis and Bastian Rieck and Marcus Zibrowius and Milica Gašić},
year={2025},
eprint={2506.01034},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.01034},
note={To appear in NeurIPS 2025},
}