Investigating embedding spaces generated by language models from a topological perspective via local intrinsic dimension (LID).

Less is More: Local Intrinsic Dimensions of Contextual Language Models (Topo_LLM)

Overview

This repository contains code for analyzing the representations produced by contextual language models from a topological perspective. In particular, we study changes in the local intrinsic dimension (LID) of the model's hidden states in different scenarios.

Details can be found in our paper Less is More: Local Intrinsic Dimensions of Contextual Language Models.

Abstract:

Understanding the internal mechanisms of large language models (LLMs) remains a challenging and complex endeavor. Even fundamental questions, such as how fine-tuning affects model behavior, often require extensive empirical evaluation. In this paper, we introduce a novel perspective based on the geometric properties of contextual latent embeddings to study the effects of training and fine-tuning. To that end, we measure the local dimensions of a contextual language model's latent space and analyze their shifts during training and fine-tuning. We show that the local dimensions provide insights into the model's training dynamics and generalization ability. Specifically, the mean of the local dimensions predicts when the model's training capabilities are exhausted, as exemplified in a dialogue state tracking task, overfitting, as demonstrated in an emotion recognition task, and grokking, as illustrated with an arithmetic task. Furthermore, our experiments suggest a practical heuristic: reductions in the mean local dimension tend to accompany and predict subsequent performance gains. Through this exploration, we aim to provide practitioners with a deeper understanding of the implications of fine-tuning on embedding spaces, facilitating informed decisions when configuring models for specific applications. The results of this work contribute to the ongoing discourse on the interpretability, adaptability, and generalizability of LLMs by bridging the gap between intrinsic model mechanisms and geometric properties in the respective embeddings.

Quick start

# Clone the repository and navigate to the directory
git clone [REPOSITORY_URL] Topo_LLM
cd Topo_LLM

# Call the setup script to set the environment variables
./topollm/setup/setup_environment.sh

# Run the pipeline to compute local estimates with default parameters
uv run pipeline_local_estimates

Installation

Prerequisites

  • Python 3.12
  • uv package manager

macOS-specific instructions

On macOS, you can install uv with Homebrew:

brew install uv

Ubuntu/Debian instructions

Install pipx via apt and then install uv via pipx:

sudo apt update
sudo apt install pipx
pipx ensurepath

pipx install uv

Installation instructions with uv

  1. Install the required Python version with uv:
uv python install 3.12
  2. You can check the installed Python versions with:
uv python list
  3. Lock the dependencies with uv and sync the environment:
uv lock
uv sync
  4. Start a Python interpreter in the local environment:
uv run python3

Specific instructions for HPC Cluster

On some HPC clusters, you might need to pin a torch version in the pyproject.toml file to make the installation of torch and a compatible CUDA version work. For example, on our HPC cluster, it currently appears to work when you set the torch version to 2.3.*:

torch = "2.3.*"

Project-specific setup

  1. Set the correct environment variables used in the project config. This step can be achieved by running the setup script in the topollm/setup/ directory once:
./topollm/setup/setup_environment.sh
  2. If required, e.g. when running jobs on an HPC cluster, set the correct environment variables in the .env file in the project root directory.

  3. To set up the repository to support job submissions to an HPC cluster, follow the instructions for our Hydra HPC Launcher. Additional submission scripts are located in the topollm/scripts/submission_scripts directory.

  4. Download the files required by nltk: start a Python interpreter and run the following:

>>> import nltk
>>> nltk.download('punkt_tab')
>>> nltk.download('averaged_perceptron_tagger_eng')
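Alternatively, the same downloads can be run non-interactively from the shell:

uv run python3 -c "import nltk; nltk.download('punkt_tab'); nltk.download('averaged_perceptron_tagger_eng')"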

Project Structure

Config file management

We use Hydra for the config management. Please see the documentation and the experiments below for examples of how to use the config files in configs and command line overrides.

Data directory

The data directory is set in most of the Python scripts via the Hydra config (see the script topollm/config_classes/get_data_dir.py for a common function to access the data directory path). We additionally set the path to the local data directory in the .env file in the project root directory, in the variable LOCAL_TOPO_LLM_DATA_DIR. Most of the shell scripts use this variable to set the data directory path.

For compatibility, please make sure that these paths are set correctly and point to the same directory.
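A minimal sketch of such a consistency check (the Hydra-side value below is a hard-coded stand-in, not the repository's API):

import os
from pathlib import Path

# Value from the .env file, as used by the shell scripts.
env_data_dir = Path(os.environ["LOCAL_TOPO_LLM_DATA_DIR"]).resolve()

# Stand-in for the data_dir value resolved from the Hydra config.
hydra_data_dir = Path("data").resolve()

if env_data_dir != hydra_data_dir:
    raise RuntimeError(f"Data directory mismatch: {env_data_dir} vs. {hydra_data_dir}")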

Datasets

The available datasets can be used via their config files in configs/data.

To prepare the dialogue datasets, which will be saved as .jsonl files in the directory data/datasets/dialogue_datasets, use the script topollm/data_processing/prepare_dialogue_data.py. The dialogue dataset preparation requires an installation of ConvLab-3, which is not included in the uv environment to avoid version conflicts.

Usage

General instructions to run the pipeline

We provide uv run commands in the pyproject.toml file for the most important entry points of the module. Most importantly, the following command runs the full pipeline:

  • from computing embeddings,
  • to the embedding data preparation (collecting token embeddings, removing padding token embeddings),
  • and finally computing the local estimates (for example, the TwoNN-based local dimension).

uv run pipeline_local_estimates

Usually, you will want to select a specific dataset and model to run the pipeline on, and change additional hyperparameters. You can do this by passing the parameters as command line arguments to the uv run command:

uv run pipeline_local_estimates \
  data="wikitext-103-v1" \
  data.data_subsampling.number_of_samples=512 \
  data.data_subsampling.sampling_mode="random" \
  data.data_subsampling.split="validation" \
  data.data_subsampling.sampling_seed=778 \
  language_model="roberta-base" \
  embeddings.embedding_data_handler.mode="regular" \
  embeddings_data_prep.sampling.num_samples=3000 \
  local_estimates=twonn \
  local_estimates.filtering.num_samples=500 \
  local_estimates.pointwise.absolute_n_neighbors=128

The parameters in the command line override the default parameters in the config file. Here, we explain the parameters in the example command:

  • data="wikitext-103-v1": Compute local estimates for the Wikipedia dataset.
  • data.data_subsampling.number_of_samples=512: Sample 512 sequences from the dataset (i.e., set M=512 as size of the text corpus sequence sub-sample).
  • data.data_subsampling.sampling_mode="random": Randomly sample the sequences from the dataset (other option is take_first for taking the first M sequences).
  • data.data_subsampling.split="validation": Use the validation split of the dataset (other options are train and test or dev, depending on the dataset).
  • data.data_subsampling.sampling_seed=778: Set the random seed for the random sequences sampling.
  • language_model="roberta-base": Use the RoBERTa base model for embeddings.
  • embeddings.embedding_data_handler.mode="regular": Use the regular mode for the embeddings (other option is masked_token for masked language models).
  • embeddings_data_prep.sampling.num_samples=3000: This many non-padding tokens are sampled from the sequences in the embeddings data preparation step.
  • local_estimates=twonn: Compute the TwoNN-based local estimates (other options are lpca for local PCA based dimension estimates).
  • local_estimates.filtering.num_samples=500: This many non-padding tokens are sampled from the tokens (i.e., set N=500 as size of the token sub-sample).
  • local_estimates.pointwise.absolute_n_neighbors=128: Use 128 neighbors for the pointwise local estimates (i.e., set L=128 as the local neighborhood size).
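For intuition, here is a minimal, self-contained sketch of a TwoNN-style pointwise estimate (this is not the repository's implementation; the neighborhood aggregation and all names are illustrative assumptions):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def pointwise_twonn(points: np.ndarray, n_neighbors: int = 128) -> np.ndarray:
    """Estimate a local intrinsic dimension for every point via TwoNN distance ratios."""
    # Distances to the two nearest neighbors of every point (column 0 is the point itself).
    dists, _ = NearestNeighbors(n_neighbors=3).fit(points).kneighbors(points)
    log_mu = np.log(dists[:, 2] / dists[:, 1])  # log ratio of 2nd to 1st neighbor distance

    # Aggregate the TwoNN maximum-likelihood estimate over each point's local neighborhood.
    _, idx = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(points).kneighbors(points)
    return np.array([len(nb) / np.sum(log_mu[nb]) for nb in idx[:, 1:]])  # drop self

# Example with random stand-in data: 3000 vectors with intrinsic dimension 16.
rng = np.random.default_rng(42)
estimates = pointwise_twonn(rng.normal(size=(3000, 16)), n_neighbors=128)
print(f"Mean local dimension: {estimates.mean():.2f}")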

The results will be saved in the data_dir specified in the config file, and the file path will contain information about the model and dataset used, together with the additional hyperparameter choices. For example, the results for the command above will be saved in the following directory:

data/analysis/local_estimates/data=wikitext-103-v1_strip-True_rm-empty=True_spl-mode=proportions_spl-shuf=True_spl-seed=0_tr=0.8_va=0.1_te=0.1_ctxt=dataset_entry_feat-col=ner_tags/split=validation_samples=512_sampling=random_sampling-seed=778/edh-mode=regular_lvl=token/add-prefix-space=False_max-len=512/model=roberta-base_task=masked_lm_dr=defaults/layer=-1_agg=mean/norm=None/sampling=random_seed=42_samples=3000/desc=twonn_samples=500_zerovec=keep_dedup=array_deduplicator_noise=do_nothing/

After the computation, this directory contains the following files:

.
├── additional_distance_computations_results.json
├── array_for_estimator.npy # <-- The array of vectors used for the local estimates computation (optional)
├── global_estimate.npy # <-- The global estimate (optional)
└── n-neighbors-mode=absolute_size_n-neighbors=128
    ├── additional_pointwise_results_statistics.json # <-- The statistics of the pointwise results
    ├── local_estimates_pointwise_array.npy # <-- The vector of pointwise local estimate results
    └── local_estimates_pointwise_meta.pkl # <-- The metadata of the vector of the pointwise computation
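A minimal sketch for inspecting these outputs in Python (the file names come from the listing above; their exact contents are an assumption):

import json
import pickle
from pathlib import Path

import numpy as np

# Placeholder path; use the result directory produced by your pipeline run.
result_dir = Path("data/analysis/local_estimates/.../n-neighbors-mode=absolute_size_n-neighbors=128")

local_dims = np.load(result_dir / "local_estimates_pointwise_array.npy")
print(f"Number of tokens: {local_dims.shape[0]}, mean local dimension: {local_dims.mean():.2f}")

with (result_dir / "additional_pointwise_results_statistics.json").open() as file:
    statistics = json.load(file)

with (result_dir / "local_estimates_pointwise_meta.pkl").open("rb") as file:
    metadata = pickle.load(file)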

As a reference, you can also take a look at the VS Code launch configuration, which contains various configurations for running the pipeline in different ways. In the following sections, we will explain how to set up the experiments that we present in the paper.

Experiments: Fine-Tuning Induces Dataset-Specific Shifts in Heterogeneous Local Dimensions

Fine-tuning the language model

For fine-tuning models on a language modeling task, we provide another uv run command that can be used with different configurations:

uv run finetune_language_model

To run the fine-tunings with the same parameters as in the paper, use the following script:

./topollm/experiments/fine_tuning_induces_dataset_specific_shifts_in_heterogeneous_local_dimensions/run_multiple_finetunings.sh

By default, the fine-tuned models are saved in the data/models/finetuned_models directory, with paths that describe the model and the dataset used for fine-tuning. If successful, the fine-tuning script will also save a config file for the fine-tuned model into the configs/language_model directory, using the short name of the model. For example, the config file of a RoBERTa base model fine-tuned on the first 10000 sequences of the MultiWOZ2.1 train dataset will be saved as:

configs/language_model/roberta-base-masked_lm-defaults_multiwoz21-rm-empty-True-do_nothing-ner_tags_train-10000-take_first-111_standard-None_5e-05-linear-0.01-5.yaml

The short model name roberta-base-masked_lm-defaults_multiwoz21-rm-empty-True-do_nothing-ner_tags_train-10000-take_first-111_standard-None_5e-05-linear-0.01-5 can then be used to select a model in the local estimates computation pipeline.
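For example, using the override syntax shown above:

uv run pipeline_local_estimates \
  language_model="roberta-base-masked_lm-defaults_multiwoz21-rm-empty-True-do_nothing-ner_tags_train-10000-take_first-111_standard-None_5e-05-linear-0.01-5"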

Local estimates computation for the fine-tuned models

To compute the local estimates for the base and fine-tuned models, you can use the pipeline and select the model via the language_model parameter. We provide an example script to produce the data shown in Section 4.1 of the paper:

./topollm/experiments/fine_tuning_induces_dataset_specific_shifts_in_heterogeneous_local_dimensions/run_compute_local_estimates_for_base_and_finetuned_models.sh

Create the violin plots

The violin plots in the paper, which compare the local estimate distributions of base and fine-tuned models, are created with the script:

uv run python3 topollm/plotting/plot_scripts_for_paper_draft/violin_plots_from_local_estimates.py

Experiments: Local Dimensions Detect Grokking

Refer to our separate grokking-via-lid repository for instructions on how to run these experiments.

Experiments: Local Dimensions Detect Exhaustion of Training Capabilities

Train the TripPy-R dialogue state tracking models

To train the dialogue state tracking models for which we compute the local estimates, use the official TripPy-R codebase. That repository contains information on how to set up the environment, obtain the data, and run the training scripts.

For reproducibility, we provide the exact training script that we used for training the TripPy-R models here, together with a script to convert the data from the TripPy-R output format to a format compatible with the local estimates computation pipeline:

# Train the TripPy-R models
./topollm/experiments/local_dimensions_detect_exhaustion_of_training_capabilities/run_train_and_run_eval_of_trippy_r_models.sh
# Convert the TripPy-R output format to a format compatible with the local estimates computation pipeline
uv run topollm/experiments/local_dimensions_detect_exhaustion_of_training_capabilities/data_post_processing/load_cached_features_and_save_into_format_for_topo_llm.py --data_mode "trippy_r"

  • You should probably update the values of the environment variables CONVLAB3_REPOSITORY_BASE_PATH and TOOLS_DIR to the locations of your local ConvLab-3 and TripPy-R repositories (see the example after this list).
  • To proceed with the local estimates computation, you also need to update the model file paths in the config file configs/language_model/roberta-base-trippy_r_multiwoz21_short_runs.yaml to the location where you place the model files.
  • Similarly, you need to update the data paths in the config file configs/data/trippy_r_dataloaders_processed.yaml.
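For instance, the corresponding entries in the .env file might look like this (the paths are placeholders for your local checkouts):

CONVLAB3_REPOSITORY_BASE_PATH=/path/to/ConvLab-3
TOOLS_DIR=/path/to/trippy-r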

Local estimates computation for the TripPy-R models

Use this script to compute the local estimates for the TripPy-R data and models:

./topollm/experiments/local_dimensions_detect_exhaustion_of_training_capabilities/run_compute_local_estimates_for_trippy_r_checkpoints.sh

Create plots comparing local dimensions and task performance for the TripPy-R models

Once all data is available, you can run the script topollm/experiments/local_dimensions_detect_exhaustion_of_training_capabilities/create_plots.sh.

Experiments: Local Dimensions Detect Overfitting

Train the ERC models

For training the emotion recognition (ERC) models, you should use the official ConvLab-3 setup, which is explained in that repository.

For reproducibility, we provide the exact training script that we used for training the ERC models here:

./topollm/experiments/local_dimensions_detect_overfitting/run_train_erc_models.sh

  • You need to update the value of the environment variable CONVLAB3_REPOSITORY_BASE_PATH.
  • To proceed with the local estimates computation, check the file paths in the model config file configs/language_model/bert-base-uncased-ContextBERT-ERToD_emowoz_basic_setup.yaml.
  • For the EmoWOZ data in a format compatible with this repository, check the config file configs/data/ertod_emowoz.yaml.

Local estimates computation for the ERC models

Run the following script to compute the local estimates for the ERC models:

./topollm/experiments/local_dimensions_detect_overfitting/run_compute_local_estimates_for_erc_model_checkpoints.sh

Create plots comparing local dimensions and task performance for the ERC models

  • You first need to parse the model task performance results from the log files using the script topollm/task_performance_analysis/erc_models/parse_EmoLoop_ContextBERT_ERToD_logfile.py.
  • Once all data is available, you can run the script topollm/experiments/local_dimensions_detect_overfitting/create_plots.sh.

Run tests

We provide a Python script that can be called via a uv run command to run the tests:

uv run tests

Citation

@misc{ruppik2025morelocalintrinsicdimensions,
      title={Less is More: Local Intrinsic Dimensions of Contextual Language Models}, 
      author={Benjamin Matthias Ruppik and Julius von Rohrscheidt and Carel van Niekerk and Michael Heck and Renato Vukovic and Shutong Feng and Hsien-chin Lin and Nurul Lubis and Bastian Rieck and Marcus Zibrowius and Milica Gašić},
      year={2025},
      eprint={2506.01034},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.01034},
      note={To appear in NeurIPS 2025},
}
