
[NAACL 2025 Main Selected Oral] Prompt Compression for Large Language Models: A Survey

Content

  • 🚀 News
  • ✏️ Todo
  • ✨ Introduction
  • 👀 Examples
  • 🌳 Tree Overview
  • 📖 Paper List
  • 🎨 Visualisations
  • 📌 Citation
  • 🔖 License

Links

  • Project Page
  • Paper

🚀 News

  • [2025.01.22] This paper was accepted by NAACL 2025 Main!
  • [2024.10.16] The paper was uploaded to arXiv.
 
 
 

✏️ Todo

 
 
 

✨ Introduction

A survey on prompt compression methods with insights and visualisations.

Contributions:

  • Methods Overview: An overview of prompt compression methods, categorized into hard prompt methods and soft prompt methods.
  • Insights: Multiple perspectives for understanding the mechanisms behind these methods.
  • Visualisations: Illustrations for various prompt compression methods.
 
 
 

👀 Examples

Illustrative examples of prompt compression methods. Hard prompt methods remove low-information tokens or paraphrase the prompt for conciseness. Soft prompt methods compress the text into a smaller number of special tokens, $<c_n>$. The grids below visualise attention patterns: the y-axis represents the sequence of tokens, and the x-axis shows the tokens they attend to. (Bottom left) Original prompt: each token attends to all previous tokens. (Bottom middle) Hard prompt (filtering): no token can attend to previously deleted tokens ($D_i$). (Bottom right) Soft prompt (whole): after the compression token ($C_i$) attends to all prior input tokens ($I_i$), subsequent output tokens ($O_i$) cannot attend to tokens before the compression token. A minimal code sketch of these masks follows.
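
A minimal NumPy sketch of the three attention patterns just described. The sequence lengths, deleted positions, and mask layout are illustrative assumptions, not the exact masks of any surveyed method:

```python
# Illustrative attention masks for the three panels described above.
# mask[i, j] == True means token i may attend to token j.
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Original prompt: each token attends to itself and all previous tokens."""
    return np.tril(np.ones((n, n), dtype=bool))

def filtering_mask(n: int, deleted: set) -> np.ndarray:
    """Hard prompt (filtering): no token can attend to deleted tokens (D_i),
    so the columns of the deleted positions are masked out."""
    mask = causal_mask(n)
    mask[:, sorted(deleted)] = False
    return mask

def compression_mask(n_input: int, n_comp: int, n_output: int) -> np.ndarray:
    """Soft prompt (whole): compression tokens (C_i) attend causally to all
    input tokens (I_i), but output tokens (O_i) cannot attend to any token
    before the compression tokens."""
    n = n_input + n_comp + n_output
    mask = causal_mask(n)
    mask[n_input + n_comp:, :n_input] = False  # outputs cannot see raw inputs
    return mask

print(filtering_mask(5, deleted={1, 3}).astype(int))
print(compression_mask(n_input=3, n_comp=1, n_output=2).astype(int))
```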

 
 
 

🌳 Tree Overview

Hierarchical overview of prompt compression methods and their downstream adaptations. For downstream adaptations, compression methods that do not belong to a specific category are grouped under general QA.

 
 
 

📖 Paper List

Hard Prompt Methods:

  • Filtering:

    • General:

      • [SelectiveContext] Compressing Context to Enhance Inference Efficiency of Large Language Models
      • [LLMLingua] Compressing Prompts for Accelerated Inference of Large Language Models
      • [LongLLMLingua] Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
      • [AdaComp] Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models
    • Distillation Enhanced:

      • [LLMLingua-2] Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
    • RL Enhanced:

      • [TACO-RL] Task Aware Prompt Compression Optimization with Reinforcement Learning
      • [PCRL] Discrete Prompt Compression with Reinforcement Learning
    • Embedding Enhanced:

      • [CPC] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
      • [TCRA-LLM] Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
  • Paraphrasing:

    • (No subcategory)

      • [Nano-Capsulator] Learning to Compress Prompt in Natural Language Formats
      • [CompAct] Compressing Retrieved Documents Actively for Question Answering
      • [FAVICOMP] Familiarity-Aware Evidence Compression for Retrieval-Augmented Generation

Soft Prompt Methods:

  • Decoder Only:

    • Not Finetuned:

      • [CC] Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models
    • Finetuned:

      • [GIST] Learning to Compress Prompts with Gist Tokens
      • [AutoCompressor] Adapting Language Models to Compress Contexts
  • Encoder-decoder:

    • Both Finetuned:

      • [COCOM] Context Embeddings for Efficient Answer Generation in RAG
      • [LLoCO] Learning Long Contexts Offline
    • Finetuned Encoder:

      • [ICAE] In-context Autoencoder for Context Compression in a Large Language Model
      • [500xCompressor] Generalized Prompt Compression for Large Language Models
      • [QGC] Retaining Key Information under High Compression Ratios: Query-Guided Compressor for LLMs
    • Embedding Encoder:

      • [xRAG] Extreme Context Compression for Retrieval-augmented Generation with One Token
    • Projector:

      • [UniICL] Unifying Demonstration Selection and Compression for In-Context Learning

Applications:

  • RAG:

    • (No subcategory)

      • [xRAG] Extreme Context Compression for Retrieval-augmented Generation with One Token
      • [RECOMP] Improving Retrieval-Augmented LMs with Context Compression and Selective Augmentation
      • [COCOM] Context Embeddings for Efficient Answer Generation in RAG
      • [CompAct] Compressing Retrieved Documents Actively for Question Answering
      • [FAVICOMP] Familiarity-Aware Evidence Compression for Retrieval-Augmented Generation
      • [AdaComp] Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models
      • [LLoCO] Learning Long Contexts Offline
      • [TCRA-LLM] Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
  • Agents:

    • (No subcategory)

      • [HD-Gist] Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage
      • [Link] Concise and Precise Context Compression for Tool-Using Language Models
  • Domain-specific tasks:

    • (No subcategory)

      • [Tag-LLM] Repurposing General-Purpose LLMs for Specialized Domains
      • [CoLLEGe] Concept Embedding Generation for Large Language Models
  • Others:

    • (No subcategory)

      • [ICL] Unifying Demonstration Selection and Compression for In-Context Learning
      • [Role Playing] Extensible Prompts for Language Models on Zero-shot Language Style Customization
      • [Functions] Function Vectors in Large Language Models
 
 
 

🎨 Visualisations

Architectures of various prompt compression models using hard prompt methods. For SelectiveContext and LLMLingua, the bottom language models filter the prompt tokens without modifying them, serving as selection mechanisms. In Nano-Capsulator, the bottom LLM generates a paraphrased version of the input prompt, which then serves as input for the LLM above. "SLM" stands for "small language model". "Close LLM" refers to closed-source language models that accept only natural language inputs through API calls.
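
To make the filtering architecture concrete, below is a minimal sketch in the spirit of SelectiveContext and LLMLingua, assuming a Hugging Face causal LM as the SLM. The choice of gpt2, the self-information scoring rule, and the fixed keep ratio are illustrative assumptions, not either paper's exact algorithm:

```python
# A small causal LM scores each token's self-information (negative log
# probability given its prefix); low-information tokens are dropped.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # the "SLM" in the figure
slm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = slm(ids).logits
    # Self-information of token t given tokens < t (the first token has no prefix).
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -logp.gather(1, ids[0, 1:, None]).squeeze(1)
    k = max(1, int(keep_ratio * surprisal.numel()))
    keep = torch.topk(surprisal, k).indices.sort().values + 1  # back to token positions
    kept_ids = torch.cat([ids[0, :1], ids[0][keep]])           # always keep token 0
    return tok.decode(kept_ids)

print(compress("Please summarize the following long report about quarterly sales."))
```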

 

Architectures of various prompt compression models using soft prompt methods. Tokens with diagonal stripes represent output tokens produced by the language models. Unlike in hard prompt methods, the bottom LLMs in soft prompt methods process the input tokens, and their outputs (the tokens with diagonal stripes) serve as input for the LLMs above.
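
Below is a minimal sketch of this soft prompt pattern, in the spirit of GIST/ICAE-style compression rather than any single paper's exact architecture. The compression embeddings are randomly initialized stand-ins for learned ones, and one gpt2 model plays both the bottom and top LM:

```python
# Bottom LM reads the input plus appended compression tokens <c_1..c_n>;
# its hidden states at the compression positions become the top LLM's input.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # bottom and top LM share weights here

n_comp = 4
# Learned compression-token embeddings in a real system; random in this sketch.
comp_embeds = torch.nn.Parameter(torch.randn(n_comp, lm.config.n_embd) * 0.02)

def compress(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    input_embeds = lm.get_input_embeddings()(ids)               # [1, L, d]
    full = torch.cat([input_embeds, comp_embeds[None]], dim=1)  # append <c_1..c_n>
    hidden = lm(inputs_embeds=full, output_hidden_states=True).hidden_states[-1]
    return hidden[:, -n_comp:]                                  # states at compression positions

with torch.no_grad():
    soft_prompt = compress("A long context to be compressed into a few tokens.")
    out = lm(inputs_embeds=soft_prompt)  # top LLM sees only the compressed tokens
print(soft_prompt.shape)                 # torch.Size([1, 4, 768])
```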

 
 
 

📌 Citation

@inproceedings{li-etal-2025-prompt,
    title = "Prompt Compression for Large Language Models: A Survey",
    author = "Li, Zongqian  and
      Liu, Yinhong  and
      Su, Yixuan  and
      Collier, Nigel",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.368/",
    pages = "7182--7195",
    ISBN = "979-8-89176-189-6",
}
 
 
 

🔖 License

