Eliciting Suppressed Knowledge (ESK)

Abstract

Where do transformer models store their true "thoughts" when they say something they know is false? We demonstrate that suppressed neural activations are a more useful source of knowledge than the model's direct outputs or standard hidden states. By isolating and probing these suppressed activation patterns, we achieve ~20% AUROC improvements on TruthfulQA compared to the LLM answer, and . This confirms suppressed activations contain knowledge that the model possesses but deliberately inhibits during generation.

Key Results

Linear probes targeting suppressed activations consistently outperform both naive outputs and standard activation probes across model scales. The performance gap (~X%) represents recoverable truthful knowledge that remains encoded but deliberately suppressed during normal generation.

Research Question

Recent evidence demonstrates that transformer models systematically misrepresent their internal reasoning:

Claude 3.7 System Card research reveals only 30% faithfulness in chain-of-thought reasoning, indicating models "do not reliably report the presence and use of clues" that determined their answers (Anthropic, 2025)
OpenAI research confirms that models "learn to hide intent in the chain-of-thought" when penalized for expressing certain thoughts (OpenAI, 2025)

This evidence presents a fundamental question: If models systematically misrepresent their reasoning processes in tokens, where is their actual reasoning encoded?

Relation to Prior Work

Our approach connects directly to two emerging lines of research:

Suppression/Prediction Neural Dynamics:
- Gurnee et al. (2024) identified "universal neurons" across different model seeds, including prediction neurons (increasing probability of related tokens) and suppression neurons (decreasing probability of specific token classes)
- The architecture shows "a sudden shift towards a much larger number of suppression neurons" in final layers
- Lad et al. (2024) propose a "stages of inference" hypothesis with a final "residual sharpening" phase dominated by suppression dynamics
Unfaithful Chain-of-Thought:
- Anthropic (2025) demonstrates that even in leading models like Claude 3.7, chain-of-thought reasoning achieves only 30% faithfulness
- OpenAI (2025) shows that penalizing "bad thoughts" leads to models that "learn to hide intent" rather than genuinely correcting reasoning
- Both lines of evidence suggest models maintain internal representations that diverge from their expressed reasoning

Where previous work focused on architectural components (identifying suppression neurons) or documenting the unfaithfulness phenomenon, our research bridges these streams by showing we can extract more accurate information from the very activations being suppressed.

Hypothesis

Suppressed neural activations contain more accurate information than what appears in model outputs. A linear probe of these suppressed activations should therefore outperform both direct model outputs and probes of standard hidden states on truthfulness tasks.

Alternative hypothesis include:

It is non linear (we use linear probes here, this hypothesis not explored)
It is in the kv_cache of layers not directly connected to the hidden state (explored below with other activations)
it is in the residual stream (explored below)
it is in the output logits (explored below)

Method

Our approach isolates suppressed activations by leveraging the differential impact of layers on token probabilities:

Extract hidden states from model layers
Project to logit space using output embeddings
Compute layer-to-layer differences in logit space
Project negative changes back to activation space
Apply this suppression mask to hidden states
Train linear probes on these suppressed patterns

This method exploits the "residual sharpening" stage identified by Lad et al. (2024), specifically targeting information that the model actively filters out.

Key Results

Linear probes targeting suppressed activations consistently outperform both naive outputs and standard activation probes across model scales. The performance gap (~X%) represents recoverable truthful knowledge that remains encoded but deliberately suppressed during normal generation.

Performance Breakdown

top reduction for each data type

group	name	ROC AUC Score	data
mixed	supressed_hs(0.1)\magnitude(0.25)\sum	0.878431	supressed_hs
act sink rm	hidden_states\magnitude(0.99)\std	0.862745	hidden_states
supressed_hs	supressed_hs(1)\none\sum	0.858824	supressed_hs
llm prob ratio	llm_log_prob_true\|	0.843137	llm_log_prob_true
acts-self_attn	acts-self_attn\none\mean	0.810784	acts-self_attn
acts-mlp.up_proj	acts-mlp.up_proj\none\sum	0.763725	acts-mlp.up_proj
acts-mlp.down_proj	acts-mlp.down_proj\none\std	0.704902	acts-mlp.down_proj
supr_amounts	supr_amounts\none\sum	0.703922	supr_amounts
hidden_states	hidden_states\none\flatten	0.669608	hidden_states
llm_ans	llm_ans\|	0.639216	llm_ans

The highest-performing probe was on the final layer's suppressed activations (supressed_hs), following by filtering out high magnitudes (to remove activation sinks). This supports the hypothesis that the final "residual sharpening" stage specifically suppresses certain information pathways.

Remaining Questions

Is the KV Cache more usefull than the suppressed activations? This seems to be a promising alternative hypothesis that we have not tested.
Is this effect in all models (base, chat, multimodal, etc)?
Can this method improving interventions such as steering or forgetting

Citation

@software{clark2025esk,
  author = {Michael J Clark},
  title = {Eliciting Suppressed Knowledge: Recovering Truth from Suppressed Activations in Language Models},
  url = {https://github.com/wassname/eliciting_suppressed_knowledge},
  year = {2025},
}

License

MIT License

Appendix: Reproducing Results

Setup

# first install uv

# then install the requirements
uv sync

# note open the notebook in jupyter or vscode
code nbs/TQA_regr 3b.ipynb

see TQA_regr 3b.ipynb

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
figs		figs
nbs		nbs
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
pyproject.toml		pyproject.toml
research_journal.md		research_journal.md
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Eliciting Suppressed Knowledge (ESK)

Abstract

Key Results

Research Question

Relation to Prior Work

Hypothesis

Method

Key Results

Performance Breakdown

Remaining Questions

Citation

License

Appendix: Reproducing Results

About

Uh oh!

Releases

Packages

Uh oh!

Languages

wassname/eliciting_suppressed_knowledge

Folders and files

Latest commit

History

Repository files navigation

Eliciting Suppressed Knowledge (ESK)

Abstract

Key Results

Research Question

Relation to Prior Work

Hypothesis

Method

Key Results

Performance Breakdown

Remaining Questions

Citation

License

Appendix: Reproducing Results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages