Large language models (LLMs) often display brittle refusal behavior that can be bypassed by simple linguistic variations, such as merely changing the tense of a harmful request (tense jailbreaking). ASGuard (Activation-Scaling Guard) is a mechanistically informed framework that addresses this vulnerability with precision and interpretability. It proceeds in three stages:
1. Circuit Analysis: identifies attention heads causally linked to tense-based jailbreaking.
2. Activation Scaling: learns channel-wise scaling vectors to recalibrate the vulnerable heads (a minimal sketch follows this list).
3. Preventative Fine-Tuning: reinforces robust refusal mechanisms while preserving model utility (see the loss sketch further below).
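To make the second stage concrete, here is a minimal sketch of channel-wise activation scaling in PyTorch. The `HeadScaler` module, the head dimension, and the way the scaler would be attached to a model are illustrative assumptions for exposition, not the released ASGuard implementation; see the repository code for the actual method.

```python
# Minimal sketch of channel-wise activation scaling on one attention head's
# output (illustrative only; the module name, head_dim, and hook placement
# are assumptions, not the actual ASGuard implementation).
import torch
import torch.nn as nn

class HeadScaler(nn.Module):
    """Learnable channel-wise scaling vector for a single attention head."""
    def __init__(self, head_dim: int):
        super().__init__()
        # Initialized to ones so the intervention starts as the identity.
        self.scale = nn.Parameter(torch.ones(head_dim))

    def forward(self, head_output: torch.Tensor) -> torch.Tensor:
        # head_output: (batch, seq_len, head_dim)
        return head_output * self.scale

# Toy usage: recalibrate a hypothetical "vulnerable" head's activations.
head_dim = 128
scaler = HeadScaler(head_dim)
head_output = torch.randn(2, 16, head_dim)   # dummy activations
recalibrated = scaler(head_output)

# Only the scaling vectors are trained; the base model stays frozen,
# keeping the intervention small and targeted.
optimizer = torch.optim.Adam(scaler.parameters(), lr=1e-3)
```

Because only the scaling vectors are trainable, the intervention stays lightweight and targeted to the heads identified in the circuit analysis.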
Evaluated across multiple LLMs, ASGuard achieves a Pareto-optimal tradeoff between safety and general capabilities, effectively reducing attack success rates while minimizing over-refusal. This work highlights how mechanistic insights can be translated into practical, efficient, and targeted safety interventions, advancing reliable and interpretable AI alignment.
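As a rough illustration of how this safety-utility tradeoff can be expressed during preventative fine-tuning, the sketch below combines a refusal cross-entropy term with a KL term that keeps benign-prompt behavior close to a frozen reference model. This is a generic recipe under stated assumptions: the function name `preventative_loss`, the weight `beta`, and the dummy tensors are hypothetical and not taken from the ASGuard codebase.

```python
# Hedged sketch of a preventative fine-tuning objective: encourage refusal on
# tense-rephrased harmful prompts while keeping behavior on benign prompts
# close to the frozen base model. The loss weights and the KL-based utility
# term are illustrative assumptions, not ASGuard's exact recipe.
import torch
import torch.nn.functional as F

def preventative_loss(refusal_logits, refusal_targets,
                      benign_logits, frozen_benign_logits, beta=0.1):
    # Safety term: next-token cross-entropy toward refusal completions.
    safety = F.cross_entropy(
        refusal_logits.flatten(0, 1), refusal_targets.flatten())
    # Utility term: stay close to the frozen base model on benign prompts.
    utility = F.kl_div(
        F.log_softmax(benign_logits, dim=-1),
        F.log_softmax(frozen_benign_logits, dim=-1),
        log_target=True, reduction="batchmean")
    return safety + beta * utility

# Dummy shapes: (batch, seq_len, vocab)
vocab = 32000
refusal_logits = torch.randn(2, 8, vocab, requires_grad=True)
refusal_targets = torch.randint(0, vocab, (2, 8))
benign_logits = torch.randn(2, 8, vocab, requires_grad=True)
frozen_benign_logits = torch.randn(2, 8, vocab)

loss = preventative_loss(refusal_logits, refusal_targets,
                         benign_logits, frozen_benign_logits)
loss.backward()
```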
To ensure compatibility with other libraries, we recommend the following versions. You can adjust them to match your environment:
- Python >= 3.10.14
- CUDA 12.2
Follow the steps below in order:
- Clone the repository:
git clone https://github.com/dmis-lab/ASGuard.git
cd ASGuard
- Install dependencies:
pip install -r requirements.txt
- Run the activation scaling stage to learn the channel-wise scaling vectors:
bash run_scaling.sh
- Run the preventative fine-tuning stage:
bash run_prevent.sh
For any questions or issues, feel free to reach out to [522yein (at) korea.ac.kr].