🛡️JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation
This repository includes the data and Python implementation for the paper "JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation" (accepted by the 34th USENIX Security Symposium, 2025).
This repo includes the following main components:
- Test Datasets: The datasets used in our evaluation, including jailbreak prompts from nine distinct attack types across five target LLMs, benign prompts from the Alpaca dataset, and harmful prompts from the AdvBench and Hex-PHI datasets. These datasets enable a comprehensive evaluation of JBShield’s detection and mitigation capabilities.
- Concept Extraction Code: Based on our proposed concept analysis methodology, we provide the implementation for extracting interpretable tokens associated with toxic and jailbreak concepts. This component aligns with our promise to share tools that facilitate understanding and manipulation of high-level concepts in LLM hidden representations.
-
JBShield-D
andJBShield-M
Implementation: We provides complete testing scripts for evaluating the detection (JBShield-D
) and mitigation (JBShield-M
) components ofJBShield
. These scripts are tailored to assess performance across five target LLMs, ensuring reproducibility of the results reported in the paper.
⭐ Note: The experimental results presented in this artifact may exhibit some variations due to differences in testing environments, the randomness in calibration set selection, and dataset size discrepancies across different jailbreak methods (DrAttack and Puzzler contain fewer samples than the others). Despite these variations, the overall trends and effectiveness of `JBShield` remain stable, as demonstrated by the extensive evaluations conducted in our study.
Please feel free to contact us via email if you have any questions about this repo.
The minimum hardware requirement is two GPUs with at least 24GB VRAM each (e.g., RTX 3090 or RTX 4090). For optimal performance, we recommend a setup with 4 RTX 4090 GPUs (24GB VRAM each) or 1 A100 GPU (80GB VRAM).
The code for `JBShield` runs with Python 3 and requires PyTorch. We recommend using Anaconda or Miniconda to manage your Python environment. Our code has been tested with `python=3.12.8` and `torch=2.5.1` on Linux. First, create a conda environment and activate it:
conda create -n jbshield python=3.12
conda activate jbshield
Please follow the official PyTorch installation instructions to install PyTorch, then `pip install` the following packages:
python -m pip install transformers fschat pandas accelerate nltk sentencepiece scikit-learn protobuf
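
As an optional sanity check (not part of the repository's scripts), you can verify that the installed PyTorch build detects your GPUs and report their available VRAM:

```python
# Optional sanity check: confirm PyTorch sees the GPUs and print their VRAM.
import torch

print("torch version:", torch.__version__)          # tested with 2.5.1
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
```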
Our dataset is located in `./data`. The jailbreak prompts are located in `./data/jailbreak`, while the harmful and harmless prompts can be found in `./data/harmful{}.csv` and `./data/harmless{}.csv`, respectively.
We randomly divided the data into calibration and test sets. The calibration data is located in `./data/jailbreak/{jailbreak_name}/{model_name}_calibration.json`, while the test data can be found in `./data/jailbreak/{jailbreak_name}/{model_name}_test.json`.
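
If you want to inspect the data directly, the sketch below shows one way to load a calibration split and a prompt CSV. The concrete file names and the internal structure of the JSON/CSV files are assumptions here; adjust them to the actual files under `./data`:

```python
import json
import pandas as pd

# Hypothetical file names; substitute the jailbreak/model identifiers present in ./data/jailbreak.
calibration_path = "./data/jailbreak/gcg/llama-2_calibration.json"
with open(calibration_path) as f:
    calibration_prompts = json.load(f)   # exact structure depends on the file
print(type(calibration_prompts), len(calibration_prompts))

# Harmful prompt CSV; the suffix in harmful{}.csv differs per source dataset (assumed name below).
harmful = pd.read_csv("./data/harmful_advbench.csv")
print(harmful.columns.tolist(), len(harmful))
```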
We selected five target LLMs and one judge LLM, as detailed in the table below:
| Model name | Link |
|---|---|
| Mistral-7B-Instruct-v0.2 | 🤗 [Huggingface link] |
| Llama-2-7b-chat-hf | 🤗 [Huggingface link] |
| Meta-Llama-3-8B-Instruct | 🤗 [Huggingface link] |
| vicuna-7b-v1.5 | 🤗 [Huggingface link] |
| vicuna-13b-v1.5 | 🤗 [Huggingface link] |
| ft-mistral-7b-instruct-v0.2-sorry-bench-202406 | 🤗 [Huggingface link] |
Mistral, LLaMA-2, LLaMA-3, Vicuna-7B, and Vicuna-13B are the target models in our evaluation; the fine-tuned SORRY-Bench Mistral model is the judge LLM. We need `git-lfs` to download the models. Install `git-lfs` with:
# For ubuntu
sudo apt install git-lfs
Run the following shell commands to download models:
cd models
chmod +x ./download_models.sh
./download_models.sh
Place the downloaded models in `./models`, structured as follows:
./models
├── Llama-2-7b-chat-hf
├── Meta-Llama-3-8B-Instruct
├── Mistral-7B-Instruct-v0.2
├── ft-mistral-7b-instruct-v0.2-sorry-bench-202406
├── vicuna-13b-v1.5
└── vicuna-7b-v1.5
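
To verify that a model downloaded correctly, a minimal check (illustrative only, not one of the evaluation scripts) is to load it from its local directory with `transformers` and generate a few tokens:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./models/vicuna-7b-v1.5"  # any of the directories listed above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype="auto")

# Short smoke test: a raw prompt without the model's chat template is enough to confirm loading.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```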
We provide the code and results for concept analysis on five target LLMs. Run the following shell commands to extract concept-related tokens:
chmod +x ./interpret.sh
./interpret.sh
The results are saved in `./interpret_results` (see Table 2 and Appendix A in our paper).
Run the following commands to evaluate the jailbreak detection performance of `JBShield-D` on the five LLMs:
chmod +x ./evaluate_detection.sh
./evaluate_detection.sh
The results are saved in `./logs/JBShield-D_{model_name}.log`. We have also provided the logs from our runs in the same directory.
Performance (accuracy/F1-score) of `JBShield-D` across different LLMs (see Table 4 in our paper):
Run the following commands to evaluate the jailbreak mitigation performance of `JBShield-M` on the five LLMs:
chmod +x ./evaluate_mitigation.sh
./evaluate_mitigation.sh
The results are saved in `./logs/JBShield-M.log`. We have also provided the logs from our runs in the same directory.
Performance (attack success rate) of `JBShield-M` across different LLMs (see Table 7 in our paper):
If you find this helpful, please cite our work:
@misc{zhang2025jbshield,
title={JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation},
author={Shenyi Zhang and Yuchen Zhai and Keyan Guo and Hongxin Hu and Shengnan Guo and Zheng Fang and Lingchen Zhao and Chao Shen and Cong Wang and Qian Wang},
year={2025},
eprint={2502.07557},
archivePrefix={arXiv},
primaryClass={cs.CR}
}