The official implementation of our COLING-2025 paper "Automated Progressive Red Teaming"
Ensuring the safety of large language models (LLMs) is paramount, yet identifying potential vulnerabilities is challenging. While manual red teaming is effective, it is time-consuming, costly, and lacks scalability. Automated red teaming offers a more cost-effective alternative, automatically generating adversarial prompts to expose LLM vulnerabilities. However, existing efforts lack a robust framework that explicitly frames red teaming as an effectively learnable task. To address this gap, we propose Automated Progressive Red Teaming (APRT) as such a framework. APRT leverages three core modules: an Intention Expanding LLM that generates diverse initial attack samples, an Intention Hiding LLM that crafts deceptive prompts, and an Evil Maker that manages prompt diversity and filters ineffective samples. Through multi-round interactions, the three modules collectively and progressively explore and exploit LLM vulnerabilities. In addition to the framework, we propose a novel indicator, the Attack Effectiveness Rate (AER), to mitigate the limitations of existing evaluation metrics. By measuring the likelihood of eliciting unsafe yet seemingly helpful responses, AER aligns closely with human evaluations. Extensive experiments with both automatic and human evaluations demonstrate the effectiveness of APRT across both open- and closed-source LLMs. Specifically, APRT elicits unsafe yet useful responses at rates of 54% from Meta's Llama-3-8B-Instruct, 50% from GPT-4o (API access), and 39% from Claude-3.5 (API access), showcasing its strong attack capability and transferability across LLMs (especially from open-source to closed-source LLMs).
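For intuition, here is a minimal sketch of one round of this loop. Every name below (`expander`, `hider`, `evil_maker`, `judge`, `target_llm`, and their methods) is a hypothetical placeholder for illustration, not the repository's actual API.

```python
# Hypothetical sketch of one APRT round; no name below comes from the real codebase.

def aprt_round(seeds, expander, hider, evil_maker, target_llm, judge):
    """Run one progressive round; return successful attacks and the round's AER."""
    # 1. Intention Expanding LLM: diversify the initial attack samples.
    candidates = [p for seed in seeds for p in expander.expand(seed)]

    # 2. Intention Hiding LLM: wrap each harmful intention in a deceptive prompt.
    disguised = [hider.hide(p) for p in candidates]

    # 3. Evil Maker: maintain prompt diversity and filter ineffective samples.
    attacks = evil_maker.filter(disguised)

    # 4. Attack the target model and judge each response.
    results = [(a, target_llm.generate(a)) for a in attacks]
    successes = [(a, r) for a, r in results if judge.unsafe_but_helpful(r)]

    # Attack Effectiveness Rate (AER): the fraction of attacks that elicit
    # unsafe yet seemingly helpful responses.
    aer = len(successes) / max(len(attacks), 1)
    return successes, aer
```

In the full framework, successful attacks from one round feed back into training the modules for the next round, which is what makes red teaming progressively learnable.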
- (2024/7/4) Our paper is on arXiv! Check it out [here](https://arxiv.org/abs/2407.03876)!
- (2024/11/30) Our paper has been accepted by COLING-2025!
- (2024/12/18) We have released a quick implementation of APRT, including both seed data and code!
- Get code

```bash
git clone https://github.com/tjunlp-lab/APRT.git
```
- Download checkpoints
  - Meta-Llama-3-8B-Instruct
  - Llama-Guard-3-8B
  - UltraRM-13b
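If these checkpoints are fetched from the Hugging Face Hub, one way to download them is via `huggingface_hub` (a sketch; the repository IDs below are our assumption based on the commonly published names, and the two Llama checkpoints are gated, so approved access is required):

```python
from huggingface_hub import snapshot_download

# Download the three checkpoints (Hub IDs assumed; adjust if the repo
# documents different sources).
for repo_id in (
    "meta-llama/Meta-Llama-3-8B-Instruct",  # target/base model (gated)
    "meta-llama/Llama-Guard-3-8B",          # safety classifier (gated)
    "openbmb/UltraRM-13b",                  # reward model
):
    snapshot_download(repo_id=repo_id, local_dir=repo_id.split("/")[-1])
```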
- Train initial checkpoints

```bash
# Please set load_init_model.json first
sh init_intention_hiding.sh     # train the initial Intention Hiding LLM
sh init_intention_expanding.sh  # train the initial Intention Expanding LLM
```
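The schema of `load_init_model.json` is defined by the repository; as a purely hypothetical illustration, assuming it points the initial trainers at the downloaded base checkpoint, it could be generated like this:

```python
import json

# Hypothetical contents for load_init_model.json. The field name below is a
# guess for illustration only -- consult the file shipped with the repo.
config = {"init_model_path": "Meta-Llama-3-8B-Instruct"}  # assumed key and path

with open("load_init_model.json", "w") as f:
    json.dump(config, f, indent=2)
```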
- Initialize Experiments

```bash
sh init_exp.sh
```
- Train APRT

```bash
sh auto_train.sh
```
If you have any questions about our work, please contact us at the following email address:
Bojian Jiang: [email protected]
If you find this work useful in your research, please leave a star and cite our paper:
```bibtex
@misc{jiang2024automatedprogressiveredteaming,
  title={Automated Progressive Red Teaming},
  author={Bojian Jiang and Yi Jing and Tianhao Shen and Tong Wu and Qing Yang and Deyi Xiong},
  year={2024},
  eprint={2407.03876},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2407.03876},
}
```