This is the official implementation of our ICML 2025 paper "Adversarial Reasoning at Jailbreaking Time".
As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important.
Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks.
In this paper, we apply these advances to the task of "model jailbreaking": eliciting harmful responses from aligned LLMs.
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation that achieves state-of-the-art (SOTA) attack success rates (ASR) against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
Before running the experiments, you need to configure your API keys, log in to your WandB account, and set the target LLM address:

- To query the black-box models, you must provide your API key(s) as environment variables. For example, if you are using TogetherAI’s API, set the key as follows:

  ```bash
  export TOGETHERAI_API_KEY=[YOUR_API_KEY_HERE]
  ```

  Alternatively, you can set your key in the first cell of `perc.ipynb`.
- Log in to your WandB account (used for logging) before running the code. You can do this by running `wandb login`.
- To read log-probability vectors from the target LLM, specify the appropriate address in the second cell of `perc.ipynb` (a minimal sketch of this setup follows the list).
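For reference, here is a minimal sketch of what these setup cells might contain, assuming the key is set via `os.environ`; the variable name `TARGET_ADDRESS` and the example URL are placeholders for illustration, not names from the codebase:

```python
import os

# Equivalent to the `export` command above, but set inside the notebook
# (first cell of perc.ipynb).
os.environ["TOGETHERAI_API_KEY"] = "[YOUR_API_KEY_HERE]"

# Placeholder: the endpoint of the target LLM that serves log-probability
# vectors, to be set in the second cell of perc.ipynb.
TARGET_ADDRESS = "http://localhost:8000/v1"
```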
For the demo, run the notebook `perc.ipynb`, which loads the target LLM and executes the attack. If you are attacking a black-box model, you will need to:

- Specify the address of the surrogate model in the notebook, and
- Call `get_target_responses_API_prop` from `utils.py` to query the black-box LLM (see the usage sketch after this list).
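The exact signature of `get_target_responses_API_prop` is defined in `utils.py` and is not reproduced here, so the call below is an assumption for illustration only:

```python
from utils import get_target_responses_API_prop

# Hypothetical call: check the actual signature in utils.py before use.
prompts = ["<adversarial prompt produced by the attack>"]
responses = get_target_responses_API_prop(prompts)
print(responses)
```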
The default parameters for the attack are:

- `num_iters = 15` (number of iterations for refinement),
- `num_branches = 8` (number of generated children from each reasoning string),
- `memory = 32` (buffer size for the search method), and
- `K = 2` (bucket size for adding randomness to the feedbacks).
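Collected in one place for convenience (the dictionary below is purely illustrative; in practice these values are set in the parameter cell of `perc.ipynb`):

```python
# Default attack parameters as documented above; how they are wired into the
# attack is determined by perc.ipynb, so this dict is only an illustration.
attack_params = {
    "num_iters": 15,    # number of iterations for refinement
    "num_branches": 8,  # children generated from each reasoning string
    "memory": 32,       # buffer size for the search method
    "K": 2,             # bucket size for adding randomness to the feedbacks
}

for name, value in attack_params.items():
    print(f"{name} = {value}")
```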