Official code for the paper *Capability-Based Scaling Laws for LLM Red-Teaming* ([arXiv](https://arxiv.org/abs/2505.20162)).
Overview of our contributions:

1. We evaluate over 500 attacker-target combinations with two jailbreak techniques and find that attack success rate scales linearly with the attacker's general capability (measured by MMLU-Pro score).
2. For a fixed target model, however, attack success rate follows a sigmoid-like curve and can be predicted from the attacker-target capability gap.
3. Using the resulting capability-based scaling law, we forecast that red-teaming with a fixed attacker, such as a human, will inevitably become less effective as target models' capabilities increase.
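As an illustration of point (2), here is a minimal, self-contained sketch of fitting such a sigmoid to (capability gap, attack success rate) pairs with SciPy. The numbers and variable names below are assumptions for demonstration only, not data or code from this repository:

```python
# Illustrative sketch: model attack success rate (ASR) as a sigmoid of the
# attacker-target capability gap. All data below is made up for demonstration.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(gap, k, x0):
    """ASR as a logistic function of the capability gap (attacker - target)."""
    return 1.0 / (1.0 + np.exp(-k * (gap - x0)))

# Hypothetical (attacker MMLU-Pro - target MMLU-Pro) gaps and observed ASRs.
gaps = np.array([-0.30, -0.20, -0.10, 0.00, 0.10, 0.20, 0.30])
asr  = np.array([ 0.05,  0.10,  0.25, 0.45, 0.70, 0.85, 0.92])

(k, x0), _ = curve_fit(sigmoid, gaps, asr, p0=[10.0, 0.0])
print(f"steepness k={k:.2f}, midpoint x0={x0:.3f}")
print("predicted ASR at gap 0.15:", sigmoid(0.15, k, x0))
```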
This code was tested with Python 3.10. Start by cloning the repository:

```bash
git clone [repository-url]
cd capability-based-scaling
```
Create and activate a conda environment:
```bash
conda create -n model_unlocking python=3.10
conda activate model_unlocking
```
Install the required dependencies:
```bash
pip install -r requirements.txt
```
For running evaluations, install the LM Evaluation Harness:
```bash
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
cd ..
```
For the LLaMA-Factory training backend:

```bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install --no-deps -e .
cd ..
```
Training backends other than LLaMA-Factory are untested, but the code can be extended to support them.
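If you want to plug in a different backend, the sketch below shows the rough shape such an extension could take. All class and method names here are purely hypothetical and do not correspond to this repository's actual API:

```python
# Hypothetical sketch of a custom training backend; names are illustrative only.
from dataclasses import dataclass

@dataclass
class TrainingArgs:
    model_name: str
    learning_rate: float = 2e-4
    per_device_train_batch_size: int = 8
    num_epochs: int = 3

class CustomBackend:
    """Minimal wrapper: translate the repo's training arguments into calls
    to your framework of choice (e.g. a plain Hugging Face Trainer loop)."""

    def train(self, args: TrainingArgs, dataset_paths: list[str]) -> str:
        # Launch fine-tuning here and return the path of the output checkpoint.
        raise NotImplementedError
```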
Unlock safety-tuned models (for use as attackers and judges):

```bash
python main.py --model_name vicuna_7b --training_backend "llama_factory"
```

A full invocation with explicit training options:

```bash
python main.py \
    --model_name vicuna_7b \
    --training_backend "llama_factory" \
    --learning_rate 2e-4 \
    --per_device_train_batch_size 8 \
    --num_epochs 3 \
    --datasets_in_use shadow_alignment badllama alpaca_1k
```
Check `config/training_config.yaml` for more options. Training datasets are provided in `data/`.
See `attacks/README.md` for instructions on running PAIR and Crescendo attacks.
See `evaluation/README.md` for instructions on evaluating models on various benchmarks.
The project uses YAML configuration files in the `config/` directory:

- `model_config.yaml`: Model-specific settings and paths
- `training_config.yaml`: Training parameters and dataset configurations
Command-line arguments override configuration file values.
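As an example of this precedence, suppose `config/training_config.yaml` contained values like the following (an assumed illustration of the file's structure, not its actual contents):

```yaml
# Assumed illustrative excerpt of config/training_config.yaml (not the actual file)
learning_rate: 1e-4
per_device_train_batch_size: 8
num_epochs: 3
```

Running `python main.py --model_name vicuna_7b --training_backend "llama_factory" --learning_rate 2e-4` would then train with a learning rate of `2e-4`, taking the flag value over the file's `1e-4`.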
If you use this code in your research, please cite our paper:
```bibtex
@article{panfilov2025scalinglaws,
  title={Capability-Based Scaling Laws for LLM Red-Teaming},
  author={Alexander Panfilov and Paul Kassianik and Maksym Andriushchenko and Jonas Geiping},
  year={2025},
  journal={arXiv preprint arXiv:2505.20162},
  url={https://arxiv.org/abs/2505.20162},
}
```
This work would not be possible without the contributions of the following open-source projects: