This project provides a pipeline for detoxifying large language models (LLMs) using projection alignment techniques. The method involves training a teacher model, aligning it with a target model, generating detoxified responses, and evaluating the results.
./run_pretrain.sh
Trains a teacher model on detoxification tasks using distributed training across multiple GPUs. The pretrained model will be saved in the specified output directory.
Key parameters:
- model_name_or_path: Base model to start training from
- training_data_path: Path to training dataset
- output_dir: Directory to save trained model
- per_device_train_batch_size: Batch size per GPU
- gradient_accumulation_steps: Gradient accumulation steps
- learning_rate: Training learning rate
- num_train_epochs: Number of training epochs
- deepspeed: DeepSpeed configuration file
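Most of these parameters correspond directly to Hugging Face TrainingArguments fields. The sketch below shows one plausible way the training step could be wired up; the entry point, the dataset schema (a JSON file with a "text" field), and all paths are assumptions for illustration, not the repository's actual code.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder paths; the real values come from run_pretrain.sh.
model_name_or_path = "/path/to/base_model"
training_data_path = "/path/to/detox_train.json"  # assumed: JSON with a "text" field

tok = AutoTokenizer.from_pretrained(model_name_or_path)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

dataset = load_dataset("json", data_files=training_data_path)["train"]
dataset = dataset.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
                      remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="./teacher_model",        # output_dir
    per_device_train_batch_size=4,       # per_device_train_batch_size
    gradient_accumulation_steps=8,       # gradient_accumulation_steps
    learning_rate=2e-5,                  # learning_rate
    num_train_epochs=3,                  # num_train_epochs
    bf16=True,
    deepspeed="./ds_config.json",        # deepspeed configuration file
)

Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()
```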
./run_alignment.sh
Computes the alignment matrix between the teacher model and a target model (e.g., Llama-3) using contrastive learning.
Key parameters:
- model1_name: Path to teacher model
- model2_name: Path to target model
- output_dir: Directory to save alignment matrix
- batch_size: Training batch size
- epochs: Number of training epochs
- lr: Learning rate
- n_negatives: Number of negative samples per positive pair
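To make the contrastive objective concrete, here is a minimal sketch of learning a linear alignment matrix A with an InfoNCE-style loss and n_negatives negatives per positive pair. It assumes you already have paired sentence representations from the two models (row i of each tensor comes from the same text); the repository's feature extraction and exact loss may differ.

```python
import torch
import torch.nn.functional as F

teacher_dim, target_dim = 4096, 4096           # hidden sizes (assumed)
A = torch.nn.Parameter(torch.randn(teacher_dim, target_dim) * 0.01)
opt = torch.optim.Adam([A], lr=1e-4)
n_negatives = 7                                 # negatives per positive pair
temperature = 0.07

def info_nce_step(teacher_feats, target_feats):
    """teacher_feats, target_feats: (batch, dim) paired representations."""
    proj = F.normalize(teacher_feats @ A, dim=-1)   # map into the target space
    tgt = F.normalize(target_feats, dim=-1)

    batch = proj.size(0)
    loss = 0.0
    for i in range(batch):
        # Sample n_negatives other rows as negatives for anchor i.
        neg_idx = torch.randperm(batch - 1)[:n_negatives]
        neg_idx = neg_idx + (neg_idx >= i).long()       # skip the positive index
        logits = torch.cat([proj[i] @ tgt[i:i + 1].T,   # positive similarity
                            proj[i] @ tgt[neg_idx].T]) / temperature
        loss = loss + F.cross_entropy(logits.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long))
    loss = loss / batch
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example: one step on random features (stand-ins for real model activations).
print(info_nce_step(torch.randn(16, teacher_dim), torch.randn(16, target_dim)))
# torch.save(A.detach().cpu(), "alignment_matrix.pt")  # written to output_dir in practice
```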
./run_generation.sh <teacher_model_path> <target_model_path> <matrix_A_path> <device> <output_path> <data_path>
Generates detoxified responses for the challenge prompts using the alignment method, sweeping over a range of alpha values (the detoxification strength).
Parameters:
- teacher_model_path: Path to pretrained teacher model
- target_model_path: Path to target model
- matrix_A_path: Path to alignment matrix
- device: Computation device (e.g., cuda:0)
- output_path: Path to save generated responses
- data_path: Path to challenge prompts JSON file
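Conceptually, generation blends the target model's next-token distribution with the teacher's, weighted by alpha (0.0 falls back to the target model alone). The sketch below is only an illustration of that idea: it assumes the alignment matrix maps the teacher's final hidden state into the target model's hidden space, reuses the target's LM head, and that both models share a tokenizer. The repository's actual combination rule may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda:0"
teacher = AutoModelForCausalLM.from_pretrained("/path/to/teacher_model",
                                               torch_dtype=torch.bfloat16).to(device)
target = AutoModelForCausalLM.from_pretrained("/path/to/target_model",
                                              torch_dtype=torch.bfloat16).to(device)
tok = AutoTokenizer.from_pretrained("/path/to/target_model")  # shared tokenizer assumed
A = torch.load("/path/to/alignment_matrix.pt").to(device=device, dtype=torch.bfloat16)

alpha = 0.4  # detoxification strength
ids = tok("Some challenge prompt", return_tensors="pt").input_ids.to(device)

for _ in range(64):  # greedy decoding, for simplicity
    with torch.no_grad():
        tgt_out = target(ids, output_hidden_states=True)
        tea_out = teacher(ids, output_hidden_states=True)
        # Project the teacher's last hidden state into the target's space
        # (assumed shape of A: teacher_hidden x target_hidden),
        # then score it with the target's LM head.
        projected = tea_out.hidden_states[-1][:, -1] @ A
        teacher_logits = target.lm_head(projected)
        logits = (1 - alpha) * tgt_out.logits[:, -1] + alpha * teacher_logits
    next_id = logits.argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)
    if next_id.item() == tok.eos_token_id:
        break

print(tok.decode(ids[0], skip_special_tokens=True))
```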
./run_tox_score.sh <device> <input_json_path>
Computes toxicity scores using Detoxify's original model.
Parameters:
- device: Computation device (e.g., cuda:0 or cpu)
- input_json_path: Path to generated responses JSON file
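The toxicity step relies on the Detoxify library; its "original" model returns a per-text toxicity score in [0, 1]. A minimal sketch, assuming the generations file is a JSON list of objects with a "response" field (the actual schema may differ):

```python
import json
from detoxify import Detoxify

# Load Detoxify's "original" model on the requested device.
scorer = Detoxify("original", device="cuda:0")

with open("./generation_results.json") as f:
    records = json.load(f)  # assumed: list of {"response": "..."} objects

scores = [scorer.predict(r["response"])["toxicity"] for r in records]
print(f"mean toxicity: {sum(scores) / len(scores):.4f}")
```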
./run_ppl_score.sh <llama2_13b_model_path> <input_json_path>
Computes perplexity scores using Llama2-13B as the reference model.
Parameters:
- llama2_13b_model_path: Path to Llama2-13B model
- input_json_path: Path to generated responses JSON file
- Clone the repository:
git clone https://github.com/yourusername/detoxification-projection-alignment.git
cd detoxification-projection-alignment
- Install dependencies:
pip install -r requirements.txt
- Run the full pipeline:
# Step 1: Pretrain teacher model
./run_pretrain.sh
# Step 2: Train alignment matrix
./run_alignment.sh
# Step 3: Generate responses
./run_generation.sh \
/path/to/teacher_model \
/path/to/target_model \
/path/to/alignment_matrix.pt \
cuda:0 \
./generation_results.json \
./challenge_prompts.jsonl
# Step 4: Evaluate toxicity
./run_tox_score.sh cuda:0 ./generation_results.json
# Step 4 (Alternative): Evaluate perplexity
./run_ppl_score.sh /path/to/llama2-13b ./generation_results.json
- For faster training, increase the per-device batch size and use gradient accumulation to reach a larger effective batch size
- Experiment with different alpha values (0.0-0.6) for detoxification strength
- Use bfloat16 precision for faster inference on compatible hardware
- For large models, use DeepSpeed for efficient distributed training
- Toxicity Score: Lower values indicate less toxic content (range: 0-1)
- Perplexity (PPL): Lower values indicate higher text quality
- Optimal alpha values balance detoxification with text quality