yowolf/Limitations-of-Alignment-in-LLMs

This directory contains the code for the experiments in the paper "Fundamental Limitations of Alignment in Large Language Models" (https://arxiv.org/pdf/2304.11082).

1) behavior_expectation_misalignment_graphs - used to create the figures that demonstrate misalignment of an RLHF model and a pretrained model with our prompting method in the empirical section.
- misalignment_pretrained.py / misalignmend_RHLF.py - generate prompts from a negative behavior proxy component for agreeableness and anti-immigration, then generate text from the original model conditioned on these prompts to check the model's alignment.
- agreeableness / anti-immigration - directories containing the generated responses above tagged as aligned or misaligned.
- alignment_length_pretrained.xlsx / alignment_length_RLHF.xlsx - summary of the generated prompts and the percentage of aligned and misaligned responses of the model as a function of prompt length.
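
The summary by prompt length described above could be computed along the following lines. This is a minimal sketch, not the repository's code: the function name `aligned_percentage_by_length` and the `(prompt_length, is_aligned)` record layout are assumptions made for illustration.

```python
from collections import defaultdict


def aligned_percentage_by_length(records):
    """Return {prompt_length: percent of aligned responses}.

    `records` is a hypothetical iterable of (prompt_length, is_aligned)
    pairs, one per generated response.
    """
    counts = defaultdict(lambda: [0, 0])  # length -> [aligned, total]
    for length, is_aligned in records:
        counts[length][0] += int(is_aligned)
        counts[length][1] += 1
    return {
        length: 100.0 * aligned / total
        for length, (aligned, total) in sorted(counts.items())
    }


# Illustrative data only, not results from the paper.
example = [(1, True), (1, True), (2, True), (2, False)]
summary = aligned_percentage_by_length(example)
```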

2) beta_sigma_calculations - used to calculate the proposed values of beta and sigma in the empirical section.
- beta_calculation.py - calculation of KL divergence between a negative behavior proxy component and a positive behavior proxy component, used to estimate beta.
- sigma_calculation.py - calculation of KL divergence between a negative behavior proxy component and a positive behavior proxy component, used to estimate sigma.
- beta_sigma_calculations.xlsx - values of beta and sigma calculated by the two scripts above, along with the generated prompts.
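
For readers unfamiliar with the quantity used here, the KL divergence between two next-token distributions can be sketched as follows. This is a standard-library illustration under stated assumptions, not the scripts' implementation: `softmax`, `kl_divergence`, and the toy distributions are hypothetical, and the actual scripts presumably operate on model outputs rather than hand-written lists.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; q_i is floored at `eps`
    to avoid division by zero.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy distributions standing in for the negative and positive proxy components.
p_negative = [0.5, 0.5]
p_positive = [0.25, 0.75]
kl_value = kl_divergence(p_negative, p_positive)
```

Note that KL divergence is asymmetric, so the order of the two components matters.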

3) kl_divergence_calculations - used to create figures that demonstrate the convergence of an RLHF model to a negative behavior proxy component in terms of KL divergence. Also presented in the empirical section.
- behavior_convergence_sentence_wise_RLHF.py - calculates the conditional KL divergence between the RLHF model and a negative behavior proxy component.
- behavior_convergence_sentence_wise_pretrained.py - calculates the conditional KL divergence between the pretrained model and a negative behavior proxy component.
- kl_divergence_calculations.xlsx - values of KL divergence calculated by the two scripts above.
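
One way to estimate a sentence-wise conditional KL divergence is by Monte Carlo: sample sequences from the proxy component, score each sentence under both models given the preceding sentences, and average the log-probability differences at each sentence position. The sketch below assumes that approach, which may differ from what the scripts do. The function name and the input layout (one list of per-sentence conditional log-probabilities per sampled sequence) are hypothetical.

```python
def conditional_kl_per_sentence(proxy_logprobs, model_logprobs):
    """Estimate the conditional KL at each sentence position.

    Both arguments are lists of sequences; each sequence is a list of
    log-probabilities, one per sentence, conditioned on the preceding
    sentences. Sequences are assumed to be sampled from the proxy.
    Returns one averaged estimate per sentence position.
    """
    n_sentences = min(len(seq) for seq in proxy_logprobs)
    estimates = []
    for i in range(n_sentences):
        # log P_proxy(s_i | prefix) - log P_model(s_i | prefix), per sample
        diffs = [p[i] - m[i] for p, m in zip(proxy_logprobs, model_logprobs)]
        estimates.append(sum(diffs) / len(diffs))
    return estimates


# Illustrative log-probabilities only, not values from the paper.
proxy = [[-1.0, -2.0], [-1.5, -2.5]]
model = [[-2.0, -2.5], [-2.5, -2.5]]
per_sentence_kl = conditional_kl_per_sentence(proxy, model)
```

An estimate that shrinks with the sentence index would indicate that the model's conditional distribution is moving closer to the proxy component as the prompt grows.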

4) lora_finetuning - code for LoRA fine-tuning the model on different behaviors.

Warning: some of the text in these files was generated by language models and may be offensive.
