GitHub - yowolf/Limitations-of-Alignment-in-LLMs: Code used in the paper

Repository contents:
- .idea
- behavior_expectation_misalignment_graphs
- beta_sigma_calculations
- kl_divergence_calculations
- lora_finetuning
- README.txt
- requirements.txt

This directory contains the code for the experiments in the paper "Fundamental Limitations of Alignment in Large Language Models" (https://arxiv.org/pdf/2304.11082).

1) behavior_expectation_misalignment_graphs - used to create the figures that demonstrate misalignment of an RLHF model and a pretrained model with our prompting method in the empirical section.
- misalignment_pretrained.py / misalignmend_RHLF.py - generate prompts from a proxy negative component for agreeableness and anti-immigration, then generate text from the original model conditioned on these prompts to check the alignment of the model.
- agreeableness / anti-immigration - directories containing the responses generated above, each tagged as aligned or misaligned.
- alignment_length_pretrained.xlsx / alignment_length_RLHF.xlsx - summary of the generated prompts and the percentage of aligned and misaligned responses of the model as a function of prompt length.
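
The per-length summary described above can be sketched as follows. This is only an illustration: it assumes each tagged response is available as a (prompt_length, label) pair with label "aligned" or "misaligned". That record layout and the function name alignment_by_length are hypothetical and are not taken from the repository scripts.

```python
from collections import defaultdict


def alignment_by_length(records):
    """Return {prompt_length: percent_aligned} from (prompt_length, label) pairs."""
    counts = defaultdict(lambda: [0, 0])  # prompt_length -> [aligned, total]
    for length, label in records:
        counts[length][1] += 1
        if label == "aligned":
            counts[length][0] += 1
    # Sort by prompt length so the result reads like a curve over length.
    return {length: 100.0 * a / t for length, (a, t) in sorted(counts.items())}


# Hypothetical tags, for illustration only (not results from the paper).
example = [(0, "aligned"), (0, "aligned"), (2, "aligned"), (2, "misaligned")]
print(alignment_by_length(example))
```

The misaligned percentage for each prompt length is 100 minus the value returned here.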

2) beta_sigma_calculations - used to calculate the proposed values of beta and sigma in the empirical section.
- beta_calculation.py - calculation of KL divergence between a negative behavior proxy component and a positive behavior proxy component, used to estimate beta.
- sigma_calculation.py - calculation of KL divergence between a negative behavior proxy component and a positive behavior proxy component, used to estimate sigma.
- beta_sigma_calculations.xlsx - values of beta and sigma calculated by the two scripts above, along with the generated prompts.
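
For reference, the KL divergence used above is KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)). A minimal sketch for two discrete distributions over the same support follows. The toy lists stand in for next-token probabilities of the negative and positive proxy components; they and the name kl_divergence are assumptions for illustration. How the repository scripts obtain these probabilities from the models is not shown here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # a term with p(x) = 0 contributes nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Toy distributions, for illustration only.
negative = [0.7, 0.2, 0.1]
positive = [0.1, 0.2, 0.7]
print(kl_divergence(negative, positive))
```

KL divergence is not symmetric, so the order of the two arguments matters.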

3) kl_divergence_calculations - used to create the figures in the empirical section that demonstrate the convergence of an RLHF model to a negative behavior proxy component in terms of KL divergence.
- behavior_convergence_sentence_wise_RLHF.py - calculates the conditional KL divergence between the RLHF model and a negative behavior proxy component.
- behavior_convergence_sentence_wise_pretrained.py - calculates the conditional KL divergence between the pretrained model and a negative behavior proxy component.
- kl_divergence_calculations.xlsx - values of KL divergence calculated by the two scripts above.
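
When full distributions over continuations are not available, a KL divergence can be estimated from samples as the average difference in log-probability. The sketch below illustrates that idea for a sentence-wise setting. It assumes the continuations were sampled from the first distribution and scored by both; the direction of the divergence, the data layout, and the names estimate_kl and kl_per_prompt_length are assumptions, not the method used in the repository scripts.

```python
def estimate_kl(logp_first, logp_second):
    """Monte Carlo estimate of KL(first || second) from per-sample log-probabilities.

    Both lists score the same continuations, which must have been sampled from `first`.
    """
    if len(logp_first) != len(logp_second) or not logp_first:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for a, b in zip(logp_first, logp_second)]
    return sum(diffs) / len(diffs)


def kl_per_prompt_length(scores_by_length):
    """Map {num_prompt_sentences: (logp_first, logp_second)} to {num_prompt_sentences: estimate}."""
    return {n: estimate_kl(a, b) for n, (a, b) in sorted(scores_by_length.items())}


# Hypothetical log-probabilities, for illustration only.
toy = {0: ([-1.0, -2.0], [-3.0, -4.0]), 1: ([-1.0, -2.0], [-1.5, -2.5])}
print(kl_per_prompt_length(toy))
```

A value that shrinks as more prompt sentences are added would indicate that the two distributions are moving closer together under conditioning.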

4) lora_finetuning - code for LoRA fine-tuning of the model on different behaviors.

Warning: some of the textual content in these files was generated by the language models and may be found offensive.
