yowolf/Limitations-of-Alignment-in-LLMs
This directory contains the code for the experiments in the paper "Fundamental Limitations of Alignment in Large Language Models" (https://arxiv.org/pdf/2304.11082).

1) behavior_expectation_misalignment_graphs - used to create the figures in the empirical section that demonstrate misalignment of an RLHF model and a pretrained model under our prompting method.
   - misalignment_pretrained.py / misalignmend_RHLF.py - generate prompts from a negative behavior proxy component for agreeableness and anti-immigration, then generate text from the original model conditioned on these prompts to check the model's alignment.
   - agreeableness / anti-immigration - directories containing the generated responses described above, tagged as aligned or misaligned.
   - alignment_length_pretrained.xlsx / alignment_length_RLHF.xlsx - summary of the generated prompts and the percentage of aligned and misaligned model responses as a function of prompt length.

2) beta_sigma_calculations - used to calculate the proposed values of beta and sigma in the empirical section.
   - beta_calculation.py - calculates the KL divergence between a negative behavior proxy component and a positive behavior proxy component, used to estimate beta.
   - sigma_calculation.py - calculates the KL divergence between a negative behavior proxy component and a positive behavior proxy component, used to estimate sigma.
   - beta_sigma_calculations.xlsx - values of beta and sigma calculated by the two scripts above, along with the generated prompts.

3) kl_divergence_calculations - used to create the figures in the empirical section that demonstrate the convergence of an RLHF model to a negative behavior proxy component in terms of KL divergence.
   - behavior_convergence_sentence_wise_RLHF.py - calculates the conditional KL divergence between the RLHF model and a negative behavior proxy component.
   - behavior_convergence_sentence_wise_pretrained.py - calculates the conditional KL divergence between the pretrained model and a negative behavior proxy component.
   - kl_divergence_calculations.xlsx - KL divergence values calculated by the two scripts above.

4) lora_finetuning - code for LoRA fine-tuning the model on different behaviors.

Warning: some of the textual content in these files, generated by the language models, may be found offensive.
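The KL divergence calculations mentioned for beta_sigma_calculations and kl_divergence_calculations can be illustrated with a minimal sketch. This is not the repository's code: the function names (`softmax`, `kl_divergence`, `mean_conditional_kl`) are hypothetical, and the sketch assumes each model exposes a list of next-token logits per position over a shared vocabulary. It only shows the general idea of comparing two next-token distributions position by position and averaging the result.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero (an assumption of this sketch, not of the paper).
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_conditional_kl(p_logits_per_step, q_logits_per_step):
    """Average KL(p || q) over token positions.

    Each argument is a list with one entry per position, where every entry is
    the list of next-token logits a model produced given the shared prefix.
    """
    kls = [
        kl_divergence(softmax(pl), softmax(ql))
        for pl, ql in zip(p_logits_per_step, q_logits_per_step)
    ]
    return sum(kls) / len(kls)


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary at 2 positions (made-up numbers).
    model_a = [[2.0, 0.5, 0.1], [0.3, 1.2, 0.0]]
    model_b = [[1.5, 0.7, 0.2], [0.1, 1.0, 0.4]]
    print(mean_conditional_kl(model_a, model_b))
```

In practice the logits would come from the language models being compared (for example, an RLHF model and a fine-tuned behavior proxy), evaluated on the same prompt prefix at each position.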