+++
disableToc = false
title = "Fine-tuning LLMs for text generation"
weight = 3
+++

{{% notice note %}}
Section under construction
{{% /notice %}}

This section covers how to fine-tune a language model for text generation and consume it in LocalAI.

## Requirements

For this example you will need a GPU with at least 12GB of VRAM and a Linux box.

## Fine-tuning

Fine-tuning a language model is a process that requires a lot of computational power and time.

Currently LocalAI doesn't support a fine-tuning endpoint, but there are [plans](https://github.com/mudler/LocalAI/issues/596) to add one. In the meantime, this guide provides a simple starting point for fine-tuning a model and using it with LocalAI (and also with llama.cpp).

There is an e2e example of fine-tuning an LLM to use with [LocalAI](https://github.com/mudler/LocalAI), written by [@mudler](https://github.com/mudler), available [here](https://github.com/mudler/LocalAI/tree/master/examples/e2e-fine-tuning/).

The steps involved are:

- Prepare a dataset
- Prepare the environment and install dependencies
- Fine-tune the model
- Merge the LoRA adapter with the base model
- Convert the model to gguf
- Use the model with LocalAI

## Dataset preparation

We are going to need a dataset, or a set of datasets.

Axolotl supports a variety of formats. In the notebook and in this example we aim for a very simple dataset built manually, so we are going to use the `completion` format, which requires the full text to be used for fine-tuning.

A dataset for an instruction-following model (like Alpaca) can look like the following:

```json
[
  {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  },
  {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  }
]
```

Each `text` entry is the whole text used for fine-tuning. For example, for an instruction-following model it follows this format (more or less):

```
<System prompt>

## Instruction

<Question, instruction>

## Response

<Expected response from the LLM>
```

The instruction format works as follows: at inference time, we feed the model only the first part, up to and including the `## Instruction` block, and the model completes the text with the `## Response` block.

Prepare a dataset, and upload it to your Google Drive if you are using the Google Colab notebook. Otherwise, place it next to the `axolotl.yaml` file as `dataset.json`.

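As a minimal sketch, a `dataset.json` in the completion format can be written by hand and sanity-checked like this (the instruction/response pair is an illustrative placeholder, not taken from the LocalAI example):

```shell
# Sketch: write a minimal completion-format dataset.json
# (the instruction/response text below is a placeholder)
cat > dataset.json <<'EOF'
[
  {
    "text": "As an AI language model you are trained to reply to an instruction.\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  }
]
EOF

# Sanity-check that the file is valid JSON before fine-tuning
python3 -c 'import json; json.load(open("dataset.json"))'
```

A real dataset should contain many such entries; a single example is only enough to exercise the pipeline.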
### Install dependencies

```bash
# Install axolotl and dependencies
git clone https://github.com/OpenAccess-AI-Collective/axolotl && pushd axolotl && git checkout 797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a && popd #0.3.0
pip install packaging
pushd axolotl && pip install -e '.[flash-attn,deepspeed]' && popd

# https://github.com/oobabooga/text-generation-webui/issues/4238
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

Configure accelerate:

```bash
accelerate config default
```

## Fine-tuning

We need to configure Axolotl. This example provides an `axolotl.yaml` file that uses openllama-3b for fine-tuning. Copy the `axolotl.yaml` file and edit it to your needs. The dataset needs to be next to it as `dataset.json`. You can find the `axolotl.yaml` file [here](https://github.com/mudler/LocalAI/tree/master/examples/e2e-fine-tuning/).

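For orientation, the dataset-related part of such a config might look like the fragment below. This is only a sketch with assumed values; refer to the actual `axolotl.yaml` in the example repository for the full, working configuration:

```yaml
# Sketch only: field values are illustrative assumptions
base_model: openlm-research/open_llama_3b_v2
datasets:
  - path: dataset.json
    ds_type: json
    type: completion   # matches the dataset format prepared above
adapter: qlora
output_dir: ./qlora-out
```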
If you have a big dataset, you can pre-tokenize it to speed up the fine-tuning process:

```bash
# Optional pre-tokenize (run only if big dataset)
python -m axolotl.cli.preprocess axolotl.yaml
```

Now we are ready to start the fine-tuning process:

```bash
# Fine-tune
accelerate launch -m axolotl.cli.train axolotl.yaml
```

After the fine-tuning has finished, we merge the LoRA adapter with the base model:

```bash
# Merge lora
python3 -m axolotl.cli.merge_lora axolotl.yaml --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False
```

And we convert it to the gguf format that LocalAI can consume:

```bash
# Convert to gguf
git clone https://github.com/ggerganov/llama.cpp.git
pushd llama.cpp && make LLAMA_CUBLAS=1 && popd

# We need to convert the pytorch model into gguf for quantization
# It creates 'ggml-model-f16.gguf' in the 'merged' directory.
pushd llama.cpp && python convert.py --outtype f16 \
  ../qlora-out/merged/pytorch_model-00001-of-00002.bin && popd

# Start off by making a basic q4_0 4-bit quantization.
# It's important to have 'ggml' in the name of the quant for some
# software to recognize its file format.
pushd llama.cpp && ./quantize ../qlora-out/merged/ggml-model-f16.gguf \
  ../custom-model-q4_0.bin q4_0 && popd
```

Now you should have a `custom-model-q4_0.bin` file that you can copy into the LocalAI models directory and use with LocalAI.
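As a sketch of that last step (assuming a LocalAI instance is already running on `localhost:8080`; host, port, and the prompt text are illustrative):

```shell
# Copy the quantized model into the LocalAI models directory
mkdir -p models
cp custom-model-q4_0.bin models/ 2>/dev/null || true

# Query the model through LocalAI's OpenAI-compatible completions endpoint.
# The prompt ends at "## Response" so the model completes the answer,
# mirroring the dataset format used for fine-tuning.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "custom-model-q4_0.bin",
        "prompt": "## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\n"
      }' || true
```

Note that the model is addressed by its filename in the `models` directory.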