+++
disableToc = false
title = "Fine-tuning LLMs for text generation"
weight = 22
+++

{{% alert note %}}
Section under construction
{{% /alert %}}

This section covers how to fine-tune a language model for text generation and consume it in LocalAI.

[Open in Colab](https://colab.research.google.com/github/mudler/LocalAI/blob/master/examples/e2e-fine-tuning/notebook.ipynb)

## Requirements

For this example you will need a Linux machine with a GPU that has at least 12GB of VRAM.

## Fine-tuning

Fine-tuning a language model is a process that requires a lot of computational power and time.

Currently LocalAI doesn't support the fine-tuning endpoint, but there are [plans](https://github.com/mudler/LocalAI/issues/596) to support it. For the time being, this guide offers a simple starting point for fine-tuning a model and using it with LocalAI (and also with llama.cpp).

An end-to-end example of fine-tuning an LLM for use with [LocalAI](https://github.com/mudler/LocalAI), written by [@mudler](https://github.com/mudler), is available [here](https://github.com/mudler/LocalAI/tree/master/examples/e2e-fine-tuning/).

The steps involved are:

- Prepare a dataset
- Prepare the environment and install dependencies
- Fine-tune the model
- Merge the LoRA adapter into the base model
- Convert the model to GGUF
- Use the model with LocalAI

## Dataset preparation

We are going to need a dataset or a set of datasets.

Axolotl supports a variety of dataset formats. In the notebook and in this example we aim for a very simple dataset that we build manually, so we use the `completion` format, which requires each record to contain the full text used for fine-tuning.

A dataset for an instruction-following model (like Alpaca) can look like the following:

```json
[
  {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  },
  {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  }
]
```

Each `text` field holds the whole text used for fine-tuning. For an instruction-following model, it follows roughly this format:

```
<System prompt>

## Instruction

<Question, instruction>

## Response

<Expected response from the LLM>
```

With this format, at inference time we feed the model only the first part, up to and including the `## Instruction` block, and the model completes the text with the `## Response` block.
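
The split between training text and inference prompt can be sketched in a few lines of Python. This is only an illustration: `build_prompt` and `build_training_text` are hypothetical helper names, not part of LocalAI or axolotl, and the sketch assumes the prompt ends with the `## Response` header so that the model only has to generate the answer.

```python
# Hypothetical helpers illustrating the instruction format described above.
# Assumption: the inference prompt ends with the "## Response" header.
PROMPT_TEMPLATE = "{system}\n\n## Instruction\n\n{instruction}\n\n## Response\n\n"


def build_prompt(system: str, instruction: str) -> str:
    """Return the text fed to the model at inference time."""
    return PROMPT_TEMPLATE.format(
        system=system.strip(), instruction=instruction.strip()
    )


def build_training_text(system: str, instruction: str, response: str) -> str:
    """Return the full text of one training record: prompt plus expected response."""
    return build_prompt(system, instruction) + response.strip()
```

Keeping both functions on the same template means the prompt used at inference time is always an exact prefix of the text the model saw during fine-tuning.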

Prepare a dataset and, if you are using Google Colab, upload it to your Google Drive. Otherwise, place it next to the `axolotl.yaml` file as `dataset.json`.
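
For a small hand-built dataset, a short script can produce `dataset.json` and avoid hand-editing JSON. The following is a minimal sketch using only the Python standard library; `to_record` and `write_dataset` are hypothetical names, and the input is assumed to be a list of (system prompt, instruction, response) tuples.

```python
import json
from pathlib import Path


def to_record(system: str, instruction: str, response: str) -> dict:
    """Build one `completion`-style record holding the full training text."""
    text = (
        f"{system}\n\n## Instruction\n\n{instruction}"
        f"\n\n## Response\n\n{response}"
    )
    return {"text": text}


def write_dataset(examples, path="dataset.json"):
    """Write (system, instruction, response) tuples as a JSON array of records."""
    records = [to_record(*example) for example in examples]
    Path(path).write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return records
```

Calling `write_dataset([("Be polite.", "Write a poem about a tree.", "Trees are beautiful, ...")])` in the directory that holds `axolotl.yaml` writes a one-record `dataset.json` there.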

### Install dependencies

```bash
# Install axolotl and dependencies
git clone https://github.com/OpenAccess-AI-Collective/axolotl && pushd axolotl && git checkout 797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a && popd #0.3.0
pip install packaging
pushd axolotl && pip install -e '.[flash-attn,deepspeed]' && popd

# https://github.com/oobabooga/text-generation-webui/issues/4238
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

Configure accelerate:

```bash
accelerate config default
```

## Fine-tuning

We need to configure axolotl. This example provides an `axolotl.yaml` file that uses openllama-3b for fine-tuning. Copy the file and edit it to your needs. The dataset needs to sit next to it as `dataset.json`. You can find the `axolotl.yaml` file [here](https://github.com/mudler/LocalAI/tree/master/examples/e2e-fine-tuning/).

If you have a big dataset, you can pre-tokenize it to speed up the fine-tuning process:

```bash
# Optional pre-tokenize (run only if big dataset)
python -m axolotl.cli.preprocess axolotl.yaml
```
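
Before launching a run, it can be worth checking that `dataset.json` parses and that every record carries a non-empty `text` field, since a single stray trailing comma is enough to make the file invalid JSON. A minimal sketch using only the Python standard library; `validate_dataset` is a hypothetical helper, not something axolotl provides.

```python
import json


def validate_dataset(raw: str) -> list:
    """Return a list of problems found in the dataset JSON (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    if not isinstance(data, list):
        return ["top level must be a JSON array of records"]
    problems = []
    for index, record in enumerate(data):
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            problems.append(f"record {index}: missing string field 'text'")
        elif not record["text"].strip():
            problems.append(f"record {index}: empty 'text'")
    return problems
```

To check a file, pass its contents, for example `validate_dataset(open("dataset.json", encoding="utf-8").read())`, and inspect the returned list.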

Now we are ready to start the fine-tuning process:
```bash
# Fine-tune
accelerate launch -m axolotl.cli.train axolotl.yaml
```

Once fine-tuning has finished, merge the LoRA adapter into the base model:
```bash
# Merge lora
python3 -m axolotl.cli.merge_lora axolotl.yaml --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False
```

Then convert it to the GGUF format that LocalAI can consume:

```bash
# Convert to gguf
git clone https://github.com/ggerganov/llama.cpp.git
pushd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release && popd

# We need to convert the PyTorch model to GGUF before quantization.
# This creates an F16 '.gguf' file in the 'merged' directory. The quantize
# command below assumes it is named 'Merged-33B-F16.gguf'; adjust the path
# if the file produced on your machine has a different name.
pushd llama.cpp && python3 convert_hf_to_gguf.py ../qlora-out/merged && popd

# Start off by making a basic q4_0 4-bit quantization.
# It's important to have 'ggml' in the name of the quant for some
# software to recognize its file format.
pushd llama.cpp/build/bin && ./llama-quantize ../../../qlora-out/merged/Merged-33B-F16.gguf \
  ../../../custom-model-q4_0.gguf q4_0 && popd
```
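
The quantize step refers to the converted file by a fixed name. If you are unsure what the conversion step wrote, a small helper can locate it instead of guessing. This is a sketch under assumptions: `find_converted_gguf` is a hypothetical name, and it simply picks the most recently modified `.gguf` file in the `qlora-out/merged` directory used above.

```python
from pathlib import Path


def find_converted_gguf(merged_dir: str = "qlora-out/merged") -> Path:
    """Return the most recently modified .gguf file in the merged model directory."""
    candidates = sorted(
        Path(merged_dir).glob("*.gguf"), key=lambda path: path.stat().st_mtime
    )
    if not candidates:
        raise FileNotFoundError(f"no .gguf file found in {merged_dir}")
    return candidates[-1]
```

The returned path can then be passed to `llama-quantize` as the input file.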

You should now have a `custom-model-q4_0.gguf` file that you can copy into the LocalAI models directory and use with LocalAI.
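
Once the model is in place, it can be queried through LocalAI's OpenAI-compatible completions endpoint using the same prompt layout as the training data. The sketch below uses only the Python standard library and rests on several assumptions: LocalAI is listening on `http://localhost:8080`, the model is addressed by its file name `custom-model-q4_0.gguf`, and `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "custom-model-q4_0.gguf",  # assumption: model addressed by file name
    base_url: str = "http://localhost:8080",  # assumption: local LocalAI instance
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the prompt to the server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("<System prompt>\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\n")` should return the model's continuation of the `## Response` block.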