lollms-webui/notebooks/ggml_quantize.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step consists of compiling llama.cpp and installing the required libraries in our Python environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install llama.cpp\n",
"!git clone https://github.com/ggerganov/llama.cpp\n",
"!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make\n",
"!pip install -r llama.cpp/requirements.txt"
]
},
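{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not part of the original walkthrough), we can confirm that the build produced the `quantize` binary we will rely on later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: verify the build produced the quantize binary\n",
"import os\n",
"\n",
"assert os.path.exists(\"llama.cpp/quantize\"), \"Build failed: quantize binary not found\"\n",
"print(\"llama.cpp built successfully\")"
]
},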
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can download our model. We will use an jondurbin/airoboros-m-7b-3.1.2 model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MODEL_ID = \"jondurbin/airoboros-m-7b-3.1.2\"\n",
"\n",
"# Download model\n",
"!git lfs install\n",
"!git clone https://huggingface.co/{MODEL_ID}"
]
},
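{
"cell_type": "markdown",
"metadata": {},
"source": [
"If git-lfs is unavailable, `snapshot_download` from `huggingface_hub` is an equivalent alternative. This is a minimal sketch; `local_dir` simply mirrors the directory the git clone above creates."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Alternative download (sketch): equivalent to the git clone above\n",
"!pip install -q huggingface_hub\n",
"from huggingface_hub import snapshot_download\n",
"\n",
"snapshot_download(repo_id=MODEL_ID, local_dir=MODEL_ID.split('/')[-1])"
]
},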
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step can take a while. Once its done, we need to convert our weight to GGML FP16 format"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MODEL_NAME = MODEL_ID.split('/')[-1]\n",
"\n",
"# Convert to fp16\n",
"fp16 = f\"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin\"\n",
"!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}"
]
},
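{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before quantizing, it can be worth confirming that the FP16 file was written and checking its size (a 7B model is on the order of 13 GB at FP16). This check is an optional convenience."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check: confirm the FP16 file exists and report its size\n",
"import os\n",
"\n",
"size_gb = os.path.getsize(fp16) / 1024**3\n",
"print(f\"{fp16}: {size_gb:.2f} GB\")"
]
},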
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can quantize the model using one or several methods. In this case, we will use the Q4_K_M and Q5_K_M methods. This is the only step that actually requires a GPU."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"QUANTIZATION_METHODS = [\"q4_k_m\", \"q5_k_m\"]\n",
"\n",
"for method in QUANTIZATION_METHODS:\n",
" qtype = f\"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf\"\n",
" !./llama.cpp/quantize {fp16} {qtype} {method}"
]
},
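{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now list the quantized files to compare their sizes, and optionally run a short generation through llama.cpp's `main` binary as a smoke test (the prompt, token count, and `q4_path` name below are arbitrary)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare the sizes of the quantized models\n",
"import os\n",
"\n",
"for file in os.listdir(MODEL_NAME):\n",
"    if file.endswith(\".gguf\"):\n",
"        path = f\"{MODEL_NAME}/{file}\"\n",
"        print(f\"{path}: {os.path.getsize(path) / 1024**3:.2f} GB\")\n",
"\n",
"# Optional smoke test with the Q4_K_M model\n",
"q4_path = f\"{MODEL_NAME}/{MODEL_NAME.lower()}.Q4_K_M.gguf\"\n",
"!./llama.cpp/main -m {q4_path} -n 64 -p \"The capital of France is\""
]
},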
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can push our quantized model to a new repo on the Hugging Face Hub with the “-GGUF” suffix. First, lets log in and modify the following code block to match your username. You can enter your Hugging Face token (https://huggingface.co/settings/tokens) in Google Colabs “Secrets” tab. We use the allow_patterns parameter to only upload GGUF models and not the entirety of the directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install -q huggingface_hub\n",
"from huggingface_hub import create_repo, HfApi\n",
"from google.colab import userdata\n",
"\n",
"# Defined in the secrets tab in Google Colab\n",
"hf_token = userdata.get('huggingface')\n",
"\n",
"api = HfApi()\n",
"username = \"parisneo\"\n",
"\n",
"# Create empty repo\n",
"create_repo(\n",
" repo_id = f\"{username}/{MODEL_NAME}-GGUF\",\n",
" repo_type=\"model\",\n",
" exist_ok=True,\n",
" token=hf_token\n",
")\n",
"\n",
"# Upload gguf files\n",
"api.upload_folder(\n",
" folder_path=MODEL_NAME,\n",
" repo_id=f\"{username}/{MODEL_NAME}-GGUF\",\n",
" allow_patterns=f\"*.gguf\",\n",
" token=hf_token\n",
")"
]
}
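,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final optional check, we can list the files now present in the new repo to confirm that only the GGUF files were uploaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify the upload: list the files now in the new repo\n",
"files = api.list_repo_files(f\"{username}/{MODEL_NAME}-GGUF\", token=hf_token)\n",
"for f in files:\n",
"    print(f)"
]
}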
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}