LocalAI/docs/content/getting_started/_index.en.md
2023-12-17 19:02:13 +01:00

+++
disableToc = false
title = "Getting started"
weight = 1
url = '/basics/getting_started/'
+++

LocalAI is available as a container image and as a binary. It can be used with Docker, Podman, Kubernetes, and any other container engine. You can check out all the available images with their corresponding tags here.

See also our [How to]({{%relref "howtos" %}}) section for end-to-end guided examples curated by the community.

How to get started

The easiest way to run LocalAI is with Docker Compose or plain Docker (to build locally, see the [build section]({{%relref "build" %}})).

{{% notice note %}} To run with GPU acceleration, see [GPU acceleration]({{%relref "features/gpu-acceleration" %}}). {{% /notice %}}

{{< tabs >}} {{% tab name="Docker" %}}

# Prepare the models in the `models` directory
mkdir models

# copy your models to it
cp your-model.gguf models/

# run the LocalAI container
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
# You should see:
# 
# ┌───────────────────────────────────────────────────┐
# │                   Fiber v2.42.0                   │
# │               http://127.0.0.1:8080               │
# │       (bound on host 0.0.0.0 and port 8080)       │
# │                                                   │
# │ Handlers ............. 1  Processes ........... 1 │
# │ Prefork ....... Disabled  PID ................. 1 │
# └───────────────────────────────────────────────────┘

# Try the endpoint with curl
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "your-model.gguf",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
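
The same request can also be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_completion_request` and `send_request` are hypothetical, and the base URL and model name simply mirror the curl example above.

```python
import json
import urllib.request


def build_completion_request(model, prompt, temperature=0.7,
                             base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {"model": model, "prompt": prompt, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(req):
    """Send the request and decode the JSON body.

    Requires a LocalAI instance listening on the target address.
    """
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the container running, `send_request(build_completion_request("your-model.gguf", "A long time ago in a galaxy far, far away"))` should return the decoded JSON response.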

{{% notice note %}}

  • On Apple Silicon (ARM), running LocalAI in Docker is not recommended because of emulation. Follow the [build instructions]({{%relref "build" %}}) to use Metal acceleration for full GPU support.
  • On x86_64 Apple hardware you can use Docker; building from source brings no additional gain. {{% /notice %}}

{{% /tab %}} {{% tab name="Docker compose" %}}

# Clone LocalAI
git clone https://github.com/go-skynet/LocalAI

cd LocalAI

# (optional) Checkout a specific LocalAI tag
# git checkout -b build <TAG>

# copy your models to models/
cp your-model.gguf models/

# (optional) Edit the .env file to set things like context size and threads
# vim .env

# start with docker compose
docker compose up -d --pull always
# or you can build the images with:
# docker compose up -d --build

# The API is now accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"your-model.gguf","object":"model"}]}

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "your-model.gguf",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
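
When scripting against the API, it can help to read the model names from the `/v1/models` response before sending a completion request. The sketch below parses a response body of the shape shown above; `model_ids` is a hypothetical helper, and the sample string is copied from the example output rather than fetched from a live server.

```python
import json


def model_ids(models_response: str) -> list[str]:
    """Return the model ids found in a /v1/models JSON response body."""
    body = json.loads(models_response)
    # Each entry in "data" describes one model; "id" is the name to pass
    # as "model" in completion requests.
    return [entry["id"] for entry in body.get("data", [])]


# Sample body copied from the example output above.
sample = '{"object":"list","data":[{"id":"your-model.gguf","object":"model"}]}'
print(model_ids(sample))
```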

Note: If you are on Windows, make sure the project is on the Linux filesystem; otherwise loading models might be slow. For more info, see the Microsoft Docs.

{{% /tab %}}

{{% tab name="Kubernetes" %}}

To install LocalAI in Kubernetes, you can use the following Helm chart:

# Install the helm repository
helm repo add go-skynet https://go-skynet.github.io/helm-charts/
# Update the repositories
helm repo update
# Get the values
helm show values go-skynet/local-ai > values.yaml

# Edit the values if needed
# vim values.yaml ...

# Install the helm chart
helm install local-ai go-skynet/local-ai -f values.yaml

{{% /tab %}}

{{< /tabs >}}

Container images

LocalAI has a set of images to support CUDA, ffmpeg and 'vanilla' (CPU-only). The image list is on quay:

{{< tabs >}} {{% tab name="Vanilla / CPU Images" %}}

  • master
  • latest
  • {{< version >}}
  • {{< version >}}-ffmpeg
  • {{< version >}}-ffmpeg-core

Core Images - Smaller images without pre-downloaded Python dependencies {{% /tab %}}

{{% tab name="GPU Images CUDA 11" %}}

  • master-cublas-cuda11
  • master-cublas-cuda11-core
  • {{< version >}}-cublas-cuda11
  • {{< version >}}-cublas-cuda11-core
  • {{< version >}}-cublas-cuda11-ffmpeg
  • {{< version >}}-cublas-cuda11-ffmpeg-core

Core Images - Smaller images without pre-downloaded Python dependencies {{% /tab %}}

{{% tab name="GPU Images CUDA 12" %}}

  • master-cublas-cuda12
  • master-cublas-cuda12-core
  • {{< version >}}-cublas-cuda12
  • {{< version >}}-cublas-cuda12-core
  • {{< version >}}-cublas-cuda12-ffmpeg
  • {{< version >}}-cublas-cuda12-ffmpeg-core

Core Images - Smaller images without pre-downloaded Python dependencies

{{% /tab %}}

{{< /tabs >}}

Example:

  • Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:latest
  • FFmpeg: quay.io/go-skynet/local-ai:{{< version >}}-ffmpeg
  • CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda11-ffmpeg
  • CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda12-ffmpeg
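
The tags above follow a regular pattern: version, then an optional CUDA suffix, then optional `ffmpeg` and `core` suffixes. As an illustration only, the hypothetical helper `image_ref` below composes a reference from those parts. It assumes the pattern holds for the requested combination; not every combination appears in the lists above (for example, a CPU-only `core` image without `ffmpeg` is not listed), so check the registry before relying on a generated tag.

```python
REGISTRY = "quay.io/go-skynet/local-ai"


def image_ref(version, cuda=None, ffmpeg=False, core=False):
    """Compose an image reference following the tag pattern listed above.

    version: a release tag or "master"; cuda: None (CPU-only), 11 or 12.
    """
    if cuda not in (None, 11, 12):
        raise ValueError("cuda must be None, 11 or 12")
    parts = [version]
    if cuda is not None:
        parts.append(f"cublas-cuda{cuda}")
    if ffmpeg:
        parts.append("ffmpeg")
    if core:
        parts.append("core")
    return f"{REGISTRY}:{'-'.join(parts)}"


print(image_ref("v2.0.0", cuda=12, core=True))
```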

{{% notice note %}} The binary inside the image is pre-compiled and might not suit all CPUs. To enable CPU optimizations for the execution environment, the container rebuilds the binary on startup by default. To disable this auto-rebuild, set the environment variable REBUILD to false.

See [docs on all environment variables]({{%relref "advanced#environment-variables" %}}) for more info. {{% /notice %}}

Example: Use luna-ai-llama2 model with docker

mkdir models

# Download luna-ai-llama2 to models/
wget https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q4_0.gguf -O models/luna-ai-llama2

# Use a template from the examples
cp -rf prompt-templates/getting_started.tmpl models/luna-ai-llama2.tmpl

docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4

# The API is now accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"luna-ai-llama2","object":"model"}]}

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "luna-ai-llama2",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9
   }'

# {"model":"luna-ai-llama2","choices":[{"message":{"role":"assistant","content":"I'm doing well, thanks. How about you?"}}]}
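
A client usually only needs the assistant's text from that response. The following sketch extracts it from a body shaped like the example output above; `assistant_reply` is a hypothetical helper, and the sample string is copied from the output shown rather than produced by a live server.

```python
import json


def assistant_reply(chat_response: str) -> str:
    """Return the first choice's message content from a chat completion body."""
    body = json.loads(chat_response)
    return body["choices"][0]["message"]["content"]


# Sample body copied from the example output above.
sample = (
    '{"model":"luna-ai-llama2","choices":[{"message":'
    '{"role":"assistant","content":"I\'m doing well, thanks. How about you?"}}]}'
)
print(assistant_reply(sample))
```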

For other model configurations, see also the examples section here.

From binaries

LocalAI binary releases are available on GitHub.

You can control LocalAI with command-line arguments, for example to specify the bind address or the number of threads.

CLI parameters

| Parameter | Environment Variable | Default Value | Description |
|---|---|---|---|
| --f16 | $F16 | false | Enable f16 mode |
| --debug | $DEBUG | false | Enable debug mode |
| --cors | $CORS | false | Enable CORS support |
| --cors-allow-origins value | $CORS_ALLOW_ORIGINS | | Specify origins allowed for CORS |
| --threads value | $THREADS | 4 | Number of threads to use for parallel computation |
| --models-path value | $MODELS_PATH | ./models | Path to the directory containing models used for inferencing |
| --preload-models value | $PRELOAD_MODELS | | List of models to preload in JSON format at startup |
| --preload-models-config value | $PRELOAD_MODELS_CONFIG | | A config with a list of models to apply at startup. Specify the path to a YAML config file |
| --config-file value | $CONFIG_FILE | | Path to the config file |
| --address value | $ADDRESS | :8080 | Specify the bind address for the API server |
| --image-path value | $IMAGE_PATH | | Path to the directory used to store generated images |
| --context-size value | $CONTEXT_SIZE | 512 | Default context size of the model |
| --upload-limit value | $UPLOAD_LIMIT | 15 | Default upload limit in megabytes (audio file upload) |
| --galleries | $GALLERIES | | Allows setting galleries from the command line |
| --parallel-requests | $PARALLEL_REQUESTS | false | Enable backends to handle multiple requests in parallel. This is for backends that support multiple requests in parallel, like llama.cpp or vllm |
| --single-active-backend | $SINGLE_ACTIVE_BACKEND | false | Allow only one backend to be running |
| --api-keys value | $API_KEY | empty | List of API keys to enable API authentication. When this is set, all requests must be authenticated with one of these API keys. |
| --enable-watchdog-idle | $WATCHDOG_IDLE | false | Enable watchdog for stopping idle backends. This will stop the backends if they are idle for too long. |
| --enable-watchdog-busy | $WATCHDOG_BUSY | false | Enable watchdog for stopping busy backends that exceed a defined threshold. |
| --watchdog-busy-timeout value | $WATCHDOG_BUSY_TIMEOUT | 5m | Watchdog busy timeout. This will restart the backend if it crashes. |
| --watchdog-idle-timeout value | $WATCHDOG_IDLE_TIMEOUT | 15m | Watchdog idle timeout. This will restart the backend if it crashes. |
| --preload-backend-only | $PRELOAD_BACKEND_ONLY | false | If set, the API is NOT launched, and only the preloaded models / backends are started. This is intended for multi-node setups. |
| --external-grpc-backends | $EXTERNAL_GRPC_BACKENDS | none | Comma-separated list of external gRPC backends to use. Format: name:host:port or name:/path/to/file |
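
Because each parameter has a matching environment variable, a container can be configured with `-e NAME=value` flags instead of command-line arguments. The sketch below assembles such a `docker run` invocation as an argument list. The helper `docker_run_command` and the `/srv/models` path are hypothetical, and it assumes the container image honors these variables in the same way the binary does.

```python
import shlex


def docker_run_command(env, models_dir,
                       image="quay.io/go-skynet/local-ai:latest"):
    """Return a `docker run` argument list that passes settings via -e flags.

    env: mapping of environment variable names (see the table above) to values.
    models_dir: absolute host path mounted at /models inside the container.
    """
    args = ["docker", "run", "-p", "8080:8080",
            "-v", f"{models_dir}:/models", "-ti", "--rm"]
    for name, value in env.items():
        args += ["-e", f"{name}={value}"]
    args.append(image)
    return args


cmd = docker_run_command(
    {"MODELS_PATH": "/models", "THREADS": 4, "CONTEXT_SIZE": 700},
    models_dir="/srv/models",  # hypothetical host directory
)
print(shlex.join(cmd))
```

The list can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting issues.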

Run LocalAI in Kubernetes

LocalAI can be installed inside Kubernetes with helm.

Requirements:

  • SSD storage class, or disable mmap to load the whole model in memory

By default, the Helm chart installs a LocalAI instance using the ggml-gpt4all-j model without persistent storage.

  1. Add the helm repo
    helm repo add go-skynet https://go-skynet.github.io/helm-charts/
    
  2. Install the helm chart:
    helm repo update
    helm install local-ai go-skynet/local-ai -f values.yaml
    

Note: For further configuration options, see the helm chart repository on GitHub.

Example values

Deploy a single LocalAI pod with 6GB of persistent storage serving a ggml-gpt4all-j model with a custom prompt.

### values.yaml

replicaCount: 1

deployment:
  image: quay.io/go-skynet/local-ai:latest # CPU only; to use a GPU, change this to an image that supports it, e.g. the tag "v2.0.0-cublas-cuda12-core"
  env:
    threads: 4
    context_size: 512
  modelsPath: "/models"

resources:
  {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

# Prompt templates to include
# Note: the keys of this map will be the names of the prompt template files
promptTemplates:
  {}
  # ggml-gpt4all-j.tmpl: |
  #   The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
  #   ### Prompt:
  #   {{.Input}}
  #   ### Response:

# Models to download at runtime
models:
  # Whether to force download models even if they already exist
  forceDownload: false

  # The list of URLs to download models from
  # Note: the name of the file will be the name of the loaded model
  list:
  - url: "https://gpt4all.io/models/ggml-gpt4all-j.bin"
      # basicAuth: base64EncodedCredentials

  # Persistent storage for models and prompt templates.
  # PVC and HostPath are mutually exclusive. If both are enabled,
  # PVC configuration takes precedence. If neither are enabled, ephemeral
  # storage is used.
  persistence:
    pvc:
      enabled: false
      size: 6Gi
      accessModes:
        - ReadWriteOnce

      annotations: {}

      # Optional
      storageClass: ~

    hostPath:
      enabled: false
      path: "/models"

service:
  type: ClusterIP
  port: 80
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"

ingress:
  enabled: false
  className: ""
  annotations:
    {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

nodeSelector: {}

tolerations: []

affinity: {}
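
For small overrides, editing a full values file can be replaced by `--set key.path=value` arguments to `helm install`. The following sketch flattens a nested dictionary into such arguments. `helm_set_args` is a hypothetical helper that handles only nested mappings with scalar values; lists and values containing commas need Helm's own escaping rules and are out of scope here.

```python
def helm_set_args(values, prefix=""):
    """Flatten a nested dict into `--set key.path=value` arguments."""
    args = []
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested mappings, extending the dotted path.
            args += helm_set_args(value, path)
        else:
            if isinstance(value, bool):
                # YAML booleans are lowercase.
                value = str(value).lower()
            args += ["--set", f"{path}={value}"]
    return args


overrides = {"replicaCount": 1, "deployment": {"env": {"threads": 4, "context_size": 512}}}
print(["helm", "install", "local-ai", "go-skynet/local-ai"] + helm_set_args(overrides))
```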

Build from source

See the [build section]({{%relref "build" %}}).

Other examples

For more examples of integrating LocalAI with other projects, for instance for question answering or with chatbot-ui, see: examples.

Clients

OpenAI clients are already compatible with LocalAI: you only need to override the basePath (the target URL).

Javascript

https://github.com/openai/openai-node/

import { Configuration, OpenAIApi } from 'openai';

const configuration = new Configuration({
  basePath: `http://localhost:8080/v1`
});
const openai = new OpenAIApi(configuration);

Python

https://github.com/openai/openai-python

Set the OPENAI_API_BASE environment variable, or set the base URL in code:

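# Uses the pre-1.0 openai-python interface (openai.ChatCompletion)
import openai

# Point the client at LocalAI instead of the default OpenAI endpoint
openai.api_base = "http://localhost:8080/v1"
# Assumption: the client library expects some key to be set; a placeholder
# should be enough unless LocalAI was started with --api-keys
openai.api_key = "sk-placeholder"

# create a chat completion (the model name must match a model known to LocalAI)
chat_completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hello world"}])

# print the completion
print(chat_completion.choices[0].message.content)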