mirror of
https://github.com/mudler/LocalAI.git
synced 2024-12-18 20:27:57 +00:00
c89271b2e4
* feat(llama.cpp): support distributed llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: let tweak how chat messages are merged together Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Makefile: register to ALL_GRPC_BACKENDS Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring, allow disable auto-detection of backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * minor fixups Signed-off-by: mudler <mudler@localai.io> * feat: add cmd to start rpc-server from llama.cpp Signed-off-by: mudler <mudler@localai.io> * ci: add ccache Signed-off-by: mudler <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: mudler <mudler@localai.io>
95 lines
3.7 KiB
Bash
## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
# LOCALAI_THREADS=14

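## Example for LOCALAI_THREADS (a sketch, assuming a Linux host with lscpu from util-linux):
## count physical cores by listing unique core/socket pairs, then use that number above.
#   lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l
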
## Specify a different bind address (defaults to ":8080")
# LOCALAI_ADDRESS=127.0.0.1:8080

## Default context size for models
# LOCALAI_CONTEXT_SIZE=512
#
## Define galleries.
## Models available to install will be listed at `/models/available`
# LOCALAI_GALLERIES=[{"name":"localai", "url":"github:mudler/LocalAI/gallery/index.yaml@master"}]

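## Example for LOCALAI_GALLERIES (a sketch): the value is a JSON list, so several galleries can be declared at once.
## The second entry ("my-gallery" and its URL) is hypothetical and only illustrates the format.
# LOCALAI_GALLERIES=[{"name":"localai", "url":"github:mudler/LocalAI/gallery/index.yaml@master"}, {"name":"my-gallery", "url":"https://example.com/gallery/index.yaml"}]
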
## CORS settings
# LOCALAI_CORS=true
# LOCALAI_CORS_ALLOW_ORIGINS=*

## Default path for models
#
# LOCALAI_MODELS_PATH=/models

## Enable debug mode
# LOCALAI_LOG_LEVEL=debug

## Disables COMPEL (Diffusers)
# COMPEL=0

## Enable/Disable single backend (useful if only one GPU is available)
# LOCALAI_SINGLE_ACTIVE_BACKEND=true

## Specify a build type. Available: cublas, openblas, clblas.
## cuBLAS: a GPU-accelerated implementation of the standard BLAS (Basic Linear Algebra Subprograms) library. It is provided by Nvidia as part of the CUDA toolkit.
## OpenBLAS: an open-source implementation of BLAS optimized for many platforms. It supports multi-threading and can be compiled to use hardware-specific features for additional performance. It runs on CPUs from Intel, AMD, and ARM, among others.
## clBLAS: an open-source implementation of BLAS built on OpenCL, a framework for programs that run across heterogeneous platforms (CPUs, GPUs, and other processors). It is designed for GPU parallelism but runs on any hardware that supports OpenCL, including devices from Nvidia, AMD, and Intel.
# BUILD_TYPE=openblas

## Uncomment and set to true to enable rebuilding from source
# REBUILD=true

## Enable Go tags. Available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper
## (requires REBUILD=true)
#
# GO_TAGS=stablediffusion

## Path where generated images are stored
# LOCALAI_IMAGE_PATH=/tmp/generated/images

## Specify a default upload limit in MB (whisper)
# LOCALAI_UPLOAD_LIMIT=15

## List of external GRPC backends (note: on the container image, this variable is already set to use the extra backends available in extra/)
# LOCALAI_EXTERNAL_GRPC_BACKENDS=my-backend:127.0.0.1:9000,my-backend2:/usr/bin/backend.py

### Advanced settings ###
### These are not used by LocalAI itself, but by components in the stack ###
##
### Preload libraries
# LD_PRELOAD=

### Hugging Face cache for models
# HUGGINGFACE_HUB_CACHE=/usr/local/huggingface

### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This controls whether a backend can process multiple requests at once.
# PYTHON_GRPC_MAX_WORKERS=1

### Define the number of parallel llama.cpp workers (defaults to 1)
# LLAMACPP_PARALLEL=1

### Define a list of GRPC servers for llama.cpp workers to distribute the load
# https://github.com/ggerganov/llama.cpp/pull/6829
# https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md
# LLAMACPP_GRPC_SERVERS=""

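### Example for LLAMACPP_GRPC_SERVERS (a sketch): the value is assumed to be a comma-separated
### list of host:port pairs, one per llama.cpp rpc-server worker. The addresses below are hypothetical.
# LLAMACPP_GRPC_SERVERS="192.168.1.10:50052,192.168.1.11:50052"
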
### Enable to process requests in parallel
# LOCALAI_PARALLEL_REQUESTS=true

### Watchdog settings
###
# Enables the watchdog to kill backends that have been inactive for too long
# LOCALAI_WATCHDOG_IDLE=true
#
# Time in duration format (e.g. 1h30m) after which a backend is considered idle
# LOCALAI_WATCHDOG_IDLE_TIMEOUT=5m
#
# Enables the watchdog to kill backends that have been busy for too long
# LOCALAI_WATCHDOG_BUSY=true
#
# Time in duration format (e.g. 1h30m) after which a backend is considered busy
# LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m
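#
# Example watchdog setup (a sketch with illustrative values, not recommended defaults):
# kill backends that stay idle for 15 minutes or busy for more than 1 hour 30 minutes.
# LOCALAI_WATCHDOG_IDLE=true
# LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m
# LOCALAI_WATCHDOG_BUSY=true
# LOCALAI_WATCHDOG_BUSY_TIMEOUT=1h30m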