## :camel: LocalAI

> :warning: This project has been renamed from `llama-cli` to `LocalAI` to reflect the fact that we are focusing on a fast drop-in OpenAI API rather than on the CLI interface. We think there are already many projects that can be used as a CLI interface, for instance [llama.cpp](https://github.com/ggerganov/llama.cpp) and [gpt4all](https://github.com/nomic-ai/gpt4all). If you were using `llama-cli` for CLI interactions and want to keep using it, use older versions or please open up an issue - contributions are welcome!

LocalAI is a straightforward, drop-in replacement API compatible with OpenAI for local CPU inferencing, based on [llama.cpp](https://github.com/ggerganov/llama.cpp), [gpt4all](https://github.com/nomic-ai/gpt4all) and [ggml](https://github.com/ggerganov/ggml), including support for GPT4ALL-J, which is Apache 2.0 licensed and can be used for commercial purposes.

- OpenAI-compatible API
- Supports multiple models
- Once loaded the first time, it keeps models in memory for faster inference
- Provides a simple command line interface that allows text generation directly from the terminal
- Support for prompt templates
- Doesn't shell out, but uses C bindings for faster inference and better performance. Uses [go-llama.cpp](https://github.com/go-skynet/go-llama.cpp) and [go-gpt4all-j.cpp](https://github.com/go-skynet/go-gpt4all-j.cpp).
## Model compatibility

It is compatible with the models supported by [llama.cpp](https://github.com/ggerganov/llama.cpp), and also with [GPT4ALL-J](https://github.com/nomic-ai/gpt4all).

Note: you might need to convert older models to the new format; see [here](https://github.com/ggerganov/llama.cpp#using-gpt4all), for instance, to run `gpt4all`.
## Usage

> `LocalAI` comes by default as a container image. You can check out all the available images with corresponding tags [here](https://quay.io/repository/go-skynet/local-ai?tab=tags&tag=latest).

The easiest way to run LocalAI is by using `docker-compose`:

```bash
git clone https://github.com/go-skynet/LocalAI

cd LocalAI

# copy your models to models/
cp your-model.bin models/

# (optional) Edit the .env file to set things like context size and threads
# vim .env

# start with docker-compose
docker compose up -d --build

# Now the API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"your-model.bin","object":"model"}]}

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "your-model.bin",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
```
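For instance, a minimal `.env` might set the thread count and context size. The variable names below are the ones listed in the API parameters table further down; the values are only an illustration:

```
THREADS=4
CONTEXT_SIZE=512
```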
## Prompt templates

The API doesn't inject a default prompt for talking to the model. You have to use a prompt similar to what's described in the stanford-alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release.

<details>
You can use a default template for every model present in your model path by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl`, which will be used as a default prompt. For instance, this template can be used with alpaca:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{{.Input}}

### Response:
```

See the [prompt-templates](https://github.com/go-skynet/LocalAI/tree/master/prompt-templates) directory in this repository for templates for most popular models.
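As a sketch of how to pair a template with a model, assuming you are in the repository root, your model is `models/foo.bin`, and one of the repository templates fits it (both names here are illustrative):

```bash
# copy a repository template next to the model, named after the model plus a .tmpl suffix
cp prompt-templates/alpaca.tmpl models/foo.bin.tmpl
```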
</details>
## API

`LocalAI` provides an API for running text generation as a service that follows the OpenAI reference and can be used as a drop-in replacement. Once loaded the first time, models are kept in memory.

<details>
Example of starting the API with `docker`:

```bash
docker run -p 8080:8080 -ti --rm quay.io/go-skynet/local-api:latest --models-path /path/to/models --context-size 700 --threads 4
```

And you'll see:

```
┌───────────────────────────────────────────────────┐
│                   Fiber v2.42.0                   │
│               http://127.0.0.1:8080               │
│       (bound on host 0.0.0.0 and port 8080)       │
│                                                   │
│ Handlers ............. 1  Processes ........... 1 │
│ Prefork ....... Disabled  PID ................. 1 │
└───────────────────────────────────────────────────┘
```

Note: models have to end with the `.bin` suffix so that they can be listed by the `/models` endpoint.
You can control the API server options with command line arguments:

```
local-api --models-path <model_path> [--address <address>] [--threads <num_threads>]
```

The API takes the following parameters:

| Parameter    | Environment Variable | Default Value | Description                                           |
| ------------ | -------------------- | ------------- | ----------------------------------------------------- |
| models-path  | MODELS_PATH          |               | The path where you have models (ending with `.bin`).  |
| threads      | THREADS              | CPU cores     | The number of threads to use for text generation.     |
| address      | ADDRESS              | :8080         | The address and port to listen on.                    |
| context-size | CONTEXT_SIZE         | 512           | Default token context size.                           |
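Since each flag has an environment variable counterpart (per the table above), a hypothetical equivalent of the command line invocation is:

```bash
# same configuration expressed via environment variables (values are illustrative)
MODELS_PATH=/path/to/models THREADS=4 ADDRESS=:8080 CONTEXT_SIZE=512 local-api
```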
Once the server is running, you can start making requests to it over HTTP, using the OpenAI API format.

</details>
### Supported OpenAI API endpoints

You can check out the [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create).

Following is the list of supported endpoints and parameters.

#### Chat completions

For example, to generate a chat completion, you can send a POST request to the `/v1/chat/completions` endpoint with the instruction as the request body:

```
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-koala-7b-model-q4_0-r2.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'
```
Available additional parameters: `top_p`, `top_k`, `max_tokens`
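For instance (a sketch with arbitrary parameter values), the additional parameters go in the same request body:

```
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-koala-7b-model-q4_0-r2.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7,
     "top_p": 0.9,
     "top_k": 40,
     "max_tokens": 100
   }'
```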

#### Completions

For example, to generate a completion, you can send a POST request to the `/v1/completions` endpoint with the instruction as the request body:

```
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-koala-7b-model-q4_0-r2.bin",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
```

Available additional parameters: `top_p`, `top_k`, `max_tokens`
#### List models

You can list all the models available with:

```
curl http://localhost:8080/v1/models
```
## Using other models

gpt4all (https://github.com/nomic-ai/gpt4all) works as well; however, the original model needs to be converted (the same applies to old alpaca models, too):

```bash
wget -O tokenizer.model https://huggingface.co/decapoda-research/llama-30b-hf/resolve/main/tokenizer.model

mkdir models
cp gpt4all.. models/

git clone https://gist.github.com/eiz/828bddec6162a023114ce19146cb2b82
pip install sentencepiece
python 828bddec6162a023114ce19146cb2b82/gistfile1.txt models tokenizer.model

# There will be a new model with the ".tmp" extension, you have to use that one!
```
### Windows compatibility

It should work; however, you need to make sure you give enough resources to the container. See https://github.com/go-skynet/LocalAI/issues/2
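For example (resource values are arbitrary and depend on your machine), you can raise the container limits with standard `docker` flags:

```bash
# give the container 4 CPUs and 8 GiB of memory
docker run --cpus 4 --memory 8g -p 8080:8080 -ti --rm quay.io/go-skynet/local-api:latest --models-path /path/to/models --threads 4
```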
### Kubernetes

You can run the API in Kubernetes; see an example deployment in the [kubernetes](https://github.com/go-skynet/LocalAI/tree/master/kubernetes) directory.
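A minimal sketch, assuming the example manifests apply to your cluster unchanged:

```bash
# clone the repository and apply the example manifests
git clone https://github.com/go-skynet/LocalAI
kubectl apply -f LocalAI/kubernetes/
```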
### Build locally

Pre-built images should fit most modern hardware; however, you can build the images manually, and on some systems you might need to.

In order to build the `LocalAI` container image locally you can use `docker`:

```
# build the image (image names must be lowercase)
docker build -t local-ai .
docker run local-ai
```

Or build the binary with `make`:

```
make build
```
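After building, you can start the server directly; this sketch assumes `make build` places a `local-api` binary (the name used in the usage examples above) in the repository root:

```bash
# hypothetical: run the locally built server binary
./local-api --models-path /path/to/models --threads 4
```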
## Short-term roadmap

- [x] Mimic OpenAI API (https://github.com/go-skynet/LocalAI/issues/10)
- [ ] Binary releases (https://github.com/go-skynet/LocalAI/issues/6)
- [ ] Upstream our golang bindings to llama.cpp (https://github.com/ggerganov/llama.cpp/issues/351)
- [x] Multi-model support
- [ ] Have a webUI!
## License

MIT
## Acknowledgements

- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- https://github.com/tatsu-lab/stanford_alpaca
- https://github.com/cornelk/llama-go for the initial ideas
- https://github.com/antimatter15/alpaca.cpp for the light model version (this is compatible and tested only with that checkpoint model!)