
## :camel: LocalAI

> :warning: This project has been renamed from `llama-cli` to `LocalAI` to reflect the fact that we are focusing on a fast drop-in OpenAI API rather than on the CLI interface. We think there are already many projects that can be used as a CLI interface, for instance [llama.cpp](https://github.com/ggerganov/llama.cpp) and [gpt4all](https://github.com/nomic-ai/gpt4all). If you were using `llama-cli` for CLI interactions and want to keep using it, use older versions or please open an issue. Contributions are welcome!

LocalAI is a straightforward, drop-in replacement API compatible with OpenAI for local CPU inferencing, based on [llama.cpp](https://github.com/ggerganov/llama.cpp), [gpt4all](https://github.com/nomic-ai/gpt4all) and [ggml](https://github.com/ggerganov/ggml). It includes support for GPT4ALL-J, which is Apache 2.0 licensed and can be used for commercial purposes.

- OpenAI compatible API
- Supports multiple models
- Once loaded the first time, it keeps models loaded in memory for faster inference
- Provides a simple command line interface that allows text generation directly from the terminal
- Support for prompt templates
- Doesn't shell out, but uses C bindings for faster inference and better performance. Uses [go-llama.cpp](https://github.com/go-skynet/go-llama.cpp) and [go-gpt4all-j.cpp](https://github.com/go-skynet/go-gpt4all-j.cpp).

## Model compatibility

It is compatible with the models supported by [llama.cpp](https://github.com/ggerganov/llama.cpp) and also [GPT4ALL-J](https://github.com/nomic-ai/gpt4all).

Note: You might need to convert older models to the new format. See [here](https://github.com/ggerganov/llama.cpp#using-gpt4all) for an example of how to run `gpt4all`.

## Usage

> `LocalAI` comes by default as a container image. You can check out all the available images with corresponding tags [here](https://quay.io/repository/go-skynet/local-ai?tab=tags&tag=latest).

The easiest way to run LocalAI is by using `docker-compose`:

```bash

git clone https://github.com/go-skynet/LocalAI

cd LocalAI

# copy your models to models/
cp your-model.bin models/

# (optional) Edit the .env file to set things like context size and threads
# vim .env

# start with docker-compose
docker compose up -d --build

# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"your-model.bin","object":"model"}]}

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "your-model.bin",            
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
```

## Prompt templates 

The API doesn't inject a default prompt for talking to the model. You have to use a prompt similar to what's described in the stanford-alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release.

<details>

You can use a default template for every model present in your model path by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl`, which will be used as a default prompt. For example, this template can be used with alpaca:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{{.Input}}

### Response:
```

See the [prompt-templates](https://github.com/go-skynet/LocalAI/tree/master/prompt-templates) directory in this repository for templates for most popular models.

</details>
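
As a rough sketch of the template mechanism described above, the following creates a `.tmpl` file next to a model and previews the prompt it would produce. The `./models` directory and the `foo.bin` name are placeholders, and the preview uses a plain string substitution for illustration only; it is not how LocalAI itself renders templates.

```bash
#!/usr/bin/env bash
# Assumption: models live in ./models, as in the docker-compose setup above.
MODELS_DIR="${MODELS_DIR:-./models}"
MODEL_FILE="foo.bin"   # hypothetical model file name

mkdir -p "$MODELS_DIR"

# Write the Alpaca-style template next to the model as <model>.tmpl.
# The quoted 'EOF' stops the shell from expanding anything inside the template.
cat > "$MODELS_DIR/$MODEL_FILE.tmpl" <<'EOF'
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{{.Input}}

### Response:
EOF

# Preview roughly what the model would receive for a given prompt.
# Plain string replacement for illustration, not LocalAI's template engine.
placeholder='{{.Input}}'
user_prompt="Say this is a test!"
template="$(cat "$MODELS_DIR/$MODEL_FILE.tmpl")"
preview=${template//"$placeholder"/"$user_prompt"}
printf '%s\n' "$preview"
```

With the template in place, requests for `foo.bin` only need to carry the user's instruction; the surrounding Alpaca scaffolding comes from the file.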

## API

`LocalAI` provides an API for running text generation as a service. It follows the OpenAI reference and can be used as a drop-in replacement. Once loaded for the first time, models are kept in memory.

<details>

Example of starting the API with `docker`:

```bash
docker run -p 8080:8080 -ti --rm quay.io/go-skynet/local-ai:latest --models-path /path/to/models --context-size 700 --threads 4
```

And you'll see:
```
┌───────────────────────────────────────────────────┐ 
│                   Fiber v2.42.0                   │ 
│               http://127.0.0.1:8080               │ 
│       (bound on host 0.0.0.0 and port 8080)       │ 
│                                                   │ 
│ Handlers ............. 1  Processes ........... 1 │ 
│ Prefork ....... Disabled  PID ................. 1 │ 
└───────────────────────────────────────────────────┘ 
```

Note: Model file names must end with `.bin` so that they can be listed by the `/models` endpoint.

You can control the API server options with command line arguments:

```
local-api --models-path <model_path> [--address <address>] [--threads <num_threads>]
```

The API takes the following parameters:

| Parameter    | Environment Variable | Default Value | Description                                           |
| ------------ | -------------------- | ------------- | ----------------------------------------------------- |
| models-path  | MODELS_PATH          |               | The path where you have models (ending with `.bin`).  |
| threads      | THREADS              | CPU cores     | The number of threads to use for text generation.     |
| address      | ADDRESS              | :8080         | The address and port to listen on.                    |
| context-size | CONTEXT_SIZE         | 512           | Default token context size.                           |

Once the server is running, you can start making HTTP requests to it using the OpenAI API.

</details>
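
The environment variables from the parameter table can stand in for the command line flags. A minimal sketch of `.env`-style settings follows, assuming the docker-compose setup passes the `.env` file through to the container. The values are illustrative and not recommendations.

```bash
# Hypothetical .env values; variable names are taken from the parameter table.
MODELS_PATH=/models   # directory containing the *.bin model files (assumed mount point)
THREADS=4             # number of threads used for text generation
CONTEXT_SIZE=700      # default token context size
ADDRESS=:8080         # address and port to listen on
```

These lines mirror `--models-path /models --threads 4 --context-size 700 --address :8080` on the command line.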

### Supported OpenAI API endpoints

You can check out the [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create).

The following endpoints and parameters are supported.

#### Chat completions

For example, to generate a chat completion, you can send a POST request to the `/v1/chat/completions` endpoint with the instruction as the request body:

```
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-koala-7b-model-q4_0-r2.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'
```

Available additional parameters: `top_p`, `top_k`, `max_tokens`
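
The additional parameters go in the same JSON body as `temperature`. The sketch below shows one way to pass them. The values for `top_p`, `top_k` and `max_tokens` are illustrative, and `LOCALAI_URL` and `send_chat` are hypothetical helper names introduced only for this example.

```bash
#!/usr/bin/env bash
# Assumption: the server is reachable on localhost:8080, as in the examples above.
LOCALAI_URL="${LOCALAI_URL:-http://localhost:8080}"

# Request body with the additional sampling parameters (illustrative values).
payload=$(cat <<'EOF'
{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "max_tokens": 64
}
EOF
)

# Send the request; call this once the server is up.
send_chat() {
  curl -s "$LOCALAI_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$payload"
}
```

Running `send_chat` posts the payload to the chat completions endpoint. The same extra fields can be added to a `/v1/completions` request body.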

#### Completions

For example, to generate a completion, you can send a POST request to the `/v1/completions` endpoint with the instruction as the request body:
```
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-koala-7b-model-q4_0-r2.bin",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
```

Available additional parameters: `top_p`, `top_k`, `max_tokens`

#### List models

You can list all the models available with:

```
curl http://localhost:8080/v1/models
```
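
To use the model list in a script, the `id` values can be pulled out of the response. The helper below is a text-based sketch (the name `model_ids` is made up for this example) and is run against the sample response shown in the Usage section. A JSON-aware tool such as `jq` would be more robust if it is available.

```bash
#!/usr/bin/env bash
# Extract every "id" value from a /v1/models response read on stdin.
# Text matching only; with jq installed, `jq -r '.data[].id'` is the sturdier choice.
model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | cut -d '"' -f 4
}

# Sample response copied from the Usage section.
sample='{"object":"list","data":[{"id":"your-model.bin","object":"model"}]}'
printf '%s\n' "$sample" | model_ids
# prints: your-model.bin
```

Against a running server, the same helper can be fed directly: `curl -s http://localhost:8080/v1/models | model_ids`.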

## Using other models

[gpt4all](https://github.com/nomic-ai/gpt4all) works as well; however, the original model needs to be converted (the same applies to old alpaca models):

```bash
wget -O tokenizer.model https://huggingface.co/decapoda-research/llama-30b-hf/resolve/main/tokenizer.model
mkdir models
cp gpt4all.. models/
git clone https://gist.github.com/eiz/828bddec6162a023114ce19146cb2b82
pip install sentencepiece
python 828bddec6162a023114ce19146cb2b82/gistfile1.txt models tokenizer.model
# There will be a new model with the ".tmp" extension, you have to use that one!
```

### Windows compatibility

It should work; however, you need to make sure you give enough resources to the container. See https://github.com/go-skynet/LocalAI/issues/2.

### Kubernetes

You can run the API in Kubernetes; see the example deployment in [kubernetes](https://github.com/go-skynet/LocalAI/tree/master/kubernetes).

### Build locally

Pre-built images should work well on most modern hardware; however, you can build the images manually, and in some cases you might need to.

In order to build the `LocalAI` container image locally you can use `docker`:

```
# build the image (Docker image names must be lowercase)
docker build -t localai .
docker run localai
```

Or build the binary with `make`:

```
make build
```

## Short-term roadmap

- [x] Mimic OpenAI API (https://github.com/go-skynet/LocalAI/issues/10)
- [ ] Binary releases (https://github.com/go-skynet/LocalAI/issues/6)
- [ ] Upstream our golang bindings to llama.cpp (https://github.com/ggerganov/llama.cpp/issues/351)
- [x] Multi-model support
- [ ] Have a webUI!

## License

MIT

## Acknowledgements

- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- https://github.com/tatsu-lab/stanford_alpaca
- https://github.com/cornelk/llama-go for the initial ideas
- https://github.com/antimatter15/alpaca.cpp for the light model version (this is compatible and tested only with that checkpoint model!)