+++
disableToc = false
title = "🗣 Text to audio (TTS)"
weight = 11
url = "/features/text-to-audio/"
+++

## API Compatibility

The LocalAI TTS API is compatible with the [OpenAI TTS API](https://platform.openai.com/docs/guides/text-to-speech) and the [Elevenlabs](https://api.elevenlabs.io/docs) API.

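For example, the OpenAI-style route can be used as a drop-in replacement. A minimal sketch, assuming LocalAI is listening on `localhost:8080`; the model name `tts-1` is a placeholder, use a model you have configured:

```bash
# Sketch: call the OpenAI-compatible speech route and save the result.
# "tts-1" is an assumed model name: replace it with one from your instance.
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello world"
  }' --output speech.wav
```
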
## LocalAI API

The `/tts` endpoint can also be used to generate speech from text.

## Usage

Input: `input`, `model`

For example, to generate an audio file, you can send a POST request to the `/tts` endpoint with the text to synthesize as the request body:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}'
```

Returns an `audio/wav` file.

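Instead of streaming the response to the terminal, you can write it to a file and play it back with any audio player. A minimal sketch; the output path `hello.wav` is just an example:

```bash
# Save the generated speech to a WAV file...
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world", "model": "tts"}' \
  --output hello.wav

# ...then play it back (aplay is Linux-only; use any player you like).
aplay hello.wav
```
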
## Backends

### 🐸 Coqui

Required: Don't use `LocalAI` images ending with the `-core` tag. Python dependencies are required in order to use this backend.

Coqui works without any configuration. To test it, you can run the following curl command:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "coqui",
  "model": "tts_models/en/ljspeech/glow-tts",
  "input": "Hello, this is a test!"
}'
```

You can use the environment variable `COQUI_LANGUAGE` to set the language used by the coqui backend.

You can also use config files to configure tts models (see the section below on using config files).

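For example, if you run LocalAI with Docker, the variable can be passed to the container. A minimal sketch; the image tag is an example, pick a non `-core` image that includes the Python backends:

```bash
# Sketch: start LocalAI with the Coqui backend language set to German.
# The image tag is an assumption: adjust it to your installation.
docker run -p 8080:8080 \
  -e COQUI_LANGUAGE=de \
  localai/localai:latest
```
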
### Bark

[Bark](https://github.com/suno-ai/bark) allows you to generate audio from text prompts.

This is an extra backend: it is already available in the container images and there is nothing to do for the setup.

#### Model setup

There is nothing to be done for the model setup. You can start using bark right away; the models will be downloaded automatically the first time you use the backend.

#### Usage

Use the `tts` endpoint by specifying the `bark` backend:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "bark",
  "input": "Hello!"
}' | aplay
```

To specify a voice from the [Bark voice presets](https://github.com/suno-ai/bark#-voice-presets) (see also the [full list](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c)), use the `model` parameter:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "bark",
  "input": "Hello!",
  "model": "v2/en_speaker_4"
}' | aplay
```

### Piper

To install the `piper` audio models manually:

- Download voices from https://github.com/rhasspy/piper/releases/tag/v0.0.2 (see the sketch below for an example)
- Extract the `.tar.gz` archives (each contains an `.onnx` model and a `.json` config) inside the `models` directory
- Run the curl command in the next section to test that the model works

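A minimal sketch of the manual install, assuming a Linux shell and the voice used in the example below; the archive name is an assumption, double-check it on the release page:

```bash
# Sketch: fetch and unpack a piper voice into the models directory.
# The asset name below is an assumption: verify it on the release page.
cd models
curl -LO https://github.com/rhasspy/piper/releases/download/v0.0.2/voice-it-riccardo_fasol-x-low.tar.gz
tar -xzf voice-it-riccardo_fasol-x-low.tar.gz
# After extraction you should have the .onnx model and its .json config
# next to each other in the models directory.
```
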
To use the `/tts` endpoint, run the following command. You can specify a backend with the `backend` parameter. For example, to use the `piper` backend:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "it-riccardo_fasol-x-low.onnx",
  "backend": "piper",
  "input": "Ciao, sono Ettore"
}' | aplay
```

Note:

- `aplay` is a Linux command. You can use other tools to play the audio file.
- The model name is the filename with the extension.
- The model name is case sensitive.
- LocalAI must be compiled with the `GO_TAGS=tts` flag.

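If you build LocalAI from source, the tag is passed at build time. A minimal sketch, following the project's `make` conventions:

```bash
# Sketch: build LocalAI from source with the tts Go tag enabled.
git clone https://github.com/mudler/LocalAI
cd LocalAI
make GO_TAGS=tts build
```
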
### Transformers-musicgen

LocalAI also has experimental support for `transformers-musicgen` for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:

```bash
curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-medium",
    "input": "Cello Rave"
  }' | aplay
```

Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.

### Vall-E-X

[VALL-E-X](https://github.com/Plachtaa/VALL-E-X) is an open source implementation of Microsoft's VALL-E X zero-shot TTS model.

#### Setup

The backend will automatically download the required files in order to run the model.

This is an extra backend: it is already available in the container images and there is nothing to do for the setup. If you are building LocalAI manually, you need to install Vall-E-X first.

#### Usage

Use the `/tts` endpoint and specify the `vall-e-x` backend:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "vall-e-x",
  "input": "Hello!"
}' | aplay
```

#### Voice cloning

In order to use the voice cloning capabilities, you must create a `YAML` configuration file to set up a model:

```yaml
name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
tts:
  vall-e:
    # The path to the audio file to be cloned,
    # relative to the models directory.
    # Max 15s.
    audio_path: "audio-sample.wav"
```

Then you can specify the model name in the requests:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "cloned-voice",
  "input": "Hello!"
}' | aplay
```

### Parler-tts

[parler-tts](https://github.com/huggingface/parler-tts) can be installed and configured directly from the model gallery.

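A minimal sketch of installing it from the gallery via the API; the gallery id below is an assumption, browse the gallery for the exact entry name:

```bash
# Sketch: install a parler-tts model through the model gallery endpoint.
# The "id" value is a placeholder: look up the real gallery entry first.
curl http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"id": "parler-tts-mini-v0.1"}'
```
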
## Using config files

You can also use a config file to specify TTS models and their parameters.

In the following example we define a custom config to load the `xtts_v2` model, and specify a voice and language.

```yaml
name: xtts_v2
backend: coqui
parameters:
  language: fr
  model: tts_models/multilingual/multi-dataset/xtts_v2

tts:
  voice: Ana Florence
```

With this config, you can now use the following curl command to generate a text-to-speech audio file:

```bash
curl -L http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "xtts_v2",
    "input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
  }' | aplay
```