# whisper.cpp

![whisper.cpp](https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg)

[![Actions Status](https://github.com/ggerganov/whisper.cpp/workflows/CI/badge.svg)](https://github.com/ggerganov/whisper.cpp/actions)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Conan Center](https://shields.io/conan/v/whisper-cpp)](https://conan.io/center/whisper-cpp)
[![npm](https://img.shields.io/npm/v/whisper.cpp.svg)](https://www.npmjs.com/package/whisper.cpp/)

Stable: [v1.7.3](https://github.com/ggerganov/whisper.cpp/releases/tag/v1.7.3) / [Roadmap | F.A.Q.](https://github.com/ggerganov/whisper.cpp/discussions/126)

High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisper) automatic speech recognition (ASR) model:

- Plain C/C++ implementation without dependencies
- Apple Silicon first-class citizen - optimized via ARM NEON, Accelerate framework, Metal and [Core ML](#core-ml-support)
- AVX intrinsics support for x86 architectures
- VSX intrinsics support for POWER architectures
- Mixed F16 / F32 precision
- [Integer quantization support](#quantization)
- Zero memory allocations at runtime
- [Vulkan support](#vulkan-gpu-support)
- Support for CPU-only inference
- [Efficient GPU support for NVIDIA](#nvidia-gpu-support)
- [OpenVINO Support](#openvino-support)
- [Ascend NPU Support](#ascend-npu-support)
- [C-style API](https://github.com/ggerganov/whisper.cpp/blob/master/include/whisper.h)

Supported platforms:
- [x] Mac OS (Intel and Arm)
- [x] [iOS](examples/whisper.objc)
- [x] [Android](examples/whisper.android)
- [x] [Java](bindings/java/README.md)
- [x] Linux / [FreeBSD](https://github.com/ggerganov/whisper.cpp/issues/56#issuecomment-1350920264)
- [x] [WebAssembly](examples/whisper.wasm)
- [x] Windows ([MSVC](https://github.com/ggerganov/whisper.cpp/blob/master/.github/workflows/build.yml#L117-L144) and [MinGW](https://github.com/ggerganov/whisper.cpp/issues/168))
- [x] [Raspberry Pi](https://github.com/ggerganov/whisper.cpp/discussions/166)
- [x] [Docker](https://github.com/ggerganov/whisper.cpp/pkgs/container/whisper.cpp)

The entire high-level implementation of the model is contained in [whisper.h](include/whisper.h) and [whisper.cpp](src/whisper.cpp).
The rest of the code is part of the [`ggml`](https://github.com/ggerganov/ggml) machine learning library.

Having such a lightweight implementation of the model makes it easy to integrate into different platforms and applications.
As an example, here is a video of running the model on an iPhone 13 device - fully offline, on-device: [whisper.objc](examples/whisper.objc)

https://user-images.githubusercontent.com/1991296/197385372-962a6dea-bca1-4d50-bf96-1d8c27b98c81.mp4

You can also easily make your own offline voice assistant application: [command](examples/command)

https://user-images.githubusercontent.com/1991296/204038393-2f846eae-c255-4099-a76d-5735c25c49da.mp4

On Apple Silicon, the inference runs fully on the GPU via Metal:

https://github.com/ggerganov/whisper.cpp/assets/1991296/c82e8f86-60dc-49f2-b048-d2fdbd6b5225

## Quick start

First clone the repository:

```bash
git clone https://github.com/ggerganov/whisper.cpp.git
```

Navigate into the directory:

```bash
cd whisper.cpp
```

Then, download one of the Whisper [models](models/README.md) converted in [`ggml` format](#ggml-format). For example:

```bash
sh ./models/download-ggml-model.sh base.en
```

Now build the [whisper-cli](examples/cli) example and transcribe an audio file like this:

```bash
# build the project
cmake -B build
cmake --build build --config Release

# transcribe an audio file
./build/bin/whisper-cli -f samples/jfk.wav
```

---

For a quick demo, simply run `make base.en`.

The command downloads the `base.en` model converted to custom `ggml` format and runs the inference on all `.wav` samples in the folder `samples`.

For detailed usage instructions, run: `./build/bin/whisper-cli -h`

Note that the [whisper-cli](examples/cli) example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool.
For example, you can use `ffmpeg` like this:

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```
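
To confirm that a file already has the layout produced by the `ffmpeg` command above, its header can be inspected before transcription. The following is a minimal sketch, assuming a canonical 44-byte PCM WAV header (no extra chunks before `fmt `) and a little-endian host. The helper names `wav_field` and `is_16bit_16k_mono` are hypothetical and not part of `whisper.cpp`.

```bash
# Sketch only: assumes a canonical 44-byte PCM WAV header and a
# little-endian host. Both helper names are hypothetical.

# wav_field FILE OFFSET SIZE -> unsigned integer of SIZE bytes at OFFSET
wav_field() {
    od -An -tu"$3" -j"$2" -N"$3" "$1" | tr -d ' \n'
}

# is_16bit_16k_mono FILE -> succeeds for mono, 16 kHz, 16-bit input
is_16bit_16k_mono() {
    # offset 22: number of channels
    [ "$(wav_field "$1" 22 2)" = "1" ] || return 1
    # offset 24: sample rate in Hz
    [ "$(wav_field "$1" 24 4)" = "16000" ] || return 1
    # offset 34: bits per sample
    [ "$(wav_field "$1" 34 2)" = "16" ]
}
```

For example, `is_16bit_16k_mono output.wav || echo "convert with ffmpeg first"` could guard a call to `whisper-cli`.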
## More audio samples

If you want some extra audio samples to play with, simply run:

```
make -j samples
```

This will download a few more audio files from Wikipedia and convert them to 16-bit WAV format via `ffmpeg`.

You can download and run the other models as follows:

```
make -j tiny.en
make -j tiny
make -j base.en
make -j base
make -j small.en
make -j small
make -j medium.en
make -j medium
make -j large-v1
make -j large-v2
make -j large-v3
make -j large-v3-turbo
```
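
When scripting over several models, it can help to validate a name against the targets listed above before invoking `make`. This is a small sketch: `KNOWN_MODELS` and `is_known_model` are illustrative names, and the list simply mirrors the commands above.

```bash
# Model names taken from the `make` targets listed above.
KNOWN_MODELS="tiny.en tiny base.en base small.en small medium.en medium large-v1 large-v2 large-v3 large-v3-turbo"

# is_known_model NAME -> succeeds only if NAME is one of the targets above
# (hypothetical helper, not part of the project)
is_known_model() {
    case " $KNOWN_MODELS " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

A loop such as `for m in tiny base; do is_known_model "$m" && make -j "$m"; done` would then skip anything that is not a listed target.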
## Memory usage

| Model  | Disk    | Mem     |
| ------ | ------- | ------- |
| tiny   | 75 MiB  | ~273 MB |
| base   | 142 MiB | ~388 MB |
| small  | 466 MiB | ~852 MB |
| medium | 1.5 GiB | ~2.1 GB |
| large  | 2.9 GiB | ~3.9 GB |
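
As a rough illustration of how the table might be used, the sketch below picks the largest model whose approximate memory footprint fits a given budget. It assumes that 2.1 GB and 3.9 GB can be rounded to 2100 MB and 3900 MB. `pick_model` is a hypothetical helper, not part of the project.

```bash
# pick_model MEM_MB -> name of the largest model from the table above whose
# approximate memory use fits into MEM_MB megabytes.
# Fails and prints nothing if even `tiny` does not fit.
# The 2100 and 3900 entries are rounded from "~2.1 GB" and "~3.9 GB".
pick_model() {
    budget=$1
    choice=""
    for entry in tiny:273 base:388 small:852 medium:2100 large:3900; do
        name=${entry%%:*}   # text before the colon
        need=${entry##*:}   # text after the colon
        if [ "$need" -le "$budget" ]; then
            choice=$name
        fi
    done
    [ -n "$choice" ] || return 1
    echo "$choice"
}
```

The figures are approximate, so a real script would probably leave some headroom rather than match the budget exactly.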
## Quantization

`whisper.cpp` supports integer quantization of the Whisper `ggml` models.
Quantized models require less memory and disk space and, depending on the hardware, can be processed more efficiently.

Here are the steps for creating and using a quantized model:

```bash
# quantize a model with Q5_0 method
cmake -B build
cmake --build build --config Release
./build/bin/quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# run the examples as usual, specifying the quantized model file
./build/bin/whisper-cli -m models/ggml-base.en-q5_0.bin ./samples/gb0.wav
```
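
The output name in the example follows a simple pattern: the quantization type is inserted before the `.bin` suffix. Below is a sketch of that convention with the hypothetical helpers `quantized_name` and `quantize_model`. The naming is a habit taken from the example above, not something this README says the `quantize` tool enforces.

```bash
# quantized_name MODEL_PATH TYPE -> output path following the naming used above
# (hypothetical helper)
quantized_name() {
    echo "${1%.bin}-$2.bin"
}

# quantize_model MODEL_PATH TYPE -> runs the quantize tool from the example
# above and prints the path of the file it was asked to write
quantize_model() {
    out=$(quantized_name "$1" "$2")
    ./build/bin/quantize "$1" "$out" "$2" && echo "$out"
}
```

With these definitions, `quantize_model models/ggml-base.en.bin q5_0` would issue the same `quantize` command as the example above.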
## Core ML support
On Apple Silicon devices, the Encoder inference can be executed on the Apple Neural Engine (ANE) via Core ML. This can result in a significant
speed-up - more than 3x faster compared with CPU-only execution. Here are the instructions for generating a Core ML model and using it with `whisper.cpp`:
- Install Python dependencies needed for the creation of the Core ML model:
```bash
pip install ane_transformers
pip install openai-whisper
pip install coremltools
```
- To ensure `coremltools` operates correctly, please confirm that [Xcode](https://developer.apple.com/xcode/) is installed and execute `xcode-select --install` to install the command-line tools.
- Python 3.10 is recommended.
- macOS Sonoma (version 14) or newer is recommended, as older versions of macOS might experience issues with transcription hallucination.
- [OPTIONAL] It is recommended to utilize a Python version management system, such as [Miniconda](https://docs.conda.io/en/latest/miniconda.html) for this step:
  - To create an environment, use: `conda create -n py310-whisper python=3.10 -y`
  - To activate the environment, use: `conda activate py310-whisper`
- Generate a Core ML model. For example, to generate a `base.en` model, use:
```bash
./models/generate-coreml-model.sh base.en
```
This will generate the folder `models/ggml-base.en-encoder.mlmodelc`
- Build `whisper.cpp` with Core ML support:
```bash
# using CMake
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release
2023-04-15 13:30:07 +03:00
```
- Run the examples as usual. For example:
```text
$ ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav
...
whisper_init_state: loading Core ML model from 'models/ggml-base.en-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |
...
```
The first run on a device is slow, since the ANE service compiles the Core ML model to some device-specific format.
Subsequent runs are faster.

For more information about the Core ML implementation please refer to PR [#566](https://github.com/ggerganov/whisper.cpp/pull/566).

## OpenVINO support
On platforms that support [OpenVINO ](https://github.com/openvinotoolkit/openvino ), the Encoder inference can be executed
on OpenVINO-supported devices including x86 CPUs and Intel GPUs (integrated & discrete).
This can result in significant speedup in encoder performance. Here are the instructions for generating the OpenVINO model and using it with `whisper.cpp`:

- First, set up a Python virtual environment and install the Python dependencies. Python 3.10 is recommended.

Windows:

```powershell
cd models
python -m venv openvino_conv_env
openvino_conv_env\Scripts\activate
python -m pip install --upgrade pip
pip install -r requirements-openvino.txt
```

Linux and macOS:

```bash
cd models
python3 -m venv openvino_conv_env
source openvino_conv_env/bin/activate
python -m pip install --upgrade pip
pip install -r requirements-openvino.txt
```
- Generate an OpenVINO encoder model. For example, to generate a `base.en` model, use:
```
python convert-whisper-to-openvino.py --model base.en
```
This will produce the `ggml-base.en-encoder-openvino.xml/.bin` IR model files. It's recommended to relocate these to the same folder as the `ggml` models, as that
is the default location that the OpenVINO extension will search at runtime.
- Build `whisper.cpp` with OpenVINO support:

Download the OpenVINO package from the [release page](https://github.com/openvinotoolkit/openvino/releases). The recommended version to use is [2023.0.0](https://github.com/openvinotoolkit/openvino/releases/tag/2023.0.0).
After downloading and extracting the package onto your development system, set up the required environment by sourcing the `setupvars` script. For example:

Linux:
```bash
source /path/to/l_openvino_toolkit_ubuntu22_2023.0.0.10926.b4452d56304_x86_64/setupvars.sh
```
Windows (cmd):
```powershell
C:\Path\To\w_openvino_toolkit_windows_2023.0.0.10926.b4452d56304_x86_64\setupvars.bat
```

And then build the project using cmake:

```bash
cmake -B build -DWHISPER_OPENVINO=1
cmake --build build -j --config Release
```
- Run the examples as usual. For example:
```text
$ ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav
...
whisper_ctx_init_openvino_encoder: loading OpenVINO model from 'models/ggml-base.en-encoder-openvino.xml'
whisper_ctx_init_openvino_encoder: first run on a device may take a while ...
whisper_openvino_init: path_model = models/ggml-base.en-encoder-openvino.xml, device = GPU, cache_dir = models/ggml-base.en-encoder-openvino-cache
whisper_ctx_init_openvino_encoder: OpenVINO model loaded
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 1 |
...
```
The first run on an OpenVINO device is slow, since the OpenVINO framework will compile the IR (Intermediate Representation) model to a device-specific 'blob'. This device-specific blob will get
cached for the next run.

For more information about the OpenVINO implementation please refer to PR [#1037](https://github.com/ggerganov/whisper.cpp/pull/1037).
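
The relocation step recommended earlier in this section can be scripted. This is a minimal sketch, assuming the two IR files follow the `ggml-<model>-encoder-openvino.xml/.bin` naming shown above. `relocate_openvino_ir` is a hypothetical helper, not something shipped with `whisper.cpp`.

```bash
# relocate_openvino_ir SRC_DIR DST_DIR MODEL
# Moves ggml-MODEL-encoder-openvino.xml and .bin from SRC_DIR to DST_DIR.
# Fails without moving anything if either file is missing.
# (hypothetical helper; the file naming is taken from the text above)
relocate_openvino_ir() {
    base="ggml-$3-encoder-openvino"
    for ext in xml bin; do
        if [ ! -f "$1/$base.$ext" ]; then
            echo "missing: $1/$base.$ext" >&2
            return 1
        fi
    done
    mkdir -p "$2" && mv "$1/$base.xml" "$1/$base.bin" "$2/"
}
```

For example, `relocate_openvino_ir . ../models base.en` would move the `base.en` encoder files from the current directory into a sibling `models` folder.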
## NVIDIA GPU support

With NVIDIA cards the processing of the models is done efficiently on the GPU via cuBLAS and custom CUDA kernels.
First, make sure you have installed `cuda`: https://developer.nvidia.com/cuda-downloads

Now build `whisper.cpp` with CUDA support:

```
cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release
```

## Vulkan GPU support

Vulkan is a cross-vendor solution that allows you to accelerate the workload on your GPU.
First, make sure your graphics card driver provides support for the Vulkan API.

Now build `whisper.cpp` with Vulkan support:

```
cmake -B build -DGGML_VULKAN=1
cmake --build build -j --config Release
```

## BLAS CPU support via OpenBLAS

Encoder processing can be accelerated on the CPU via OpenBLAS.
First, make sure you have installed `openblas`: https://www.openblas.net/

Now build `whisper.cpp` with OpenBLAS support:

```
cmake -B build -DGGML_BLAS=1
cmake --build build -j --config Release
```

## Ascend NPU support

Ascend NPU provides inference acceleration via [`CANN`](https://www.hiascend.com/en/software/cann) and AI cores.

First, check if your Ascend NPU device is supported:

**Verified devices**

| Ascend NPU    | Status  |
|:-------------:|:-------:|
| Atlas 300T A2 | Support |

Then, make sure you have installed the [`CANN toolkit`](https://www.hiascend.com/en/software/cann/community). The latest version of CANN is recommended.

Now build `whisper.cpp` with CANN support:

```
cmake -B build -DGGML_CANN=1
cmake --build build -j --config Release
```

Run the inference examples as usual, for example:

```
./build/bin/whisper-cli -f samples/jfk.wav -m models/ggml-base.en.bin -t 8
```

*Notes:*

- If you have trouble with your Ascend NPU device, please create an issue with the **[CANN]** prefix/tag.
- If you run successfully with your Ascend NPU device, please help update the table `Verified devices`.
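
The accelerated builds described in the sections above differ only in the option passed to CMake at configure time. The sketch below collects those options in one place. `backend_flag` and `build_with` are hypothetical helpers, and the short backend names are chosen here purely for illustration.

```bash
# backend_flag NAME -> CMake option used in this README for that backend.
# The names on the left are illustrative; the options on the right are the
# ones shown in the build commands above.
backend_flag() {
    case "$1" in
        coreml)   echo "-DWHISPER_COREML=1" ;;
        openvino) echo "-DWHISPER_OPENVINO=1" ;;
        cuda)     echo "-DGGML_CUDA=1" ;;
        vulkan)   echo "-DGGML_VULKAN=1" ;;
        blas)     echo "-DGGML_BLAS=1" ;;
        cann)     echo "-DGGML_CANN=1" ;;
        cpu)      echo "" ;;
        *)        echo "unknown backend: $1" >&2; return 1 ;;
    esac
}

# build_with NAME -> configure and build in ./build with that backend
build_with() {
    flag=$(backend_flag "$1") || return 1
    # $flag is left unquoted on purpose so an empty value adds no argument
    cmake -B build $flag && cmake --build build -j --config Release
}
```

Calling `build_with vulkan` would then run the same two commands as the Vulkan section.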
## Installing with Conan
You can install pre-built binaries for whisper.cpp or build it from source using [Conan ](https://conan.io/ ). Use the following command:
```
conan install --requires="whisper-cpp/[*]" --build=missing
```
For detailed instructions on how to use Conan, please refer to the [Conan documentation ](https://docs.conan.io/2/ ).
## Limitations
- Inference only
## Real-time audio input example

This is a naive example of performing real-time inference on audio from your microphone.
The [stream](examples/stream) tool samples the audio every half a second and runs the transcription continuously.
More info is available in [issue #10](https://github.com/ggerganov/whisper.cpp/issues/10).

```bash
cmake -B build
cmake --build build --config Release
./build/bin/stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
```

https://user-images.githubusercontent.com/1991296/194935793-76afede7-cfa8-48d8-a80f-28ba83be7d09.mp4

## Confidence color-coding
Adding the `--print-colors` argument will print the transcribed text using an experimental color coding strategy
to highlight words with high or low confidence:

```bash
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/gb0.wav --print-colors
```

<img width="965" alt="image" src="https://user-images.githubusercontent.com/1991296/197356445-311c8643-9397-4e5e-b46e-0b4b4daa2530.png">

## Controlling the length of the generated text segments (experimental)

For example, to limit the line length to a maximum of 16 characters, simply add `-ml 16`:

```text
$ ./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16
whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:00.850] And so my
[00:00:00.850 --> 00:00:01.590] fellow
[00:00:01.590 --> 00:00:04.140] Americans, ask
[00:00:04.140 --> 00:00:05.660] not what your
[00:00:05.660 --> 00:00:06.840] country can do
[00:00:06.840 --> 00:00:08.430] for you, ask
[00:00:08.430 --> 00:00:09.440] what you can do
[00:00:09.440 --> 00:00:10.020] for your
[00:00:10.020 --> 00:00:11.000] country.
```
## Word-level timestamp (experimental)

The `--max-len` argument can be used to obtain word-level timestamps. Simply use `-ml 1`:

```text
$ ./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:00.320]
[00:00:00.320 --> 00:00:00.370] And
[00:00:00.370 --> 00:00:00.690] so
[00:00:00.690 --> 00:00:00.850] my
[00:00:00.850 --> 00:00:01.590] fellow
[00:00:01.590 --> 00:00:02.850] Americans
[00:00:02.850 --> 00:00:03.300] ,
[00:00:03.300 --> 00:00:04.140] ask
[00:00:04.140 --> 00:00:04.990] not
[00:00:04.990 --> 00:00:05.410] what
[00:00:05.410 --> 00:00:05.660] your
[00:00:05.660 --> 00:00:06.260] country
[00:00:06.260 --> 00:00:06.600] can
[00:00:06.600 --> 00:00:06.840] do
[00:00:06.840 --> 00:00:07.010] for
[00:00:07.010 --> 00:00:08.170] you
[00:00:08.170 --> 00:00:08.190] ,
[00:00:08.190 --> 00:00:08.430] ask
[00:00:08.430 --> 00:00:08.910] what
[00:00:08.910 --> 00:00:09.040] you
[00:00:09.040 --> 00:00:09.320] can
[00:00:09.320 --> 00:00:09.440] do
[00:00:09.440 --> 00:00:09.760] for
[00:00:09.760 --> 00:00:10.020] your
[00:00:10.020 --> 00:00:10.510] country
[00:00:10.510 --> 00:00:11.000] .
```
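
Because each output line carries a start and an end timestamp, per-word durations can be derived with a little post-processing. The following sketch assumes the `[HH:MM:SS.mmm --> HH:MM:SS.mmm] text` layout shown above, with exactly three millisecond digits. `ts_to_ms` and `word_durations` are illustrative names, not part of `whisper.cpp`.

```bash
# ts_to_ms HH:MM:SS.mmm -> the same instant in milliseconds
# (hypothetical helper; assumes exactly three millisecond digits)
ts_to_ms() {
    echo "$1" | awk -F'[:.]' '{ print (($1 * 60 + $2) * 60 + $3) * 1000 + $4 }'
}

# word_durations: reads transcript lines on stdin and prints
# "<duration in ms> <text>" for every line that starts with a timestamp pair.
# Lines in any other format are skipped.
word_durations() {
    sed -n 's/^\[\([0-9:.]*\) --> \([0-9:.]*\)] *\(.*\)$/\1 \2 \3/p' |
    while read -r start end text; do
        echo "$(( $(ts_to_ms "$end") - $(ts_to_ms "$start") )) $text"
    done
}
```

Piping the `-ml 1` output through `word_durations` would print one duration per token, which makes unusually long or short words easy to spot.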
## Speaker segmentation via tinydiarize (experimental)
More information about this approach is available here: https://github.com/ggerganov/whisper.cpp/pull/1058
Sample usage:
```text
# download a tinydiarize compatible model
./models/download-ggml-model.sh small.en-tdrz
# run as usual, adding the "-tdrz" command-line argument
./build/bin/whisper-cli -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -tdrz
...
main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...
...
[00:00:00.000 --> 00:00:03.800] Okay Houston, we've had a problem here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:06.200] This is Houston. Say again please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:08.260] Uh Houston we've had a problem.
[00:00:08.260 --> 00:00:11.320] We've had a main beam up on a volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:13.820] Roger main beam interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.100] Uh uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:18.020] So okay stand, by thirteen we're looking at it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:25.740] Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940] And we had a a pretty large bank or so.
```
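
The `[SPEAKER_TURN]` markers lend themselves to simple post-processing. Below is a sketch, assuming the output layout shown above. `count_turns` and `split_turns` are hypothetical helpers.

```bash
# count_turns: number of transcript lines on stdin that carry a turn marker
count_turns() {
    grep -c '\[SPEAKER_TURN\]'
}

# split_turns: prints the transcript text without timestamps and inserts a
# blank line wherever a [SPEAKER_TURN] marker ended the line
split_turns() {
    awk '{
        sub(/^\[[0-9:.]+ --> [0-9:.]+\] */, "")   # drop the timestamp pair
        turn = sub(/ *\[SPEAKER_TURN\] *$/, "")   # strip the marker, note if present
        print
        if (turn) print ""
    }'
}
```

The markers only indicate that the speaker changed, so this groups the text into turns without saying who is speaking.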
## Karaoke-style movie generation (experimental)

The [whisper-cli](examples/cli) example provides support for output of karaoke-style movies, where the
currently pronounced word is highlighted. Use the `-owts` argument and run the generated bash script.
This requires `ffmpeg` to be installed.

Here are a few _"typical"_ examples:

```bash
./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
source ./samples/jfk.wav.wts
ffplay ./samples/jfk.wav.mp4
```

https://user-images.githubusercontent.com/1991296/199337465-dbee4b5e-9aeb-48a3-b1c6-323ac4db5b2c.mp4

---

```bash
./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/mm0.wav -owts
source ./samples/mm0.wav.wts
ffplay ./samples/mm0.wav.mp4
```

https://user-images.githubusercontent.com/1991296/199337504-cc8fd233-0cb7-4920-95f9-4227de3570aa.mp4

---

```bash
./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/gb0.wav -owts
source ./samples/gb0.wav.wts
ffplay ./samples/gb0.wav.mp4
```

https://user-images.githubusercontent.com/1991296/199337538-b7b0c7a3-2753-4a88-a0cd-f28a317987ba.mp4

---

## Video comparison of different models

Use the [scripts/bench-wts.sh](https://github.com/ggerganov/whisper.cpp/blob/master/scripts/bench-wts.sh) script to generate a video in the following format:

```bash
./scripts/bench-wts.sh samples/jfk.wav
ffplay ./samples/jfk.wav.all.mp4
```

https://user-images.githubusercontent.com/1991296/223206245-2d36d903-cf8e-4f09-8c3b-eb9f9c39d6fc.mp4

---

## Benchmarks

In order to have an objective comparison of the performance of the inference across different system configurations,
use the [whisper-bench](examples/bench) tool. The tool simply runs the Encoder part of the model and prints how much time it
took to execute it. The results are summarized in the following GitHub issue:

[Benchmark results](https://github.com/ggerganov/whisper.cpp/issues/89)

Additionally, a script to run whisper.cpp with different models and audio files is provided: [bench.py](scripts/bench.py).

You can run it with the following command. By default, it will run against any standard model in the models folder.

```bash
python3 scripts/bench.py -f samples/jfk.wav -t 2,4,8 -p 1,2
```

It is written in Python with the intention of being easy to modify and extend for your benchmarking use case.

It outputs a CSV file with the results of the benchmarking.

## `ggml` format

The original models are converted to a custom binary format. This allows packing everything needed into a single file:

- model parameters
- mel filters
- vocabulary
- weights

You can download the converted models using the [models/download-ggml-model.sh](models/download-ggml-model.sh) script
or manually from here:

- https://huggingface.co/ggerganov/whisper.cpp
- https://ggml.ggerganov.com

For more details, see the conversion script [models/convert-pt-to-ggml.py](models/convert-pt-to-ggml.py) or [models/README.md](models/README.md).
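
A quick sanity check on a downloaded file is to look at its first four bytes. The sketch below assumes that the file begins with the 32-bit magic value `0x67676d6c` written in little-endian order (bytes `6c 6d 67 67`). That value is not stated in this README, so treat it, and the helper name `has_ggml_magic`, as assumptions to verify against the conversion script.

```bash
# has_ggml_magic FILE -> succeeds if FILE starts with the bytes 6c 6d 67 67.
# ASSUMPTION: this is the little-endian encoding of the magic 0x67676d6c;
# the value is not documented above, so check it against the conversion script.
has_ggml_magic() {
    [ "$(od -An -tx1 -N4 "$1" | tr -d ' \n')" = "6c6d6767" ]
}
```

For example, `has_ggml_magic models/ggml-base.en.bin || echo "unexpected header"` could flag a truncated or mislabelled download before it is passed to `whisper-cli`.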
## [Bindings](https://github.com/ggerganov/whisper.cpp/discussions/categories/bindings)

- [x] Rust: [tazz4843/whisper-rs](https://github.com/tazz4843/whisper-rs) | [#310](https://github.com/ggerganov/whisper.cpp/discussions/310)
- [x] JavaScript: [bindings/javascript](bindings/javascript) | [#309](https://github.com/ggerganov/whisper.cpp/discussions/309)
  - React Native (iOS / Android): [whisper.rn](https://github.com/mybigday/whisper.rn)
- [x] Go: [bindings/go](bindings/go) | [#312](https://github.com/ggerganov/whisper.cpp/discussions/312)
- [x] Java:
  - [GiviMAD/whisper-jni](https://github.com/GiviMAD/whisper-jni)
- [x] Ruby: [bindings/ruby](bindings/ruby) | [#507](https://github.com/ggerganov/whisper.cpp/discussions/507)
- [x] Objective-C / Swift: [ggerganov/whisper.spm](https://github.com/ggerganov/whisper.spm) | [#313](https://github.com/ggerganov/whisper.cpp/discussions/313)
  - [exPHAT/SwiftWhisper](https://github.com/exPHAT/SwiftWhisper)
- [x] .NET: | [#422](https://github.com/ggerganov/whisper.cpp/discussions/422)
  - [sandrohanea/whisper.net](https://github.com/sandrohanea/whisper.net)
  - [NickDarvey/whisper](https://github.com/NickDarvey/whisper)
- [x] Python: | [#9](https://github.com/ggerganov/whisper.cpp/issues/9)
  - [stlukey/whispercpp.py](https://github.com/stlukey/whispercpp.py) (Cython)
  - [AIWintermuteAI/whispercpp](https://github.com/AIWintermuteAI/whispercpp) (Updated fork of aarnphm/whispercpp)
  - [aarnphm/whispercpp](https://github.com/aarnphm/whispercpp) (Pybind11)
  - [abdeladim-s/pywhispercpp](https://github.com/abdeladim-s/pywhispercpp) (Pybind11)
- [x] R: [bnosac/audio.whisper](https://github.com/bnosac/audio.whisper)
- [x] Unity: [macoron/whisper.unity](https://github.com/Macoron/whisper.unity)

## Examples

There are various examples of using the library for different projects in the [examples](examples) folder.
Some of the examples are even ported to run in the browser using WebAssembly. Check them out!

| Example | Web | Description |
| --------------------------------------------------- | ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| [whisper-cli ](examples/cli ) | [whisper.wasm ](examples/whisper.wasm ) | Tool for translating and transcribing audio using Whisper |
| [whisper-bench ](examples/bench ) | [bench.wasm ](examples/bench.wasm ) | Benchmark the performance of Whisper on your machine |
| [whisper-stream ](examples/stream ) | [stream.wasm ](examples/stream.wasm ) | Real-time transcription of raw microphone capture |
| [whisper-command ](examples/command ) | [command.wasm ](examples/command.wasm ) | Basic voice assistant example for receiving voice commands from the mic |
| [whisper-server ](examples/server ) | | HTTP transcription server with OAI-like API |
| [whisper-talk-llama ](examples/talk-llama ) | | Talk with a LLaMA bot |
| [whisper.objc ](examples/whisper.objc ) | | iOS mobile application using whisper.cpp |
| [whisper.swiftui ](examples/whisper.swiftui ) | | SwiftUI iOS / macOS application using whisper.cpp |
| [whisper.android ](examples/whisper.android ) | | Android mobile application using whisper.cpp |
| [whisper.nvim ](examples/whisper.nvim ) | | Speech-to-text plugin for Neovim |
| [generate-karaoke.sh ](examples/generate-karaoke.sh ) | | Helper script to easily [generate a karaoke video ](https://youtu.be/uj7hVta4blM ) of raw audio capture |
| [livestream.sh ](examples/livestream.sh ) | | [Livestream audio transcription ](https://github.com/ggerganov/whisper.cpp/issues/185 ) |
| [yt-wsp.sh ](examples/yt-wsp.sh ) | | Download + transcribe and/or translate any VOD [(original) ](https://gist.github.com/DaniruKun/96f763ec1a037cc92fe1a059b643b818 ) |
| [wchess ](examples/wchess ) | [wchess.wasm ](examples/wchess ) | Voice-controlled chess |
## [Discussions](https://github.com/ggerganov/whisper.cpp/discussions)

If you have any kind of feedback about this project, feel free to use the Discussions section and open a new topic.
You can use the [Show and tell](https://github.com/ggerganov/whisper.cpp/discussions/categories/show-and-tell) category
to share your own projects that use `whisper.cpp`. If you have a question, make sure to check the
[Frequently asked questions (#126)](https://github.com/ggerganov/whisper.cpp/discussions/126) discussion.