# whisper.cpp

![whisper.cpp](https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg)

[![Actions Status](https://github.com/ggerganov/whisper.cpp/workflows/CI/badge.svg)](https://github.com/ggerganov/whisper.cpp/actions)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Conan Center](https://shields.io/conan/v/whisper-cpp)](https://conan.io/center/whisper-cpp)
[![npm](https://img.shields.io/npm/v/whisper.cpp.svg)](https://www.npmjs.com/package/whisper.cpp/)

Stable: [v1.7.3](https://github.com/ggerganov/whisper.cpp/releases/tag/v1.7.3) / [Roadmap | F.A.Q.](https://github.com/ggerganov/whisper.cpp/discussions/126)

High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisper) automatic speech recognition (ASR) model:

- Plain C/C++ implementation without dependencies
- Apple Silicon first-class citizen - optimized via ARM NEON, Accelerate framework, Metal and [Core ML](#core-ml-support)
- AVX intrinsics support for x86 architectures
- VSX intrinsics support for POWER architectures
- Mixed F16 / F32 precision
- [Integer quantization support](#quantization)
- Zero memory allocations at runtime
- [Vulkan support](#vulkan-gpu-support)
- Support for CPU-only inference
- [Efficient GPU support for NVIDIA](#nvidia-gpu-support)
- [OpenVINO support](#openvino-support)
- [Ascend NPU support](#ascend-npu-support)
- [C-style API](https://github.com/ggerganov/whisper.cpp/blob/master/include/whisper.h)

Supported platforms:

- [x] Mac OS (Intel and Arm)
- [x] [iOS](examples/whisper.objc)
- [x] [Android](examples/whisper.android)
- [x] [Java](bindings/java/README.md)
- [x] Linux / [FreeBSD](https://github.com/ggerganov/whisper.cpp/issues/56#issuecomment-1350920264)
- [x] [WebAssembly](examples/whisper.wasm)
- [x] Windows ([MSVC](https://github.com/ggerganov/whisper.cpp/blob/master/.github/workflows/build.yml#L117-L144) and [MinGW](https://github.com/ggerganov/whisper.cpp/issues/168))
- [x] [Raspberry Pi](https://github.com/ggerganov/whisper.cpp/discussions/166)
- [x] [Docker](https://github.com/ggerganov/whisper.cpp/pkgs/container/whisper.cpp)
The entire high-level implementation of the model is contained in [whisper.h](include/whisper.h) and [whisper.cpp](src/whisper.cpp).
The rest of the code is part of the [`ggml`](https://github.com/ggerganov/ggml) machine learning library.
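
For programmatic use, the same functionality is available through the C-style API in [whisper.h](include/whisper.h). Below is a minimal, illustrative sketch of a transcription loop in C++ - it assumes the audio has already been decoded into 16 kHz mono `float` samples and it omits error handling; see [examples/cli](examples/cli) for a complete program.

```cpp
// minimal sketch of the C-style API from whisper.h (illustrative only)
#include "whisper.h"

#include <cstdio>
#include <vector>

int main() {
    // load a ggml model (example path)
    struct whisper_context * ctx = whisper_init_from_file_with_params(
            "models/ggml-base.en.bin", whisper_context_default_params());
    if (!ctx) {
        return 1;
    }

    // fill this with 16 kHz, mono, float32 PCM from your audio source
    std::vector<float> pcmf32;

    // run the full encoder + decoder pipeline with greedy sampling
    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    if (whisper_full(ctx, params, pcmf32.data(), (int) pcmf32.size()) == 0) {
        const int n_segments = whisper_full_n_segments(ctx);
        for (int i = 0; i < n_segments; ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    whisper_free(ctx);
    return 0;
}
```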

Having such a lightweight implementation of the model makes it easy to integrate into different platforms and applications.
As an example, here is a video of running the model on an iPhone 13 device - fully offline, on-device: [whisper.objc](examples/whisper.objc)

https://user-images.githubusercontent.com/1991296/197385372-962a6dea-bca1-4d50-bf96-1d8c27b98c81.mp4

You can also easily make your own offline voice assistant application: [command](examples/command)

https://user-images.githubusercontent.com/1991296/204038393-2f846eae-c255-4099-a76d-5735c25c49da.mp4

On Apple Silicon, the inference runs fully on the GPU via Metal:

https://github.com/ggerganov/whisper.cpp/assets/1991296/c82e8f86-60dc-49f2-b048-d2fdbd6b5225
## Quick start

First clone the repository:

```bash
git clone https://github.com/ggerganov/whisper.cpp.git
```

Navigate into the directory:

```bash
cd whisper.cpp
```

Then, download one of the Whisper [models](models/README.md) converted to [`ggml` format](#ggml-format). For example:

```bash
sh ./models/download-ggml-model.sh base.en
```

Now build the [whisper-cli](examples/cli) example and transcribe an audio file like this:

```bash
# build the project
cmake -B build
cmake --build build --config Release

# transcribe an audio file
./build/bin/whisper-cli -f samples/jfk.wav
```

---

For a quick demo, simply run `make base.en`.

The command downloads the `base.en` model converted to the custom `ggml` format and runs the inference on all `.wav` samples in the folder `samples`.

For detailed usage instructions, run: `./build/bin/whisper-cli -h`

Note that the [whisper-cli](examples/cli) example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool.
For example, you can use `ffmpeg` like this:

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```
## More audio samples

If you want some extra audio samples to play with, simply run:

```bash
make -j samples
```

This will download a few more audio files from Wikipedia and convert them to 16-bit WAV format via `ffmpeg`.

You can download and run the other models as follows:

```bash
make -j tiny.en
make -j tiny
make -j base.en
make -j base
make -j small.en
make -j small
make -j medium.en
make -j medium
make -j large-v1
make -j large-v2
make -j large-v3
make -j large-v3-turbo
```
## Memory usage

| Model  | Disk    | Mem     |
| ------ | ------- | ------- |
| tiny   | 75 MiB  | ~273 MB |
| base   | 142 MiB | ~388 MB |
| small  | 466 MiB | ~852 MB |
| medium | 1.5 GiB | ~2.1 GB |
| large  | 2.9 GiB | ~3.9 GB |
## Quantization

`whisper.cpp` supports integer quantization of the Whisper `ggml` models.
Quantized models require less memory and disk space and, depending on the hardware, can be processed more efficiently.

Here are the steps for creating and using a quantized model:

```bash
# quantize a model with Q5_0 method
cmake -B build
cmake --build build --config Release
./build/bin/quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# run the examples as usual, specifying the quantized model file
./build/bin/whisper-cli -m models/ggml-base.en-q5_0.bin ./samples/gb0.wav
```
## Core ML support

On Apple Silicon devices, the Encoder inference can be executed on the Apple Neural Engine (ANE) via Core ML. This can result in a significant
speed-up - more than 3x faster compared with CPU-only execution. Here are the instructions for generating a Core ML model and using it with `whisper.cpp`:

- Install Python dependencies needed for the creation of the Core ML model:

```bash
pip install ane_transformers
pip install openai-whisper
pip install coremltools
```

- To ensure `coremltools` operates correctly, please confirm that [Xcode](https://developer.apple.com/xcode/) is installed and execute `xcode-select --install` to install the command-line tools.
- Python 3.10 is recommended.
- macOS Sonoma (version 14) or newer is recommended, as older versions of macOS might experience issues with transcription hallucination.
- [OPTIONAL] It is recommended to utilize a Python version management system, such as [Miniconda](https://docs.conda.io/en/latest/miniconda.html), for this step:
  - To create an environment, use: `conda create -n py310-whisper python=3.10 -y`
  - To activate the environment, use: `conda activate py310-whisper`

- Generate a Core ML model. For example, to generate a `base.en` model, use:

```bash
./models/generate-coreml-model.sh base.en
```

This will generate the folder `models/ggml-base.en-encoder.mlmodelc`.

- Build `whisper.cpp` with Core ML support:

```bash
# using CMake
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release
```

- Run the examples as usual. For example:

```text
$ ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav

...

whisper_init_state: loading Core ML model from 'models/ggml-base.en-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |

...
```

The first run on a device is slow, since the ANE service compiles the Core ML model to some device-specific format.
Next runs are faster.

For more information about the Core ML implementation please refer to PR [#566](https://github.com/ggerganov/whisper.cpp/pull/566).
## OpenVINO support

On platforms that support [OpenVINO](https://github.com/openvinotoolkit/openvino), the Encoder inference can be executed
on OpenVINO-supported devices including x86 CPUs and Intel GPUs (integrated & discrete).

This can result in a significant speedup in encoder performance. Here are the instructions for generating the OpenVINO model and using it with `whisper.cpp`:

- First, set up a Python virtual environment and install the Python dependencies. Python 3.10 is recommended.

Windows:

```powershell
cd models
python -m venv openvino_conv_env
openvino_conv_env\Scripts\activate
python -m pip install --upgrade pip
pip install -r requirements-openvino.txt
```

Linux and macOS:

```bash
cd models
python3 -m venv openvino_conv_env
source openvino_conv_env/bin/activate
python -m pip install --upgrade pip
pip install -r requirements-openvino.txt
```

- Generate an OpenVINO encoder model. For example, to generate a `base.en` model, use:

```bash
python convert-whisper-to-openvino.py --model base.en
```

This will produce `ggml-base.en-encoder-openvino.xml`/`.bin` IR model files. It is recommended to place these in the same folder as the `ggml` models, as that
is the default location that the OpenVINO extension will search at runtime.

- Build `whisper.cpp` with OpenVINO support:

Download the OpenVINO package from the [release page](https://github.com/openvinotoolkit/openvino/releases). The recommended version to use is [2023.0.0](https://github.com/openvinotoolkit/openvino/releases/tag/2023.0.0).

After downloading and extracting the package on your development system, set up the required environment by sourcing the setupvars script. For example:

Linux:

```bash
source /path/to/l_openvino_toolkit_ubuntu22_2023.0.0.10926.b4452d56304_x86_64/setupvars.sh
```

Windows (cmd):

```powershell
C:\Path\To\w_openvino_toolkit_windows_2023.0.0.10926.b4452d56304_x86_64\setupvars.bat
```

And then build the project using cmake:

```bash
cmake -B build -DWHISPER_OPENVINO=1
cmake --build build -j --config Release
```

- Run the examples as usual. For example:

```text
$ ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav

...

whisper_ctx_init_openvino_encoder: loading OpenVINO model from 'models/ggml-base.en-encoder-openvino.xml'
whisper_ctx_init_openvino_encoder: first run on a device may take a while ...
whisper_openvino_init: path_model = models/ggml-base.en-encoder-openvino.xml, device = GPU, cache_dir = models/ggml-base.en-encoder-openvino-cache
whisper_ctx_init_openvino_encoder: OpenVINO model loaded

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 1 |

...
```

The first run on an OpenVINO device is slow, since the OpenVINO framework compiles the IR (Intermediate Representation) model to a device-specific 'blob'. This device-specific blob gets
cached for the next run.

For more information about the OpenVINO implementation please refer to PR [#1037](https://github.com/ggerganov/whisper.cpp/pull/1037).
## NVIDIA GPU support

With NVIDIA cards the processing of the models is done efficiently on the GPU via cuBLAS and custom CUDA kernels.
First, make sure you have installed `cuda`: https://developer.nvidia.com/cuda-downloads

Now build `whisper.cpp` with CUDA support:

```bash
cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release
```
## Vulkan GPU support

Vulkan is a cross-vendor solution that allows you to accelerate the workload on your GPU.
First, make sure your graphics card driver provides support for the Vulkan API.

Now build `whisper.cpp` with Vulkan support:

```bash
cmake -B build -DGGML_VULKAN=1
cmake --build build -j --config Release
```
## BLAS CPU support via OpenBLAS

Encoder processing can be accelerated on the CPU via OpenBLAS.
First, make sure you have installed `openblas`: https://www.openblas.net/

Now build `whisper.cpp` with OpenBLAS support:

```bash
cmake -B build -DGGML_BLAS=1
cmake --build build -j --config Release
```
## Ascend NPU support

Ascend NPU provides inference acceleration via [`CANN`](https://www.hiascend.com/en/software/cann) and AI cores.

First, check if your Ascend NPU device is supported:

**Verified devices**

| Ascend NPU    | Status  |
|:-------------:|:-------:|
| Atlas 300T A2 | Support |

Then, make sure you have installed the [`CANN toolkit`](https://www.hiascend.com/en/software/cann/community). The latest version of CANN is recommended.

Now build `whisper.cpp` with CANN support:

```bash
cmake -B build -DGGML_CANN=1
cmake --build build -j --config Release
```

Run the inference examples as usual, for example:

```bash
./build/bin/whisper-cli -f samples/jfk.wav -m models/ggml-base.en.bin -t 8
```

*Notes:*

- If you have trouble with the Ascend NPU device, please create an issue with the **[CANN]** prefix/tag.
- If you run successfully with your Ascend NPU device, please help update the `Verified devices` table.
## Installing with Conan

You can install pre-built binaries for whisper.cpp or build it from source using [Conan](https://conan.io/). Use the following command:

```bash
conan install --requires="whisper-cpp/[*]" --build=missing
```

For detailed instructions on how to use Conan, please refer to the [Conan documentation](https://docs.conan.io/2/).
## Limitations

- Inference only
## Real-time audio input example

This is a naive example of performing real-time inference on audio from your microphone.
The [stream](examples/stream) tool samples the audio every half a second and runs the transcription continuously.
More info is available in [issue #10](https://github.com/ggerganov/whisper.cpp/issues/10).
Note that this example requires SDL2, so the project must be built with the `WHISPER_SDL2` option enabled.

```bash
cmake -B build -DWHISPER_SDL2=ON
cmake --build build --config Release
./build/bin/whisper-stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
```

https://user-images.githubusercontent.com/1991296/194935793-76afede7-cfa8-48d8-a80f-28ba83be7d09.mp4
## Confidence color-coding

Adding the `--print-colors` argument will print the transcribed text using an experimental color coding strategy
to highlight words with high or low confidence:

```bash
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/gb0.wav --print-colors
```

<img width="965" alt="image" src="https://user-images.githubusercontent.com/1991296/197356445-311c8643-9397-4e5e-b46e-0b4b4daa2530.png">
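
Per-token probabilities are also accessible programmatically through the C API. A minimal sketch, assuming `ctx` is a `whisper_context` on which `whisper_full()` has already completed:

```cpp
#include "whisper.h"

#include <cstdio>

// print each decoded token together with its probability (confidence)
void print_token_confidences(struct whisper_context * ctx) {
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
            const char * text = whisper_full_get_token_text(ctx, i, j);
            const float  p    = whisper_full_get_token_p(ctx, i, j);
            printf("%-16s p = %.3f\n", text, p);
        }
    }
}
```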
## Controlling the length of the generated text segments (experimental)

For example, to limit the line length to a maximum of 16 characters, simply add `-ml 16`:

```text
$ ./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16

whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:00.850] And so my
[00:00:00.850 --> 00:00:01.590] fellow
[00:00:01.590 --> 00:00:04.140] Americans, ask
[00:00:04.140 --> 00:00:05.660] not what your
[00:00:05.660 --> 00:00:06.840] country can do
[00:00:06.840 --> 00:00:08.430] for you, ask
[00:00:08.430 --> 00:00:09.440] what you can do
[00:00:09.440 --> 00:00:10.020] for your
[00:00:10.020 --> 00:00:11.000] country.
```
## Word-level timestamp (experimental)

The `--max-len` argument can be used to obtain word-level timestamps. Simply use `-ml 1`:

```text
$ ./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1

whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:00.320]
[00:00:00.320 --> 00:00:00.370] And
[00:00:00.370 --> 00:00:00.690] so
[00:00:00.690 --> 00:00:00.850] my
[00:00:00.850 --> 00:00:01.590] fellow
[00:00:01.590 --> 00:00:02.850] Americans
[00:00:02.850 --> 00:00:03.300] ,
[00:00:03.300 --> 00:00:04.140] ask
[00:00:04.140 --> 00:00:04.990] not
[00:00:04.990 --> 00:00:05.410] what
[00:00:05.410 --> 00:00:05.660] your
[00:00:05.660 --> 00:00:06.260] country
[00:00:06.260 --> 00:00:06.600] can
[00:00:06.600 --> 00:00:06.840] do
[00:00:06.840 --> 00:00:07.010] for
[00:00:07.010 --> 00:00:08.170] you
[00:00:08.170 --> 00:00:08.190] ,
[00:00:08.190 --> 00:00:08.430] ask
[00:00:08.430 --> 00:00:08.910] what
[00:00:08.910 --> 00:00:09.040] you
[00:00:09.040 --> 00:00:09.320] can
[00:00:09.320 --> 00:00:09.440] do
[00:00:09.440 --> 00:00:09.760] for
[00:00:09.760 --> 00:00:10.020] your
[00:00:10.020 --> 00:00:10.510] country
[00:00:10.510 --> 00:00:11.000] .
```
## Speaker segmentation via tinydiarize (experimental)

More information about this approach is available here: https://github.com/ggerganov/whisper.cpp/pull/1058

Sample usage:

```bash
# download a tinydiarize compatible model
./models/download-ggml-model.sh small.en-tdrz

# run as usual, adding the "-tdrz" command-line argument
./build/bin/whisper-cli -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -tdrz
...
main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...
...
[00:00:00.000 --> 00:00:03.800] Okay Houston, we've had a problem here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:06.200] This is Houston. Say again please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:08.260] Uh Houston we've had a problem.
[00:00:08.260 --> 00:00:11.320] We've had a main beam up on a volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:13.820] Roger main beam interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.100] Uh uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:18.020] So okay stand, by thirteen we're looking at it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:25.740] Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940] And we had a a pretty large bank or so.
```
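
The speaker-turn information is also exposed through the C API. A minimal sketch, assuming a `whisper_context` on which `whisper_full()` has already run with `params.tdrz_enable = true` and a tinydiarize-compatible model:

```cpp
#include "whisper.h"

#include <cstdio>

// print segments and mark predicted speaker turns
void print_segments_with_speaker_turns(struct whisper_context * ctx) {
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        printf("%s", whisper_full_get_segment_text(ctx, i));
        if (whisper_full_get_segment_speaker_turn_next(ctx, i)) {
            printf(" [SPEAKER_TURN]");
        }
        printf("\n");
    }
}
```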
## Karaoke-style movie generation (experimental)

The [whisper-cli](examples/cli) example provides support for output of karaoke-style movies, where the
currently pronounced word is highlighted. Use the `-wts` argument and run the generated bash script.
This requires `ffmpeg` to be installed.

Here are a few _"typical"_ examples:

```bash
./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
source ./samples/jfk.wav.wts
ffplay ./samples/jfk.wav.mp4
```

https://user-images.githubusercontent.com/1991296/199337465-dbee4b5e-9aeb-48a3-b1c6-323ac4db5b2c.mp4

---

```bash
./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/mm0.wav -owts
source ./samples/mm0.wav.wts
ffplay ./samples/mm0.wav.mp4
```

https://user-images.githubusercontent.com/1991296/199337504-cc8fd233-0cb7-4920-95f9-4227de3570aa.mp4

---

```bash
./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/gb0.wav -owts
source ./samples/gb0.wav.wts
ffplay ./samples/gb0.wav.mp4
```

https://user-images.githubusercontent.com/1991296/199337538-b7b0c7a3-2753-4a88-a0cd-f28a317987ba.mp4

---
## Video comparison of different models

Use the [scripts/bench-wts.sh](https://github.com/ggerganov/whisper.cpp/blob/master/scripts/bench-wts.sh) script to generate a video in the following format:

```bash
./scripts/bench-wts.sh samples/jfk.wav
ffplay ./samples/jfk.wav.all.mp4
```

https://user-images.githubusercontent.com/1991296/223206245-2d36d903-cf8e-4f09-8c3b-eb9f9c39d6fc.mp4

---
## Benchmarks

In order to have an objective comparison of the performance of the inference across different system configurations,
use the [whisper-bench](examples/bench) tool. The tool simply runs the Encoder part of the model and prints how much time it
took to execute it. The results are summarized in the following GitHub issue:

[Benchmark results](https://github.com/ggerganov/whisper.cpp/issues/89)

Additionally, a script to run whisper.cpp with different models and audio files is provided: [bench.py](scripts/bench.py).

You can run it with the following command; by default it will run against any standard model in the `models` folder.

```bash
python3 scripts/bench.py -f samples/jfk.wav -t 2,4,8 -p 1,2
```

It is written in Python with the intention of being easy to modify and extend for your benchmarking use case.

It outputs a CSV file with the results of the benchmarking.
## `ggml` format

The original models are converted to a custom binary format. This allows packing everything needed into a single file (see the sketch after this list):

- model parameters
- mel filters
- vocabulary
- weights
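
As a rough illustration of the layout, a converted model file starts with a 32-bit magic value (the conversion script writes `0x67676d6c`, i.e. "ggml"), followed by the hyperparameters, mel filters, vocabulary and weights. Below is a minimal sketch that only checks the magic - the authoritative header layout is whatever [models/convert-pt-to-ggml.py](models/convert-pt-to-ggml.py) writes:

```cpp
#include <cstdint>
#include <cstdio>

// check whether a file starts with the ggml magic used by the converted Whisper models
// (sketch only - see models/convert-pt-to-ggml.py for the full header layout)
int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <ggml-model.bin>\n", argv[0]);
        return 1;
    }

    FILE * f = fopen(argv[1], "rb");
    if (!f) {
        fprintf(stderr, "failed to open '%s'\n", argv[1]);
        return 1;
    }

    uint32_t magic = 0;
    const size_t n = fread(&magic, sizeof(magic), 1, f);
    fclose(f);

    printf("magic = 0x%08x -> %s\n", magic,
           (n == 1 && magic == 0x67676d6c) ? "ggml model" : "not a ggml model");
    return 0;
}
```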

You can download the converted models using the [models/download-ggml-model.sh](models/download-ggml-model.sh) script
or manually from here:

- https://huggingface.co/ggerganov/whisper.cpp
- https://ggml.ggerganov.com

For more details, see the conversion script [models/convert-pt-to-ggml.py](models/convert-pt-to-ggml.py) or [models/README.md](models/README.md).
## [Bindings](https://github.com/ggerganov/whisper.cpp/discussions/categories/bindings)

- [x] Rust: [tazz4843/whisper-rs](https://github.com/tazz4843/whisper-rs) | [#310](https://github.com/ggerganov/whisper.cpp/discussions/310)
- [x] JavaScript: [bindings/javascript](bindings/javascript) | [#309](https://github.com/ggerganov/whisper.cpp/discussions/309)
  - React Native (iOS / Android): [whisper.rn](https://github.com/mybigday/whisper.rn)
- [x] Go: [bindings/go](bindings/go) | [#312](https://github.com/ggerganov/whisper.cpp/discussions/312)
- [x] Java:
  - [GiviMAD/whisper-jni](https://github.com/GiviMAD/whisper-jni)
- [x] Ruby: [bindings/ruby](bindings/ruby) | [#507](https://github.com/ggerganov/whisper.cpp/discussions/507)
- [x] Objective-C / Swift: [ggerganov/whisper.spm](https://github.com/ggerganov/whisper.spm) | [#313](https://github.com/ggerganov/whisper.cpp/discussions/313)
  - [exPHAT/SwiftWhisper](https://github.com/exPHAT/SwiftWhisper)
- [x] .NET: | [#422](https://github.com/ggerganov/whisper.cpp/discussions/422)
  - [sandrohanea/whisper.net](https://github.com/sandrohanea/whisper.net)
  - [NickDarvey/whisper](https://github.com/NickDarvey/whisper)
- [x] Python: | [#9](https://github.com/ggerganov/whisper.cpp/issues/9)
  - [stlukey/whispercpp.py](https://github.com/stlukey/whispercpp.py) (Cython)
  - [AIWintermuteAI/whispercpp](https://github.com/AIWintermuteAI/whispercpp) (Updated fork of aarnphm/whispercpp)
  - [aarnphm/whispercpp](https://github.com/aarnphm/whispercpp) (Pybind11)
  - [abdeladim-s/pywhispercpp](https://github.com/abdeladim-s/pywhispercpp) (Pybind11)
- [x] R: [bnosac/audio.whisper](https://github.com/bnosac/audio.whisper)
- [x] Unity: [macoron/whisper.unity](https://github.com/Macoron/whisper.unity)
## Examples

There are various examples of using the library for different projects in the [examples](examples) folder.
Some of the examples are even ported to run in the browser using WebAssembly. Check them out!

| Example | Web | Description |
| --- | --- | --- |
| [whisper-cli](examples/cli) | [whisper.wasm](examples/whisper.wasm) | Tool for translating and transcribing audio using Whisper |
| [whisper-bench](examples/bench) | [bench.wasm](examples/bench.wasm) | Benchmark the performance of Whisper on your machine |
| [whisper-stream](examples/stream) | [stream.wasm](examples/stream.wasm) | Real-time transcription of raw microphone capture |
| [whisper-command](examples/command) | [command.wasm](examples/command.wasm) | Basic voice assistant example for receiving voice commands from the mic |
| [whisper-server](examples/server) | | HTTP transcription server with OAI-like API |
| [whisper-talk-llama](examples/talk-llama) | | Talk with a LLaMA bot |
| [whisper.objc](examples/whisper.objc) | | iOS mobile application using whisper.cpp |
| [whisper.swiftui](examples/whisper.swiftui) | | SwiftUI iOS / macOS application using whisper.cpp |
| [whisper.android](examples/whisper.android) | | Android mobile application using whisper.cpp |
| [whisper.nvim](examples/whisper.nvim) | | Speech-to-text plugin for Neovim |
| [generate-karaoke.sh](examples/generate-karaoke.sh) | | Helper script to easily [generate a karaoke video](https://youtu.be/uj7hVta4blM) of raw audio capture |
| [livestream.sh](examples/livestream.sh) | | [Livestream audio transcription](https://github.com/ggerganov/whisper.cpp/issues/185) |
| [yt-wsp.sh](examples/yt-wsp.sh) | | Download + transcribe and/or translate any VOD [(original)](https://gist.github.com/DaniruKun/96f763ec1a037cc92fe1a059b643b818) |
| [wchess](examples/wchess) | [wchess.wasm](examples/wchess) | Voice-controlled chess |
## [Discussions](https://github.com/ggerganov/whisper.cpp/discussions)

If you have any kind of feedback about this project, feel free to use the Discussions section and open a new topic.
You can use the [Show and tell](https://github.com/ggerganov/whisper.cpp/discussions/categories/show-and-tell) category
to share your own projects that use `whisper.cpp`. If you have a question, make sure to check the
[Frequently asked questions (#126)](https://github.com/ggerganov/whisper.cpp/discussions/126) discussion.