This commit adds an example that demonstrates how to use a VAD (Voice Activity Detection) model to segment an audio file into speech segments. Resolves: https://github.com/ggml-org/whisper.cpp/issues/3144
whisper.cpp/examples/vad-speech-segments
This example demonstrates how to use a VAD (Voice Activity Detection) model to segment an audio file into speech segments.
Building the example
The example can be built using the following commands:
cmake -S . -B build
cmake --build build -j8 --target vad-speech-segments
Running the example
The example can be run using the following command, which uses a model that we use internally for testing:
./build/bin/vad-speech-segments \
--vad-model models/for-tests-silero-v5.1.2-ggml.bin \
--file samples/jfk.wav \
--no-prints
Detected 5 speech segments:
Speech segment 0: start = 0.29, end = 2.21
Speech segment 1: start = 3.30, end = 3.77
Speech segment 2: start = 4.00, end = 4.35
Speech segment 3: start = 5.38, end = 7.65
Speech segment 4: start = 8.16, end = 10.59
To see more output from whisper.cpp, remove the --no-prints argument.
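If the segments need to be consumed by another tool, the printed lines can be parsed with a few lines of Python. The following is a minimal sketch, not part of whisper.cpp: it assumes the output format shown above and that the start and end values are in seconds. The names `parse_segments` and `total_speech` are hypothetical helpers introduced only for illustration.

```python
import re

# Matches lines such as "Speech segment 0: start = 0.29, end = 2.21".
SEGMENT_RE = re.compile(
    r"Speech segment (\d+): start = ([0-9.]+), end = ([0-9.]+)"
)

# Output copied from the example run above.
SAMPLE_OUTPUT = """\
Detected 5 speech segments:
Speech segment 0: start = 0.29, end = 2.21
Speech segment 1: start = 3.30, end = 3.77
Speech segment 2: start = 4.00, end = 4.35
Speech segment 3: start = 5.38, end = 7.65
Speech segment 4: start = 8.16, end = 10.59
"""


def parse_segments(output):
    """Return (start, end) pairs in the order they appear in the output."""
    segments = []
    for line in output.splitlines():
        match = SEGMENT_RE.search(line)
        if match:
            segments.append((float(match.group(2)), float(match.group(3))))
    return segments


def total_speech(segments):
    """Sum the durations of all segments (assumed to be in seconds)."""
    return sum(end - start for start, end in segments)


if __name__ == "__main__":
    segs = parse_segments(SAMPLE_OUTPUT)
    for index, (start, end) in enumerate(segs):
        print(f"segment {index}: {end - start:.2f} long")
    print(f"total speech: {total_speech(segs):.2f}")
```

In practice the text would come from the standard output of `vad-speech-segments` rather than from a hard-coded string; the header line and any other non-matching lines are simply skipped.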
Command line options
./build/bin/vad-speech-segments --help
usage: ./build/bin/vad-speech-segments [options] file
supported audio formats: flac, mp3, ogg, wav
options:
-h, --help [default] show this help message and exit
-f FNAME, --file FNAME [ ] input audio file path
-t N, --threads N [4 ] number of threads to use during computation
-ug, --use-gpu [true ] use GPU
-vm FNAME, --vad-model FNAME [ ] VAD model path
-vt N, --vad-threshold N [0.50 ] VAD threshold for speech recognition
-vspd N, --vad-min-speech-duration-ms N [250 ] VAD min speech duration (0.0-1.0)
-vsd N, --vad-min-silence-duration-ms N [100 ] VAD min silence duration (to split segments)
-vmsd N, --vad-max-speech-duration-s N [FLT_MAX] VAD max speech duration (auto-split longer)
-vp N, --vad-speech-pad-ms N [30 ] VAD speech padding (extend segments)
-vo N, --vad-samples-overlap N [0.10 ] VAD samples overlap (seconds between segments)
-np, --no-prints [false ] do not print anything other than the results
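To build some intuition for how the threshold, minimum speech duration, minimum silence duration, and padding options interact, here is a simplified Python sketch. It is not whisper.cpp's implementation. It assumes the VAD model yields one speech probability per fixed-length frame (the 32 ms frame length is an assumption), and it ignores --vad-max-speech-duration-s and --vad-samples-overlap. The function name `segments_from_probs` and its parameters are hypothetical.

```python
def segments_from_probs(
    probs,               # per-frame speech probabilities in [0, 1]
    frame_ms=32,         # assumed frame length, not taken from whisper.cpp
    threshold=0.5,       # mirrors the --vad-threshold default
    min_speech_ms=250,   # mirrors the --vad-min-speech-duration-ms default
    min_silence_ms=100,  # mirrors the --vad-min-silence-duration-ms default
    pad_ms=30,           # mirrors the --vad-speech-pad-ms default
):
    """Turn frame probabilities into padded (start_ms, end_ms) segments."""
    # 1. Threshold each frame and collect raw runs of consecutive speech frames.
    runs = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            runs.append([start * frame_ms, i * frame_ms])
            start = None
    if start is not None:
        runs.append([start * frame_ms, len(probs) * frame_ms])

    # 2. Merge runs separated by a silence shorter than min_silence_ms.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_silence_ms:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Drop segments shorter than min_speech_ms.
    kept = [r for r in merged if r[1] - r[0] >= min_speech_ms]

    # 4. Extend each segment by pad_ms on both sides, clamped to the audio.
    #    Overlap between padded neighbours is not handled in this sketch.
    total_ms = len(probs) * frame_ms
    return [(max(0, s - pad_ms), min(total_ms, e + pad_ms)) for s, e in kept]


if __name__ == "__main__":
    # Two speech bursts split by a 64 ms pause, then a 96 ms blip.
    probs = [0.1] * 5 + [0.9] * 10 + [0.2] * 2 + [0.9] * 10
    probs += [0.1] * 10 + [0.9] * 3 + [0.1] * 5
    print(segments_from_probs(probs))
```

With these inputs the 64 ms pause is shorter than the minimum silence duration, so the two bursts merge into one segment, while the 96 ms blip is shorter than the minimum speech duration and is discarded. Raising the threshold or the minimum speech duration makes detection stricter; raising the minimum silence duration yields fewer, longer segments.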