* ci : reduce musa image size
This commit contains an attempt to reduce the size of the musa Docker
image by copying only the necessary files from the build stage.
The motivation for this is that the CI runs sometimes fail with out of
memory errors. These seems to be able to pass for PRs, at least
sometimes but fail upon push to the master branch.
* ci : remove build time files instead of selective copying
* ci : add apt-get clean to musa Dockerfile
This commit adds `apt-get clean` to the musa Dockerfile to reduce the
image size by removing cached package files after installation.
The motivation for this is to try to reduce the size of the Docker image
and see if this can avoid the "no space left on device" error during
the CI build process.
Refs: https://github.com/ggml-org/whisper.cpp/actions/runs/15815324254
* Add Apple frameworks to $LDFLAGS when needed
* Add utility method to Options
* Remove unnecessary propaty date from gemspec
* Add Apple frameworks for CoreML build
* Add Accelerate framework only for Apple platform
* Fix ZipURI#cache signature
* Download test fixtures if needed
* Add header and namespace to use enqueue_functions extension
* Convert submit and parallel_for to use new extension in convert.cpp
* Convert submit and parallel_for to use extension in ggml-sycl.cpp
* Convert submit and parallel_for to use extension in gla.cpp
* Convert submit and parallel_for in mmq.cpp
* Convert submit and parallel_for in mmvq.cpp
* Convert submit and parallel_for in remaining files
* Convert all simple parallel_for to nd_launch from enqueue_functions
extension
* Wrapping extension in general function
Create a general function that enable the enqueue_functions extension if
it is enable in the compiler, otherwise call the general SYCL function
to launch kernels.
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* Add PowerPC feature detection and scoring
* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC
* ggml-cpu: Delay some initializations until function is called
When using GGML_BACKEND_DL=ON, these initializations might use
instructions that are not supported by the current CPU.
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* android : update CMakeLists.txt to use FetchContent for ggml
This commit updates the CMakeLists.txt file for the Android Whisper
example to use FetchContent for managing the ggml library.
The motivation for this change is avoid having to make manual changes to
the CMakeLists.txt file after syncing the ggml library.
I've built and run the example locally to verify that it works as
expected.
Refs: https://github.com/ggml-org/whisper.cpp/pull/3265#issuecomment-2986715717
* android.java : update cmake to use FetchContent for ggml
This commit updates the CMake configuration for the Android Java example
to use `FetchContent` for including the `ggml` library. Do be able to
use FetchContent we also update the `compileSdkVersion` and
`targetSdkVersion` to 31, and the `buildToolsVersion` to '30.0.3'.
This also required a an update to the Gradle plugin version to 7.4.0.
The motivation for this change is avoid having to make manual changes to
the CMakeLists.txt file after syncing the ggml library.
This commit adds a conversion from stereo to mono in the
`read_audio_data` function of `common-whisper.cpp`.
The motivation for this change is prior to Commit
7d3da68f79 ("examples : use miniaudio for
direct decoding flac, mp3, ogg and wav (#2759)", there was a step that
read stereo int16 data -> pcm16 (448512 samples), and then converted to
mono (224256 samples), and then also convert to stereo in `pcmf32s.
The middle step here seems to have been missed when rewriting the code to
use Miniaudio and caused issues then transcribing stereo audio files.
For example, currently using the audio sample in the linked issue the
output is:
```console
[00:00:00.000 --> 00:00:03.000] (speaker 1) Sous-titres réalisés para la communauté d'Amara.org
```
And with the change in this commit the output is:
```
[00:00:00.000 --> 00:00:01.500] (speaker 1) *sonnerie de téléphone*
[00:00:01.500 --> 00:00:07.000] (speaker 1) Salut jeune homme !
[00:00:07.000 --> 00:00:08.500] (speaker 0) C'est vrai que je te dérange ?
[00:00:08.500 --> 00:00:10.500] (speaker 1) Ah pas du tout, pas du tout, pas du tout !
[00:00:10.500 --> 00:00:12.500] (speaker 1) J'étais en train de...
[00:00:12.500 --> 00:00:14.500] (speaker 1) de préparer un courrier
```
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3092
* llama : add thread safety test
* llamafile : remove global state
* llama : better LLAMA_SPLIT_MODE_NONE logic
when main_gpu < 0 GPU devices are not used
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Update oneMath commit to merged PR https://github.com/uxlfoundation/oneMath/pull/669
which adds SYCL-Graph support for recording CUDA BLAS commands.
With this change the `MUL_MAT` tests now pass on DPC++ CUDA backends with SYCL-Graph
enabled. Prior to this change, an error would be thrown.
```
$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2
UR CUDA ERROR:
Value: 700
Name: CUDA_ERROR_ILLEGAL_ADDRESS
Description: an illegal memory access was encountered
Function: operator()
Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154
Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```
* ggml-cpu: Factor out feature detection build from x86
* ggml-cpu: Add ARM feature detection and scoring
This is analogous to cpu-feats-x86.cpp. However, to detect compile-time
activation of features, we rely on GGML_USE_<FEAT> which need to be set
in cmake, instead of GGML_<FEAT> that users would set for x86.
This is because on ARM, users specify features with GGML_CPU_ARM_ARCH,
rather than with individual flags.
* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM
Like x86, however to pass around arch flags within cmake, we use
GGML_INTERNAL_<FEAT> as we don't have GGML_<FEAT>.
Some features are optional, so we may need to build multiple backends
per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring
function sort out which one can be used.
* ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now
The other platforms will need their own specific variants.
This also fixes the bug that the the variant-building branch was always
being executed as the else-branch of GGML_NATIVE=OFF. The branch is
moved to an elseif-branch which restores the previous behavior.
This change moves the command pool/buffer tracking into a vk_command_pool
structure. There are two instances per context (for compute+transfer) and
two instances per device for operations that don't go through a context.
This should prevent separate contexts from stomping on each other.
Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8)
and move it to the vk_device. Move all the descriptor pool and set tracking to
the context - none of it is specific to pipelines anymore. It has a single vector
of pools and vector of sets, and a single counter to track requests and a single
counter to track use.
* ggml : disable warnings for tests when using MSVC
This commit disables warnings for tests on windows when using MSVC.
The motivation for this is that this brings the build output more
inline with what Linux/MacOS systems produce.
There is still one warning generated for the tests which is:
```console
Building Custom Rule C:/ggml/tests/CMakeLists.txt
cl : command line warning D9025: overriding '/DNDEBUG' with '/UNDEBUG'
[C:\ggml\build\tests\test-arange.vcxproj]
test-arange.cpp
test-arange.vcxproj -> C:\ggml\build\bin\Release\test-arange.exe
```
* ggml : fix typo in tests disable list
This commit removes the unused `ggml_context_container` structure from
the ggml library. It looks like the usage of this struct was removed in
Commit 4757fe18d56ec11bf9c07feaca6e9d5b5357e7f4 ("ggml : alloc
ggml_contexts on the heap (whisper/2525)").
The motivation for this changes is to improve code clarity/readability.
This commit adds the examples in the "list" of targets to ignore MSVC
warnings.
The motivation for this is that currently the examples generate a number
of warnings that are ignore/disabled for the core ggml project. This
makes for a cleaner output when building.
This commit clears the results_all vector no VAD segments are found.
The motivation for this is that this would normally be done by
`whisper_full_with_state` but when no VAD segments are detected this
current implementation does not call that function and hence the vector
does not get reset. This can lead to issues in applications like the
server example where it will incorrectly process the old results.
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3250
This commit addresses an issue with token timestamps when audio segments
are skipped, in `whisper_exp_compute_token_level_timestamps` related to
the VAD processing and the energy levels.
The motivation for this is that the token timestamps exceed the energy
array bounds due to segment timing misalignment:
```console
(skipped introduction)
↓
Audio segment: [2600ms → 5600ms] (3 seconds of actual audio)
Energy array: [0 → 480652] (samples for 3 seconds)
Token timestamps: [3266ms → 3408ms] (absolute timestamps)
```
So both `s0` and `t1` get clamped to the maximum sample index (480652)
which causes the start/end timestamps to be the same for all the tokens
after a certain point.
This is addressed by using segment-relative timestamps in the
`timestamp_to_sample` and `sample_to_timestamp`.
* server : add Voice Activity Detection (VAD) support
This commit adds support for Voice Activity Detection (VAD) in the
server example.
The motivation for this is to enable VAD processing when using
whisper-server.
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3089
* server : add VAD parameters to usage in README.md [no ci]
This commit also adds a few missing parameters.
* server : fix conflicting short options [no ci]
This commit adds entries to `.gitignore` for directories in the
`ext` directory.
The motivation for this is that currently after building locally these
following files are reported by git as untracked:
```console
Untracked files:
(use "git add <file>..." to include in what will be committed)
ext/examples/
ext/ggml/
ext/include/
ext/scripts/
ext/src/
```
* ci : update windows runner to windows-2022
This commit changes the windows-2019 runner to windows-2022.
The motiation for this is that the windows-2019 runner is scheduled for
deprection and will be removed 2025-06-30. There are currently "burnout"
periods that started 2025-06-01 and during these times jobs with
windows-2019 will fail which has happened lately on our CI.
Refs: https://github.com/actions/runner-images/issues/12045
* ruby : add cleaning of library names in dependencies
This commit adds a cleaning step to the library names in the
`Dependencies` class of the Ruby bindings.
The motivation for this is that with the introduction of a library name
alias for ggml in Commit (b933d17c30
"Add in-build ggml::ggml ALIAS library (ggml/1260)) causes the Makefile
generation to break:
```console
$ sed -n '165,170p' ext/Makefile
CLEANOBJS = $(OBJS) *.bak
TARGET_SO_DIR_TIMESTAMP = $(TIMESTAMP_DIR)/.sitearchdir.time
$(TARGET_SO): libcommon.a libwhisper.a libggml\n(ggml::ggml).a libggml-cpu.a libggml-base.a
libcommon.a libwhisper.a libggml\n(ggml::ggml).a libggml-cpu.a libggml-base.a: cmake-targets
cmake-targets:
/usr/bin/cmake -S sources -B build -D BUILD_SHARED_LIBS=OFF -D CMAKE_ARCHIVE_OUTPUT_DIRECTORY=/home/danbev/work/ai/whisper.cpp/bindings/ruby/ext -D CMAKE_POSITION_INDEPENDENT_CODE=ON
```
* squash! ruby : add cleaning of library names in dependencies
Apply PR review feedback.
* Simplify the environment variable setting to specify the memory pool type.
* Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.
* update
* fix CI
* update
* delete whitespace
* fix according to review
* update CANN.md
* update CANN.md
* Add Reorder to Q6_K mmvq implementation
* Address PR comments: clean up comments
* Remove unused parameter after refactoring q4_k
* Adding inline to function and removing unnecessary reference to int
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* SYCL: Implement few same quantized type copy kernels
* Use memcpy for copying contiguous tensors
ggml-ci
* feat(sycl): add contiguous tensor copy support and device checks
Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.
* refactor: replace specific block copy functions with template
The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.
* Exclude BF16 support for COPY tensors for now
ggml-ci
* perf: adjust SYCL copy kernel block sizes for efficiency
Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
* * ggml-vulkan: adds op CONV_TRANSPOSE_1D
* test-backend-ops: adds more spohisticated tests for CONV_TRANSPOSE_1D
* Missing barrier added to shader.
Number of additional tests reduced to 108.
* * Fixes typo in variable name.
* Removes extra whitespaces.
* Adds int64->int32 casts to prevent possible warnings.
* Problem size reduced in tests to pass tests with llvmpipe.
* supports_op condition moved from unintended position
* This is not needed by the normal use where the result is read
using `tensor_get`, but it allows perf mode of `test-backend-ops`
to properly measure performance.
Some systems report the CPU implementation as "Power11" instead of "POWER11".
The existing CMake logic uses a case-sensitive regular expression to extract
the CPU generation, which fails when the casing doesn't exactly match "POWER".
This patch provides a fix by first converting the string to uppercase before applying the regex.
Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com>
Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com>
* Fix indentation of code sample in document comment
* Make Whisper::Context#transcribe able to run non-parallel
* Add test for Whisper::Context#transcribe with parallel option
* Follow signature API change of Context#transcribe
* Remove useless variable assignment
* Move simple usage up in README
* Add need help section in README
* Add document on Context#transcribe's parallel option in README
* Update date
* Fix signature of Context.new
* Make Context#subscribe accept n_processors option
* Make test follow #transcribe's change
* Make RBS follow #transcribe's change
* Add document for #transcribe's n_processors option
* Rename test directory so that Rake tasks' default setting is used
This commit updates the build workflow to replace `ports.ubuntu.com`
with `mirror.kumi.systems` in the apt sources list for ARM64 builds.
The motivation for this change is intended to improve package download
reliability and speed by using a more stable mirror for ARM64 packages.
This pull request fixes a bug in the fullTranscribeWithTime method, where the whisperParams argument was declared but never used. As a result, the model did not apply the configuration defined in whisperParams.
This commit add support for language detection in the Whisper Node.js
addon example. It also updates the node addon to return an object
instead of an array as the results.
The motivation for this change is to enable the inclusion of the
detected language in the result, in addition to the transcription
segments.
For example, when using the `detect_language` option, the result will
now be:
```console
{ language: 'en' }
```
And if the `language` option is set to "auto", it will also return:
```console
{
language: 'en',
transcription: [
[
'00:00:00.000',
'00:00:07.600',
' And so my fellow Americans, ask not what your country can do for you,'
],
[
'00:00:07.600',
'00:00:10.600',
' ask what you can do for your country.'
]
]
}
```
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling
We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.
Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggresively which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* threading: disable SetThreadInfo() calls for older Windows versions
* Update tools/llama-bench/llama-bench.cpp
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* 1. add "integrated" in ggml_cuda_device_info for distinguish whether it is Intergrate_gpu or discrete_gpu
2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted code indentation
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Fixed incorrect setting of variable types
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted the judgment logic
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Add a defensive security assert
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted the support judgment logic.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* revoke the suggest commit changes due to it's not applicable in jetson_device
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Add parentheses to enforce operator precedence
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Fix ci bug: add a spaces
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: yangxiao <yang_xl@tju.edu.cn>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: yangxiao <yangxl_zz@qq.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* SYCL: Add mrope kernel
* feat: Optimize rope operations with vectorization
Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.
* Use ceil_div
* cmake: Define function for querying architecture
The tests and results match exactly those of src/CMakeLists.txt
* Switch arch detection over to new function
* Prevent overflow
* Fix memsize of Whisper::Context
* Rename xxx_initialize to more Ruby-esque name: xxx_s_new
* Define Whisper::Model::ZipURI
* Define Whisper::Model.coreml_compiled_models
* Make Options' @cmake_options Hash
* Use --{enable,disable}-whisper-coreml option for -I/opt/homebrew/opt/llvm/include
* Prepare Core ML model if enabled
* Add test for ZipURI
* Add signatures for ZipURI
* Add Whisper.system_info_str
* Add test for Whisper.system_info_str
* Add signagure for Model.coreml_compiled_models
* Add signature for Whisper.system_info_str
* Add test for Core ML
* Update date
* Maintain .gitignore
* vad : revisit timestamp alignment/mapping
This commit improving the timestamp alignment by introducing a mapping
table, adding intermediate reference points for longer segments, and
binary search for lookups.
The motivation for this changes is to address issues with the currently
solution where zero-length segments are possible, and also to improve
the precision of the VAD timestamps.
Refs: https://github.com/ggml-org/whisper.cpp/issues/3162
* vad : use uint64_t for time mapping
This commit changes the type of the `processed_time` and `original_time`
fields in the `vad_time_mapping` struct from `double` to `uint64_t`.
The motivation for this change is made to improve precision and avoid
floating-point inaccuracies and also be consistent with other part of
the code base that use `uint64_t` for time representation.
This is a part of a refactoring where I'm also going to change the
vad_segment_info struct to use `uint64_t` for the start and end times.
This is the reason for the not so pleasant conversion and casts in the
code at the moment.
* vad : change vad_segment_info and whisper_vad_segment to use uint64_t
* vad : use int64_t instead of uint64_t for timestamps
To be consistent with other timestamps in the codebase.
* vad : add centisecond conversion functions
* vad : extract vad processing from whisper_full_with_state
This commit extracts the VAD processing from the
`whisper_full_with_state` function into the `whisper_full` and
`whisper_full_parallel` functions.
The motivation for this is that I did not take into account that when
`whisper_full_parallel` is called with `n_processors > 1`, then the
vad processing would not be applied correctly. Instead the VAD
processing should be done prior to processing in the case of
`whisper_full_parallel`.
* vad : remove filtered_n_samples from whisper_vad
The commit removes the parameter `filtered_n_samples` from the
`whisper_vad` function signature and its usage, as it is no longer
needed since filtered samples is now a vector (previously it was a
float*)
The motivation for this is to simplify the usage of this function.
* vad : remove vad_mapping_table_initialized flag
* vad : fix leaning (none) of pointer/references
* Don't pass empty string to cmake command
* Refactor Dependencies
* Use found cmake path for options
* Maintain extsources.rb
* List dependent files by directory separator agnostic way
* Prepend whitespace before '='
* Handle build options on install
* Remove useless test
* Retrieve gem file name and version from spec file
* Bump version to 1.3.3
* Update date
* Add install option examples
* [skip ci]Remove unused module
* Add VAD models
* Extract function to normalize model path from ruby_whisper_initialize()
* Define ruby_whisper_vad_params struct
* Add VAD-related features to Whisper::Params
* Add tests for VAD-related features
* Define Whisper::VADParams
* Add Whisper::VAD::Params attributes
* Add test suite for VAD::Params
* Make older test to follow namespace change
* Add test for transcription with VAD
* Add assertion for test_vad_params
* Add signatures for VAD-related methods
* Define VAD::Params#==
* Add test for VAD::Params#==
* Fix Params#vad_params
* Add test for Params#vad_params
* Fix signature of Params#vad_params
* Use macro to define VAD::Params params
* Define VAD::Params#initialize
* Add tests for VAD::Params#initialize
* Add signature for VAD::Params.new
* Add documentation on VAD in README
* Wrap register_callbask in prepare_transcription for clear meanings
* Set whisper_params.vad_params just before transcription
* Don't touch NULL
* Define ruby_whisper_params_type
* Use TypedData_XXX for ruby_whisper_params instead of Data_XXX
* Remove unused functions
* Define rb_whisper_model_data_type
* Use TypedData_XXX for ruby_whisper_model instead of Data_XXX
* Define ruby_whisper_segment_type
* Use TypedData_XXX for ruby_whisper_segment instead of Data_XXX
* Define ruby_whisper_type
* Use TypedData_XXX for ruby_whisper instead of Data_XXX
* Qualify with const
* tests : add a new benchmark test for long-form audio
Based on "Earnings-21" corpus by Del Rio et al.
Earnings-21: A Practical Benchmark for ASR in the Wild (2021)
https://arxiv.org/abs/2104.11348
This dataset contains 39 hours of long-form speech, sourced from public
earning calls. Each recording contains roughly 50 minutes of English
dialogues between multiple speakers (2-20 persons).
This benchmark suite should allow us to evaluate the performance of
whisper.cpp on long-form audio data.
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
* tests : apply PR feedback to 'earnings21/README.md'
Based on feedback from Daniel Bevenius.
- Simplify how to download & prepare a Silero VAD model.
- Fix typo: inferece -> inference
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
* tests : avoid crashing on non-UTF-8 characters
Based on feedback from Daniel Bevenius.
Add 'errors' parameter to open() in order to avoid unhandled
exception on invalid UTF-8 bytes.
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
* tests : try to interpret the hypothesis as Windows-1252
Based on the discussion in PR#3185.
Evidently Whisper.cpp can represent a quotation mark as '0x93', which
implifies Windows-1252 (Microsoft's ASCII excention), and cannot be
decoded by UTF-8.
Add an explicit decoding loop to address the issue.
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
---------
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
This commit modifies windows-blas which was updated previously to use
the zip functionality provided by `actions/upload-artifact`. This turned
out to be incorrect and I should not have done that. The reason for
zipping the archives first is that otherwise the artifacts when
downloaded will be unzipped and just be simple directories. In our case
the release task depends on the artifacts having a .zip extension so
that those archives are include in the release.
* SYCL: Add non contiguous input support to norm kernel
* refactor and add RMS_NORM non contiguous input support
ggml-ci
* restore subgroup reduction for multi-subgroup thread blocks in norm kernels
* Swap grid dims of nsamples and nrows
ggml-ci
* Revert "Swap grid dims of nsamples and nrows"
This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.
* restore not required changes
ggml-ci
* address review comments: change it to more like SYCL
* Use a common function to calculate offset
* remove wrap around logic for handling broadcasts
* remove static from calculate_offset fn and use ceil_div
* cann: add the basic FA support
* cann: update the readme
* cann: update the FlashAttention with PSEShift
* cann: update the input parameters in FA
* cann: update the alibi with max_bias
* cann: add the constrints of softcap
* cann: update the docs CANN.md
* cann: update the docs CANN.md
* cann: fix typo of CANN.md
* cann: add some comments and update the CANN.md
* cann: update the CANN.md
* cann: update the inner precise for fusedInferAttention
* cann: update the constraints of flash_attn_ext on ggml-cann.cpp
* cann: clean the whitespace
* cann: clean the whitespace
* cann: add a new endline
* opencl: Add support for multiple devices
... but limited to one platform. A platform with a GPU will be preferred.
Additionally:
* Filter out devices that lack capabilities needed by the backend
implementation (half support, OpenCL 2.0+, etc).
* Make ggml_backend_opencl_reg() thread-safe.
* fixup: fix an error in sync_with_other_backends
... when there is only one OpenCL device available.
* opencl: fix couple crashes
* fix kernel launches failed on devices which do not support
non-uniform work-groups. When non-uniform work-groups are not
supported, set `local_work_size` to NULL (= let driver choose the
work-group sizes). This patch does not cover everything - just the
cases tested by test-backend-ops.
* fix sub-buffer creation failed due to `cl_buffer_region::origin` not
being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
* musa: fix build warning (unused parameter)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: upgrade MUSA SDK version to rc4.0.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Update ggml/src/ggml-cuda/cpy.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Remove mmap workaround on windows
After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.
* Update llama-bench README
SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag
This commit updates the README_sycl.md file to use UTF-8 encoding.
The motivation for this is that while this file displays correctly in
github it will fail to render with tools that expect UTF-8 encoding.
For example this is the case when using `grip` to view the file locally.
This commit enable the node addon to suppress all output, even the
result of the transcription if the no_prints parameter is set to true.
The motivation for this is that for the node addon there is a
fullfilment handler/success callback to process the transcription
result. And it might be useful to be able to disable the printing of
the transcription result to the console, so that the user can handle
the result in their own way.
Refs: https://github.com/ggml-org/whisper.cpp/issues/3176
Quick fix for not removing swedish umlauts.
* Update talk-llama.cpp
Expose model inference settings to user instead of hard coding them. Same defaults as previous defaults.
* Update examples/talk-llama/talk-llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ci : use dynamic libopenblas.dll for window-blas
This commit updates the windows-blas job to use the dynamic (can load
different kernels depending of the CPU arch) libopenblas.dll instead of
the "static" openblas.dll that get installed by vcpgk.
The motivation for this change is that there have been reports of
performance drops in later version specifically related to blas. Please
see the links below for more details.
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3166
Refs: https://github.com/ggml-org/whisper.cpp/issues/2666#issuecomment-2885978811
This commit removes some redundant assignments in the function
`whisper_exp_compute_token_level_timestamps`.
The motivations for this is that tokens[j] and token are references to
the same object and this can be a little confusing when reading the
code.
* Fix CMakeLists.txt to handle deprecated gpu Warnings
* Conditionally apply -Wno-deprecated-gpu-targets only when GGML_CUDA is enabled
* Conditionally apply -Wno-deprecated-gpu-targets only when GGML_CUDA is enabled and not MSVC
---------
Co-authored-by: Jugal Sheth <jugal.sheth@marineai.co.uk>
This commit adds the `GGML_SYCL_DNN` option to the Ruby bindings for
the GGML library. This option as added to ggml in
Commit (5e7e07758a5f3172380500e173ca71f679bbef1e "sycl: use oneDNN for
matrices multiplication")
The motivation for this change to enable the CI build to pass.
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci
* metal : use FA-vec kernel up to batch size 20
ggml-ci
This commit adds a check to `whisper_full_with_state` and if no VAD
segments are detected, the function will return early.
The motivation for this is that if no VAD segments are detected, the
function will not have any samples to process which can happen if an
audio sample does not contain any speech. I did not test this previously
and only discovered this when updating the stream example.
* vad : store VAD context in whisper_state
This commit stores the VAD context in the whisper_state structure,
allowing for better management and reuse of the VAD context across
multiple calls to the whisper_vad function.
The motivation for this change is that when updating the stream example
I noticed that the VAD context was being re-initialized every time the
whisper_vad function was called. This involved loading the VAD model
which is expensive and unnecessary if the context can be reused.
Storing this in the whisper_state seems follow the pattern simliar to
how whisper_coreml_context and whisper_openvion_context are stored.
* vad : free vad_context in whisper_free_state
This commit add `build_*/` to `.gitignore` to ignore all build
directories that start with `build_`.
The motivation for this is that the Go bindings creates a directory
named build_go, which is not ignored by the current .gitignore. I was
not sure if changing this to build-go could effect exising users so I
opted to update .gitignore instead.
* examples : add --print-confidence option to cli
This commit adds a new command-line option `--print-confidence` to the
whisper-cli. When enabled, this option prints the confidence level of each
token in the transcribed text using ANSI formatting codes.
The confidence levels are represented using different styles:
```console
main: confidence: highlighted (low confidence), underlined (medium), dim (high confidence)
```
Refs: https://github.com/ggml-org/whisper.cpp/issues/3135
* vad : add download-vad-model scripts
This commit adds a script to download VAD models.
* vad : add vad model download script for windows [no ci]
Refs: https://github.com/ggml-org/whisper.cpp/issues/3146
This commit adds the `--flash-attn` option to the usage output of the
server example.
The motivation for this change is that while it is possible to set this
option it is not printed in the usage output.
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
* vulkan: scalar flash attention implementation
* vulkan: always use fp32 for scalar flash attention
* vulkan: use vector loads in scalar flash attention shader
* vulkan: remove PV matrix, helps with register usage
* vulkan: reduce register usage in scalar FA, but perf may be slightly worse
* vulkan: load each Q value once. optimize O reduction. more tuning
* vulkan: support q4_0/q8_0 KV in scalar FA
* CI: increase timeout to accommodate newly-supported tests
* vulkan: for scalar FA, select between 1 and 8 rows
* vulkan: avoid using Float16 capability in scalar FA
* sycl : Implemented reorder Q4_0 mmvq
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* sycl : Fixed mmvq being called when reorder is disabled
* sycl : Improved comments in the quants header
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* Use static_assert
* safe_div -> ceil_div
* Clarify qi comment
* change the reorder tensor from init to execute OP
* dbg
* Undo changes to test-backend-ops
* Refactor changes on top of q4_0 reorder fix
* Missing Reverts
* Refactored opt_for_reorder logic to simplify code path
* Explicit inlining and unroll
* Renamed mul_mat_algo enum for consistency
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:
GGML_ASSERT(nei0 * nei1 <= 3072);
The tensor is 8 x 512. Increase this array size to accommodate.
This commit adds an example that demonstrates how to use a VAD (Voice
Activity Detection) model to segment an audio file into speech segments.
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3144
* vad : add initial Voice Activity Detection (VAD) support
This commit add support for Voice Activity Detection (VAD). When enabled
this feature will process the audio input and detect speech segments.
This information is then used to reduce the number of samples that need
to be processed by whisper_full.
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3003
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit adds a description of the color scheme used in the CLI
when the --print-colors option is enabled.
The motivation for this is that it is not immediately clear what the
color scheme is when using the CLI with the --print-colors option.
Example output:
```console
$ ./build/bin/whisper-cli -f samples/jfk.wav --print-colors
...
main: color scheme: red (low confidence), yellow (medium), green (high confidence)
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
```
The description will not be dispayed if the `--no-prints` options is
set.
Refs: https://github.com/ggml-org/whisper.cpp/issues/3135
This commit updates the link to Paul Tol's color scheme in the
`examples/common.h` file. The previous link was outdated and
pointed to a non-existent page.
* Test Ruby bindings' extra options only when commanded
* ruby : test extra build options only when env var specified
* Fix extra_options
* Update gem date
This commit omits the test for `test_build_options` when run locally as
it currently fails on Linux and MacOS platforms.
`
The motivation for this change is that currently when running the tests
locally on a non-macOS platform the test fails with the following error:
```console
.F
========================================================================
Failure: test_build_options(TestPackage):
<["ACCELERATE_FRAMEWORK",
"CMAKE_OSX_ARCHITECTURES",
"CMAKE_OSX_SYSROOT",
"FOUNDATION_LIBRARY",
"METALKIT_FRAMEWORK",
"METAL_FRAMEWORK"]> was expected to be empty.
/home/danbev/work/ai/whisper.cpp/bindings/ruby/tests/test_package.rb:43:in `test_build_options'
40: options = BuildOptions::Options.new
41: assert_empty options.missing_options
42: unless ENV["CI"]
=> 43: assert_empty options.extra_options
44: end
45: end
46: end
========================================================================
```
This commit adds HEAPU8 to the list of exported methods.
The motivation for this commit is that currently this is causing an error on Window systems where HEAPU8 in undefined, which results in the following error message in the web console:
main.js:1 Uncaught TypeError:
Cannot read properties of undefined (reading 'buffer') at __emval_get_property
(main.js:1:1363125) at 003a453a:0xc4a47 at 003a453a:0xc51cd at
Object.full_default (eval at craftInvokerFunction (main.js:1:1347011),
<anonymous>:9:10) at whisper.cpp/:647:42
danbev originally fixed this for whisper.wasm, stream.wasm, and command.stream, but the issue still exists on the other examples which I patch in this code.
Resolves: #3059
This commit updates the documentation for the WASM examples to include a
note about the generation of the `worker.js` file. As of Emscripten
3.1.58 (April 2024), separate worker.js files are no longer generated
and the worker is embedded in the main JS file.
The motivation for this change is to inform users about the new behavior
of Emscripten and why the `worker.js` file may not be present.
Refs: https://github.com/ggml-org/whisper.cpp/issues/3123
* whisper : deprecate WHISPER_CCACHE CMake option
This commit deprecates the WHISPER_CCACHE CMake option in favor of
the GGML_CCACHE option.
The motivation for this change is that currently when setting, or not
setting WHISPER_CCACHE, the outut message from ggml will be that to
enable ccache you need to set GGML_CCACHE which can be confusing.
This also seems to be inline with what llama.cpp does which does not
have a LLAMA_CCACHE option as far as I know.
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3063
* ruby : change "WHISPER_CCACHE" to "GGML_CCACHE"
* ruby : move GGML_CCACHE to sorted position
* stream.wasm : add HEAPU8 to exported runtime methods
This commit adds HEAPU8 to the list of exported methods for stream.wasm.
The motivation for this is that without it HEAPUD8 will be undefined
and when its 'buffer' attribute is accessed this will cause error as
reported in the referenced issue.
Note that to test this make sure that the web browsers caches is cleared
first.
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3123
* command.wasm : add HEAPU8 to exported runtime methods
This patch upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for BF16 data type.
This change results in 9x - 40x gains
in total speed S t/s (ie all tokens/total time), across various batch sizes tested using llama-batched-bench benchmark.
The patch is tested with Meta-Lllama-3-8B,
and Mistral-7B models (BF16 models generated by using llama-quantize from corresponding FP32 models) on an IBM POWER10 machine.
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
The following scenario will cause an assertion failure in the graph
allocator:
- Build and allocate a graph containing a tensor with a non-NULL data
pointer
- Build and allocate a new graph where that data is NULL
Result:
ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed
This happens during revalidation because we think that memory should
have been previously allocated based on the current graph but in
reality the previous graph was different. In this situation, we
should do a full reallocation pass.
* vulkan: Add bfloat16 support
This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16.
The extension is required for coopmat multiply support, but matrix-vector
multiply trivially promotes bf16 to fp32 and doesn't require the extension.
The copy/get_rows shaders also don't require the extension.
It's probably possible to fall back to non-coopmat and promote to fp32 when
the extension isn't supported, but this change doesn't do that.
The coopmat support also requires a glslc that supports the extension, which
currently requires a custom build.
* vulkan: Support bf16 tensors without the bf16 extension or coopmat support
Compile a variant of the scalar mul_mm shader that will promote the bf16
values to float, and use that when either the bf16 extension or the coopmat
extensions aren't available.
* vulkan: bfloat16 fixes (really works without bfloat16 support now)
* vulkan: fix spirv-val failure and reenable -O
This commit adds steps to the windows jobs to zip and upload
artifacts produced.
The motivation for this is that currently the artifacts are not zipped
which means that will not be picked up by the release job and hence not
be included in github releases.
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3119
This commit add the .zip extension to the xcframework artifact name in
the GitHub Actions workflow.
The motivation for this that the release job will look for .zip files
and will not find the xcframework artifact without the extension, and
hence will not upload it to the release.
* ggml : remove MSVC warnings pragmas
This commit removes the MSVC-specific pragmas as these are now handled
in CMakeLists.txt.
* whisper : remove MSVC warning pragmas
This commit removes the MSVC-specific pragmas. These are now handled in
the CMakeLists.txt file.
This changes examples/cli/cli.cpp to be like
examples/common-whisper.cpp. "-of -" can be specified (or this can be
inferred from "-" as the input file) to output to stdout. This is useful
for piping to other applications.
Log fname_out consistently when not stdout
- Terminals have stdout=stderr, so remove the message before
successful output to ease copying
- Don't affect actual error messages
- Move opening the ofstream into the factory, fixing missing
open and/or error messages in output_score/output_wts
- Fix struct naming convention
Closes#3048
* docs : Update cli documentation
This updates the documentation of cli based on the actual output
In the longterm this should ideally be auto generated to prevent mismatch
* docs : Update cli documentation
This updates the documentation of cli based on the actual output
In the longterm this should ideally be auto generated to prevent mismatch
* Use cache file when model host doesn't support if-modified-since
* Update gem date
* Revert "ruby : ignore "Downloading" output in test_log_suppress (#3106)"
This reverts commit edbd4cb7f5.
Build fails with compilation error on power pc.
This patch fixes the same.
Tested with unit tests run via
--build <build_dir> && cd <build_dir> && make test
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* fix(rpc): Improve input validation and error handling
The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.
This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:
- **Type Validation:** `deserialize_tensor` now checks if the
`tensor->type` is within the valid `GGML_TYPE_COUNT` range
*before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
`set_tensor_hash`, and `get_tensor` handlers with error
logging and returning `false` when data/offset parameters
are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
`graph_compute` when calculating required message sizes based
on client-provided `n_nodes` and `n_tensors`. Returns early
if the reported sizes conflict with the actual message size or
would lead to overflow.
- **Error Propagation:**
- `create_node` now checks for `nullptr` return values from
`deserialize_tensor` and its recursive calls, propagating
`nullptr` upwards on failure. Uses `find` instead of `at`
for safer map access.
- `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
and sets the response status to failure if deserialization
or bounds checks fail.
- `graph_compute` now checks for `nullptr` return from
`create_node` and returns failure status correctly. The final
return value now reflects the actual computation status.
These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): address pr comments
removed comments and unnecessary returns
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): ambiguous nullptr from create_node
rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).
This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
`create_node` returns nullptr, correctly identifying failures
versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
nullptr unambiguously on failure during recursion.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): initial zero check in create_node
The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.
Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix(rpc): Handle get_alloc_size failure in server
Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): input size validation in graph_compute
Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove extra status code setting
Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove redundant check for tensor->type
Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* SYCL: Add all missing unary kernels
ggml-ci
* decouple kernel launch range from data size using strided loop
* use ciel_div helper for num_blocks
ggml-ci
* clean auto imported header files
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.
The performance impact of this change depends on the network latency.
This commit adds a check to makes sure that the target exists before
trying to add compile options to ignore warnings when using MSVC.
The motivation for this is currently the build is broken depending on
the cmake options provided. With this fix it should be possible to build
even if the targets are not actually available.
Refs: https://github.com/ggml-org/whisper.cpp/pull/3090#issuecomment-2842760104
This commit adds the the command line option `--no-gpu` to the server
examples print usage function.
The motivation for this is that this options is available and can be set
but it is not displayed in the usage message.
Refs: https://github.com/ggml-org/whisper.cpp/issues/3095
This commit adds a temporary fix to the `test_log_suppress` test in the
Ruby bindings.
The motivation for this changes is that I suspect that the recent
migration of the models to HuggingFace Xet has changed the way HTTP
caching works for the models. This is causing the test in question to
fail. This is a temporary fix so that CI is not broken while we
investigate this further.
* whisper: suppress Windows compiler warnings
This commit disables compiler warnings on window using MSVC.
The motivation for these changes is that some compilers generate
warnings for these conversion, for example Windows MSVC, and
there are quite a few of them. This makes it a little difficult to
spot new warnings that may be introduced and also can be difficult
for users/embedders of ggml where these warnings are hard to separate
from their own warnings.
* squash! whisper: suppress Windows compiler warnings
Move ggml related warnings into ggml. This commit also fixes the
indentation and adds a missing whitespace to the if statement.
This commit addresses a warnings that is present for Release builds:
```console
[ 30%] Building CXX object src/CMakeFiles/whisper.dir/whisper.cpp.o
In file included from /usr/include/c++/13/bits/stl_tree.h:63,
from /usr/include/c++/13/map:62,
from /home/danbev/work/ai/whisper.cpp/src/whisper-arch.h:5,
from /home/danbev/work/ai/whisper.cpp/src/whisper.cpp:2:
In static member function ‘static void std::__copy_move<false, false, std::random_access_iterator_tag>::__assign_one(_Tp*, _Up*) [with _Tp = const whisper_grammar_element*; _Up = const whisper_grammar_element* const]’,
inlined from ‘static _Up* std::__copy_move<_IsMove, true, std::random_access_iterator_tag>::__copy_m(_Tp*, _Tp*, _Up*) [with _Tp = const whisper_grammar_element* const; _Up = const whisper_grammar_element*; bool _IsMove = false]’ at /usr/include/c++/13/bits/stl_algobase.h:440:20,
inlined from ‘_OI std::__copy_move_a2(_II, _II, _OI) [with bool _IsMove = false; _II = const whisper_grammar_element* const*; _OI = const whisper_grammar_element**]’ at /usr/include/c++/13/bits/stl_algobase.h:506:30,
inlined from ‘_OI std::__copy_move_a1(_II, _II, _OI) [with bool _IsMove = false; _II = const whisper_grammar_element* const*; _OI = const whisper_grammar_element**]’ at /usr/include/c++/13/bits/stl_algobase.h:533:42,
...
```
This warning is caused by the fact that the `stack` vector is empty
when it is passed to `new_stacks.push_back(stack);`.
The suggested fix is to use `new_stacks.emplace_back();` instead of
`new_stacks.push_back(stack);`.
This commit removes the empty `.gitmodules` file from the repository.
The motivation of this is that this file is currently empty and the
project does not use any submodules at this time. Removing it mainly to
reduce clutter in the repository and any confusion when seen the file
in repo.
This commit disables the publishing of the Java binding to the Maven
repository.
The motivation for this is that this job was disabled for some time and
recently it was re-enabled, but the publishing of the Java binding
caused the build to fail and needs to be investigated further.
Refs: https://github.com/ggml-org/whisper.cpp/issues/3079
* Update PATH for main/main-cuda container
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Add Dockerfile for musa, .dockerignore and update CI
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Add Moore Threads GPU Support in README.md and replace ./main with whisper-cli
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Forward GGML_CUDA/GGML_MUSA to cmake in Makefile
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Minor updates for PATH ENV in Dockerfiles
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Lazy run TestBase.whisper
* Fix indentation
* Remove disused GGML_HIP_UMA from Ruby
* Add encoder_begin_callback
* Comment out existing abort mechanism
* Add test for encoder_begin_callback
* Add signatures for encoder_begin_callback related methods
* Update gem date
* tune matmul for gcn
* this one is more power efficient
* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp
Co-authored-by: 0cc4m <picard12@live.de>
* disable this tune for the proprietary driver
---------
Co-authored-by: 0cc4m <picard12@live.de>
Add RPC_CMD_HELLO for getting the version of the protocol implemend by
the server. Follow the semantic versioning rules at https://semver.org
Hopefully this bring better user experience when we make breaking
changes at the protocol level and avoid issues like #12465
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
ggml-ci
* llama : minor naming updates
ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
ggml-ci
Submit operators using asynchronous threads to improve performance.
Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.
Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
The grouped query attention optmization doesn't require a power of two ratio,
the only thing relying on it was the modulo operation written as bitwise &.
split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicates a larger
FA operation that wouldn't need splitting.
Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
Multiple optional memory pools are provided for CANN, including VMM,
priority queue-based, and traditional memory pools.
1.When the memory pool is available and GGML_CANN_DISABLE_VMM_POOL
is not defined, the VMM pool is selected by default.
2.Otherwise, if GGML_CANN_ENABLE_BUF_PRIO_POOL is defined,
the priority queue-based memory pool is used.
3.If neither condition is met, the default memory pool is used.
The current usage of the SYCL-Graph extension checks for
the `sycl_ext_oneapi_graph` device aspect. However, it is also
possible to support `sycl_ext_oneapi_limied_graph` devices that
don't support update
* SYCL: Add fp16 support to some elementwise OP kernels
* remove comment
ggml-ci
* Use static_cast directly
* remove not needed cast from tanh
* Use static cast and remove unneeded castings
* Adjust device_support_op for unary OPs
* Use cast_data and typed_data struct to deduplicate casting code
* [CANN] Support ELU and CONV_TRANSPOSE_1D
* [CANN]Modification review comments
* [CANN]Modification review comments
* [CANN]name adjustment
* [CANN]remove lambda used in template
* [CANN]Use std::func instead of template
* [CANN]Modify the code according to the review comments
---------
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.
This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.
The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used
ggml-ci
* server : add test that exercises embeddings with FA enabled
ggml-ci
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
There seems to be a bubble waking up from waitForFences, which costs a few
percent performance and also increased variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete and we
waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting
for the final fence to be signaled.
* Prefer vector flash decoding kernel for Gemma models
Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category.
Removing this limit improves e2e performance by upto 12% in gen phase throughput for Gemm models.
* Update ggml/src/ggml-cuda/fattn.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers
Previously there was complexity in the CUDA graphs implementation due
frequently changing parameters to copy kernels associated with K and V
cache pointers. This patch simplifies by using indirection to avoid
such parameters frequently changing, avoiding the need for frequent
graph updates.
Fixes#12152
* Addressed comments
* fix HIP builds
* properly sync to stream
* removed ggml_cuda_cpy_fn_ptrs
* move stream sync before free
* guard to only use indirection with graphs
* style fixes
* check for errors
---------
Co-authored-by: slaren <slarengh@gmail.com>
When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:
dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))
previously we would run 32 workgroups computing 1 result each, now we will
run 8 workgroups computing 4 results each.
This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.
* add bf16 support
* use convert_from_bf16_cuda instead of convert_unary_cuda for f32
* revert 7ec5085
* move functionality into convert_unary with constexpr
* coreml : skip model load in convert-whisper-to-coreml.py
This commit updates the conversion process for Whisper models to use the
"mlprogram" format instead of "neuralnetwork".
The motivation for this change is that when using the "neuralnetwork"
format the underlying model produced is based on protobuf and my
understanding is that there are limitations to this format, such as
sizes of strings and the complexity of the model.
Currently when trying to convert larger models such as large-v3 the
conversion fails but succeeds for smaller models.
The "mlprogram" format is a more recent addition to CoreML and is
designed to be more flexible and powerful, allowing for more complex
models and larger data types. This seems to work for larger and smaller
models alike and unless I'm there are considerations that I'm not aware
of I think this is what we should be using moving forward.
The error that is generated for large models is the following:
```console
Running MIL backend_neuralnetwork pipeline: 100%|█████████| 9/9 [00:00<00:00, 35.44 passes/s]
Translating MIL ==> NeuralNetwork Ops: 100%|███████████| 5641/5641 [03:31<00:00, 26.65 ops/s]
Traceback (most recent call last):
File "/Users/danbev/work/ai/whisper-work/models/convert-whisper-to-coreml.py", line 322, in <module>
encoder = convert_encoder(hparams, encoder, quantize=args.quantize)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/models/convert-whisper-to-coreml.py", line 255, in convert_encoder
model = ct.convert(
^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/converters/_converters_entry.py", line 635, in convert
mlmodel = mil_convert(
^^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 186, in mil_convert
return _mil_convert(
^^^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 245, in _mil_convert
return modelClass(
^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/models/model.py", line 489, in __init__
self.__proxy__, self._spec, self._framework_error = self._get_proxy_and_spec(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/models/model.py", line 550, in _get_proxy_and_spec
_MLModelProxy(
ValueError: basic_string
```
Refs: https://github.com/ggml-org/whisper.cpp/issues/3012
This commit disables the FreeBSD job in build.yml of the GitHub Actions
workflow.
The motivation for this is that this job seems to stall and timeout from
time to time, taking up to 6 hours to complete/cancel.
This commit adds `HEAPU8` to the list of exported methods.
The motivation for this commit is that currently this is causing an
error on Window systems where HEAPU8 in undefined, which results in the
following error message in the web console:
```console
main.js:1 Uncaught TypeError:
Cannot read properties of undefined (reading 'buffer') at __emval_get_property
(main.js:1:1363125) at 003a453a:0xc4a47 at 003a453a:0xc51cd at
Object.full_default (eval at craftInvokerFunction (main.js:1:1347011),
<anonymous>:9:10) at whisper.cpp/:647:42
```
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3059
* Fix signature of URI.new7s return value
* Use path instead of string | _ToPath
* Add document comment to RBS
* Remove unnecessary build flags
* Remove unnecessary line
* Remove files have become unnecessary
* Make gem install accept build options for whisper.cpp
* Add instraction for build options in README
* Add methods for check to Options
* Test build options
* Rename: configs -> options
* Add assert_installed assertion
* Use assert_installed
* Remove unused attribute
* Extract dependency check logic as Dependencies class
* Update README
* Add WHISPER_FFMPEG option
* Test extra build options only on local test
* Bump version to 1.3.2 [skip ci]
FFmpeg introduced a new channel layout API that uses `AVChannelLayout`
interface in v6.0. It subsequently dropped the old bitmask-based API
in v7.0.
This updates decode_audio() to support the new channel layout API,
so that we can compile `whisper-cli` and `whisper-server` with FFmpeg
v7.0 or later.
Tested on on Ubuntu 24.10 with FFmpeg v7.0.2.
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
* Use CMake to build shared object
* Make Rakefile follow change of build process
* Add test for packaging
* Run CI for Ruby bindings almost always
because each CMakeLists.txt might affect Ruby bindings
* Enable PIC
* Bump Ruby version to 3.2 on CI
* Check libgomp
* Check dependency of whisper.cpp accurately
FFmpeg integration was introduced in 1b51fdf by William Tambellini,
but not mentioned in the main documentation.
Add a short guide on how to enable the feature. Confirmed to work
on both Ubuntu 24.04 and Fedora 39.
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
This commit adds a check for the visionos build version used with vtool
in build-xcframework.sh. The script now checks the Xcode version and
determines whether to use "xros" or "visionos" for the build version.
This commit also uses xcrun for the vtool so that the version of vtool
in xcode command line tools is used instead of the one in the system
path.
Refs: https://github.com/ggml-org/whisper.cpp/pull/2994#issuecomment-2773292223
* tests : add script to benchmark whisper.cpp on LibriSpeech corpus
LibriSpeech is a widely-used benchmark dataset for training and
testing speech recognition models.
This adds a set of scripts to measure the recognition accuracy of
whisper.cpp models, following the common benchmark standards.
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
* Document how to prepare `whisper-cli` and model files
Feedback from Daniel Bevenius.
This adds a short code example how to prepare the `whisper-cli`
command, to make the initial setup step a little bit clearer.
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
* tests : Simplify how to set up Python environment
Based on a feedback from Georgi Gerganov.
Instead of setting up a virtual environment in Makefile, let users
set up the Python environment. This is better since users may have
their own preferred workflow/toolkit.
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
---------
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
The benchmark script 'scripts/bench-all.sh' assumes that the 11th
field of the output line is a timestamp. This assumption does not
hold when the target model takes a bit longer to process.
Fix this issue by introducing an explicit whitespace to the output
lines of `whisper_print_timings()`.
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
This commit updates examples/server.py which is used to serve the wasm
examples locally. The changes include:
- Added a redirect from the root URL to /whisper.cpp.
So now accessing http://localhost:8000/ will redirect to
http://localhost:8000/whisper.cpp/ which matches the url for the app
deployed to github pages.
- Custom handling for coi-serviceworker.js to serve it to avoid
and error in the console. This file is not strictly necessary
for the local server to work as the headers are provided already but
it is nice to not have an error in the console.
- Fixed the shutdown of the server to ensure it exits cleanly
on Ctrl+C. Previously it would continue to hang onto the port even
after the processed had exited.
* whisper.wasm : fix unknown language issue
This commit addresses an issue with whisper.wasm where the following
error was being displayed when running the application in github pages:
```
whisper_lang_id: unknown language 'д=␙c'
```
This turned out to be a memory corruption issue and further details
can be found in the reference issue below.
Refs: https://github.com/ggerganov/whisper.cpp/issues/2998
* cpu: refactor SIMD mappings and vectorized op functions into separate files
* Fix warning for ggml_float to float
* Fix warnings
* cpu: move all the operations (except mul_mat) to a separate c++ file
* fix whitespace
* Update ggml/src/ggml-cpu/vec.h
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp
* Reverse the order of import for ops.h and vec.h, to match what was present in ggml-cpu.c previously
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
This adds a section to the README.md file that describes how to use the
XCFramework.
The modification for this is that is not obvious how to use the
XCFramework and and example will help.
One thing to note is that the example is using the latest release
including the checksum. We are thinking about how we might automate
this in the future but for now this is a good start.
- [x] Linux / [FreeBSD](https://github.com/ggerganov/whisper.cpp/issues/56#issuecomment-1350920264)
- [x] Linux / [FreeBSD](https://github.com/ggml-org/whisper.cpp/issues/56#issuecomment-1350920264)
- [x] [WebAssembly](examples/whisper.wasm)
- [x] Windows ([MSVC](https://github.com/ggerganov/whisper.cpp/blob/master/.github/workflows/build.yml#L117-L144) and [MinGW](https://github.com/ggerganov/whisper.cpp/issues/168)]
- [x] Windows ([MSVC](https://github.com/ggml-org/whisper.cpp/blob/master/.github/workflows/build.yml#L117-L144) and [MinGW](https://github.com/ggml-org/whisper.cpp/issues/168))
@ -225,7 +225,7 @@ speed-up - more than x3 faster compared with CPU-only execution. Here are the in
The first run on a device is slow, since the ANE service compiles the Core ML model to some device-specific format.
Next runs are faster.
For more information about the Core ML implementation please refer to PR [#566](https://github.com/ggerganov/whisper.cpp/pull/566).
For more information about the Core ML implementation please refer to PR [#566](https://github.com/ggml-org/whisper.cpp/pull/566).
## OpenVINO support
@ -267,7 +267,7 @@ This can result in significant speedup in encoder performance. Here are the inst
- Build `whisper.cpp` with OpenVINO support:
Download OpenVINO package from [release page](https://github.com/openvinotoolkit/openvino/releases). The recommended version to use is [2023.0.0](https://github.com/openvinotoolkit/openvino/releases/tag/2023.0.0).
Download OpenVINO package from [release page](https://github.com/openvinotoolkit/openvino/releases). The recommended version to use is [2024.6.0](https://github.com/openvinotoolkit/openvino/releases/tag/2024.6.0). Ready to use Binaries of the required libraries can be found in the [OpenVino Archives](https://storage.openvinotoolkit.org/repositories/openvino/packages/2024.6/)
After downloading & extracting package onto your development system, set up required environment by sourcing setupvars script. For example:
@ -310,7 +310,7 @@ This can result in significant speedup in encoder performance. Here are the inst
The first time run on an OpenVINO device is slow, since the OpenVINO framework will compile the IR (Intermediate Representation) model to a device-specific 'blob'. This device-specific blob will get
cached for the next run.
For more information about the OpenVINO implementation please refer to PR [#1037](https://github.com/ggerganov/whisper.cpp/pull/1037).
For more information about the OpenVINO implementation please refer to PR [#1037](https://github.com/ggml-org/whisper.cpp/pull/1037).
@ -388,8 +444,9 @@ Run the inference examples as usual, for example:
We have two Docker images available for this project:
1. `ghcr.io/ggerganov/whisper.cpp:main`: This image includes the main executable file as well as `curl` and `ffmpeg`. (platforms: `linux/amd64`, `linux/arm64`)
2. `ghcr.io/ggerganov/whisper.cpp:main-cuda`: Same as `main` but compiled with CUDA support. (platforms: `linux/amd64`)
1. `ghcr.io/ggml-org/whisper.cpp:main`: This image includes the main executable file as well as `curl` and `ffmpeg`. (platforms: `linux/amd64`, `linux/arm64`)
2. `ghcr.io/ggml-org/whisper.cpp:main-cuda`: Same as `main` but compiled with CUDA support. (platforms: `linux/amd64`)
3. `ghcr.io/ggml-org/whisper.cpp:main-musa`: Same as `main` but compiled with MUSA support. (platforms: `linux/amd64`)
Use the [scripts/bench-wts.sh](https://github.com/ggerganov/whisper.cpp/blob/master/scripts/bench-wts.sh) script to generate a video in the following format:
Use the [scripts/bench-wts.sh](https://github.com/ggml-org/whisper.cpp/blob/master/scripts/bench-wts.sh) script to generate a video in the following format:
```bash
./scripts/bench-wts.sh samples/jfk.wav
@ -597,7 +654,7 @@ In order to have an objective comparison of the performance of the inference acr
use the [whisper-bench](examples/bench) tool. The tool simply runs the Encoder part of the model and prints how much time it
took to execute it. The results are summarized in the following Github issue:
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```
And it can then be used with whisper as follows:
```console
$ ./build/bin/whisper-cli \
--file ./samples/jfk.wav \
--model ./models/ggml-base.en.bin \
--vad \
--vad-model ./models/silero-v5.1.2-ggml.bin
```
### VAD Options
* --vad-threshold: Threshold probability for speech detection. A probability
for a speech segment/frame above this threshold will be considered as speech.
* --vad-min-speech-duration-ms: Minimum speech duration in milliseconds. Speech
segments shorter than this value will be discarded to filter out brief noise or
false positives.
* --vad-min-silence-duration-ms: Minimum silence duration in milliseconds. Silence
periods must be at least this long to end a speech segment. Shorter silence
periods will be ignored and included as part of the speech.
* --vad-max-speech-duration-s: Maximum speech duration in seconds. Speech segments
longer than this will be automatically split into multiple segments at silence
points exceeding 98ms to prevent excessively long segments.
* --vad-speech-pad-ms: Speech padding in milliseconds. Adds this amount of padding
before and after each detected speech segment to avoid cutting off speech edges.
* --vad-samples-overlap: Amount of audio to extend from each speech segment into
the next one, in seconds (e.g., 0.10 = 100ms overlap). This ensures speech isn't
cut off abruptly between segments when they're concatenated together.
## Examples
There are various examples of using the library for different projects in the [examples](examples) folder.
@ -668,13 +834,13 @@ Some of the examples are even ported to run in the browser using WebAssembly. Ch
| [whisper.android](examples/whisper.android) | | Android mobile application using whisper.cpp |
| [whisper.nvim](examples/whisper.nvim) | | Speech-to-text plugin for Neovim |
| [generate-karaoke.sh](examples/generate-karaoke.sh) | | Helper script to easily [generate a karaoke video](https://youtu.be/uj7hVta4blM) of raw audio capture |
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators<EFBFBD>such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators—such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.
oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.
Intel uses the SYCL as direct programming language to support CPU, GPUs and FPGAs.
To avoid re-inventing the wheel, this code refers other code paths in llama.cpp (like OpenBLAS, cuBLAS, CLBlast). We use a open-source tool [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) (Commercial release [Intel<EFBFBD> DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) migrate to SYCL.
To avoid re-inventing the wheel, this code refers other code paths in llama.cpp (like OpenBLAS, cuBLAS, CLBlast). We use a open-source tool [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) (Commercial release [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) migrate to SYCL.
The whisper.cpp for SYCL is used to support Intel GPUs.
@ -84,10 +84,10 @@ Platform #0: Intel(R) OpenCL HD Graphics
a. Please follow the procedure in [Get the Intel<EFBFBD> oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
a. Please follow the procedure in [Get the Intel® oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
Recommend to install to default folder: **/opt/intel/oneapi**.
See whisper.cpp's [README](https://github.com/ggml-org/whisper.cpp/blob/master/README.md) for available options. You need convert options present the README to Ruby-style options, for example:
For boolean options like `GGML_CUDA`, the README says `-DGGML_CUDA=1`. You need strip `-D`, prepend `--enable-` for `1` or `ON` (`--disable-` for `0` or `OFF`) and make it kebab-case: `--enable-ggml-cuda`.
For options which require arguments like `CMAKE_CUDA_ARCHITECTURES`, the README says `-DCMAKE_CUDA_ARCHITECTURES="86"`. You need strip `-D`, prepend `--`, make it kebab-case, append `=` and append argument: `--cmake-cuda-architectures="86"`.
@ -99,9 +127,80 @@ See [models][] page for details.
Currently, whisper.cpp accepts only 16-bit WAV files.
### Voice Activity Detection (VAD) ###
Support for Voice Activity Detection (VAD) can be enabled by setting `Whisper::Params`'s `vad` argument to `true` and specifying VAD model:
```ruby
Whisper::Params.new(
vad:true,
vad_model_path:"silero-v5.1.2",
# other arguments...
)
```
When you pass the model name (`"silero-v5.1.2"`) or URI (`https://huggingface.co/ggml-org/whisper-vad/resolve/main/ggml-silero-v5.1.2.bin`), it will be downloaded automatically.
Currently, "silero-v5.1.2" is registered as pre-converted model like ASR models. You also specify file path or URI of model.
If you need configure VAD behavior, pass params for that:
```ruby
Whisper::Params.new(
vad:true,
vad_model_path:"silero-v5.1.2",
vad_params:Whisper::VAD::Params.new(
threshold:1.0,# defaults to 0.5
min_speech_duration_ms:500,# defaults to 250
min_silence_duration_ms:200,# defaults to 100
max_speech_duration_s:30000,# default is FLT_MAX,
speech_pad_ms:50,# defaults to 30
samples_overlap:0.5# defaults to 0.1
),
# other arguments...
)
```
For details on VAD, see [whisper.cpp's README](https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#voice-activity-detection-vad).
# Run the entire model: PCM -> log mel spectrogram -> encoder -> decoder -> text
# Not thread safe for same context
# Uses the specified decoding strategy to obtain the text.
#
# The second argument +samples+ must be an array of samples, respond to :length, or be a MemoryView of an array of float. It must be 32 bit float PCM audio data.
assert_equal"& so my fellow Americans --> ask not what your country can do for you <-- ask what you can do for your country.\n",@whisper.to_srt.lines[2]
end
deftest_to_webvtt_escape
assert_equal"& so my fellow Americans --> ask not what your country can do for you <-- ask what you can do for your country.\n",@whisper.to_webvtt.lines[4]
fprintf(stderr," -vm FNAME, --vad-model FNAME [%-7s] VAD model path\n",params.vad_model.c_str());
fprintf(stderr," -vt N, --vad-threshold N [%-7.2f] VAD threshold for speech recognition\n",params.vad_threshold);
fprintf(stderr," -vspd N, --vad-min-speech-duration-ms N [%-7d] VAD min speech duration (0.0-1.0)\n",params.vad_min_speech_duration_ms);
fprintf(stderr," -vsd N, --vad-min-silence-duration-ms N [%-7d] VAD min silence duration (to split segments)\n",params.vad_min_silence_duration_ms);
fprintf(stderr," -vmsd N, --vad-max-speech-duration-s N [%-7s] VAD max speech duration (auto-split longer)\n",params.vad_max_speech_duration_s==FLT_MAX?
-vt N, --vad-threshold N [0.50 ] VAD threshold for speech recognition
-vspd N, --vad-min-speech-duration-ms N [250 ] VAD min speech duration (0.0-1.0)
-vsd N, --vad-min-silence-duration-ms N [100 ] VAD min silence duration (to split segments)
-vmsd N, --vad-max-speech-duration-s N [FLT_MAX] VAD max speech duration (auto-split longer)
-vp N, --vad-speech-pad-ms N [30 ] VAD speech padding (extend segments)
-vo N, --vad-samples-overlap N [0.10 ] VAD samples overlap (seconds between segments)
```
> [!WARNING]
@ -67,3 +87,35 @@ curl 127.0.0.1:8080/load \
-H "Content-Type: multipart/form-data" \
-F model="<path-to-model-file>"
```
## Load testing with k6
> **Note:** Install [k6](https://k6.io/docs/get-started/installation/) before running the benchmark script.
You can benchmark the Whisper server using the provided bench.js script with [k6](https://k6.io/). This script sends concurrent multipart requests to the /inference endpoint and is fully configurable via environment variables.
fprintf(stderr," -vm FNAME, --vad-model FNAME [%-7s] VAD model path\n",params.vad_model.c_str());
fprintf(stderr," -vt N, --vad-threshold N [%-7.2f] VAD threshold for speech recognition\n",params.vad_threshold);
fprintf(stderr," -vspd N, --vad-min-speech-duration-ms N [%-7d] VAD min speech duration (0.0-1.0)\n",params.vad_min_speech_duration_ms);
fprintf(stderr," -vsd N, --vad-min-silence-duration-ms N [%-7d] VAD min silence duration (to split segments)\n",params.vad_min_silence_duration_ms);
fprintf(stderr," -vmsd N, --vad-max-speech-duration-s N [%-7s] VAD max speech duration (auto-split longer)\n",params.vad_max_speech_duration_s==FLT_MAX?
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.