whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-06-24 01:03:25 +00:00

Author	SHA1	Message	Date
Akarshan Biswas	489dc158a6	SYCL: Implement few same quantized type copy kernels (llama/13739) * SYCL: Implement few same quantized type copy kernels * Use memcpy for copying contiguous tensors ggml-ci * feat(sycl): add contiguous tensor copy support and device checks Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance. * refactor: replace specific block copy functions with template The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed. * Exclude BF16 support for COPY tensors for now ggml-ci * perf: adjust SYCL copy kernel block sizes for efficiency Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.	2025-06-10 12:40:33 +03:00
Masato Nakasaka	f0f5a9f7fb	vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (llama/14001) * allowing B580 and U9-288V * experimenting code to detect Xe2 * allowing coopmat only for Xe2 GPUs * fixed comment wording * fixed comment wording * removed unnecessary driver check	2025-06-10 12:40:33 +03:00
Diego Devesa	13a03c5d33	llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (llama/14013)	2025-06-10 12:40:33 +03:00
Jeff Bolz	6dd91d4f7e	vulkan: automatically deduce size of push constants (llama/13936)	2025-06-10 12:40:33 +03:00
Ervin Áron Tasnádi	5171b24f70	ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (llama/13813) * * ggml-vulkan: adds op CONV_TRANSPOSE_1D * test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D * Missing barrier added to shader. Number of additional tests reduced to 108. * * Fixes typo in variable name. * Removes extra whitespaces. * Adds int64->int32 casts to prevent possible warnings. * Problem size reduced in tests to pass tests with llvmpipe. * supports_op condition moved from unintended position	2025-06-10 12:40:33 +03:00
Diego Devesa	23e2fe0682	releases : use dl backend for linux release, remove arm64 linux release (llama/13996)	2025-06-10 12:40:33 +03:00
Johannes Gäßler	7f4d110f53	CUDA: fix FTZ in FA for Gemma 3 (llama/13991)	2025-06-10 12:40:33 +03:00
Jeff Bolz	ee0ef39fee	vulkan: fix warnings in perf logger querypool code (llama/13937)	2025-06-10 12:40:33 +03:00
lhez	62791ba2e6	opencl: add `backend_synchronize` (llama/13939) * This is not needed by the normal use where the result is read using `tensor_get`, but it allows perf mode of `test-backend-ops` to properly measure performance.	2025-06-10 12:40:33 +03:00
rmatif	e16ef08884	OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (llama/13840) * add concat, pad, repeat, tsembd, tanh, upscale * small fixes	2025-06-10 12:40:33 +03:00
Georgi Gerganov	c72d3ce935	metal : use F32 accumulators in FA kernels (llama/13975) ggml-ci	2025-06-10 12:40:33 +03:00
shalinib-ibm	126aeb4a49	cmake : Handle mixed-case 'Power' strings in POWER CPU detection (llama/13966) Some systems report the CPU implementation as "Power11" instead of "POWER11". The existing CMake logic uses a case-sensitive regular expression to extract the CPU generation, which fails when the casing doesn't exactly match "POWER". This patch provides a fix by first converting the string to uppercase before applying the regex. Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com> Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com>	2025-06-10 12:40:33 +03:00
Atharva Dubey	ef2a79d2b8	sycl: quantize and reorder the input to q8_1 when reorder is enabled (llama/13826) * [WIP]: fuse q8 quantization and reorder * wip2: fuse q8 quantization and reorder * working q8 reorder commit * restored common.hpp * remove debug prints * remove unnecessary headers and remove trailing whitespace * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com> --------- Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>	2025-06-10 12:40:33 +03:00
Johannes Gäßler	9589645e72	gguf: fix failure on version == 0 (llama/13956)	2025-06-10 12:40:33 +03:00
Aaron Teo	20f913d119	ggml: check if non-native endian model is being loaded (llama/13943) * gguf: prevent non-native endian models from being loaded Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * gguf: update error message Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * gguf: make the non-native endian check more verbose Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move ggml_assert location Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: reword the endianness check error message Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-06-10 12:40:33 +03:00
Kai Pastor	b933d17c30	Add in-build ggml::ggml ALIAS library (ggml/1260) Enable uniform linking with subproject and with find_package.	2025-06-10 12:40:33 +03:00
Max Krasnyansky	1e16340f4b	threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (llama/12995) * threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling We talked about adding LOW priority for GGML threads in the original threadpool PR. It might be useful for some cases to avoid contention. Latest Windows ARM64 releases started parking (offlining) the CPU cores more aggressively which results in suboptimal performance with n_threads > 4. To deal with that we now disable Power Throttling for our threads for the NORMAL and higher priorities. Co-authored-by: Diego Devesa <slarengh@gmail.com> * threading: disable SetThreadInfo() calls for older Windows versions * Update tools/llama-bench/llama-bench.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-06-01 15:14:44 +03:00
Shawn yang	4a50254998	CUDA: add a prop in ggml_cuda_device_info to distinguish iGPU from dGPU in cuda (#13856) (llama/13895) * 1. add "integrated" in ggml_cuda_device_info to distinguish whether it is integrated_gpu or discrete_gpu 2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted code indentation Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Fixed incorrect setting of variable types Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted the judgment logic Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()' * Update ggml/src/ggml-cuda/ggml-cuda.cu Add a defensive security assert Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted the support judgment logic. Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * revoke the suggest commit changes due to it's not applicable in jetson_device * Update ggml/src/ggml-cuda/ggml-cuda.cu Add parentheses to enforce operator precedence Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update ggml/src/ggml-cuda/ggml-cuda.cu Fix ci bug: add a spaces Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: yangxiao <yang_xl@tju.edu.cn> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: yangxiao <yangxl_zz@qq.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-06-01 15:14:44 +03:00
Johannes Gäßler	a5aff28198	CUDA: fix typo in FlashAttention code (llama/13926)	2025-06-01 15:14:44 +03:00
Diego Devesa	6c0472ab8f	sched : avoid changing cur_copy when a graph is already allocated (llama/13922)	2025-06-01 15:14:44 +03:00
Diego Devesa	b14cee184a	cuda : prevent using split buffers with 3d/4d matrices (llama/13919)	2025-06-01 15:14:44 +03:00
Akarshan Biswas	f7f92d0aab	SYCL: Add mrope kernel (llama/13755) * SYCL: Add mrope kernel * feat: Optimize rope operations with vectorization Uses `sycl::vec` to load and store two elements at a time, significantly improving performance in `rope_norm`, `rope_neox`, and `rope_multi`. This reduces the number of memory accesses and leverages SIMD instructions for faster execution. * Use ceil_div	2025-06-01 15:14:44 +03:00
Christian Kastner	1893359cfd	cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (llama/13890)	2025-06-01 15:14:44 +03:00
Yibo Cai	ea643c6ae3	arm64: optimize q4_k_q8_k kernel with i8mm (llama/13886) This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction. Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
    -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
    --no-mmap -fa \
    -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
    -npl 1,2,4,8,16,32 \
    -t 64
---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
	2025-06-01 15:14:44 +03:00
Christian Kastner	1d7b3c79f4	cmake: Factor out CPU architecture detection (llama/13883) * cmake: Define function for querying architecture The tests and results match exactly those of src/CMakeLists.txt * Switch arch detection over to new function	2025-06-01 15:14:44 +03:00
Vineel Abhinav	ccfaac2bb0	ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (llama/13882) * F32-Mamba-Seq_Scan-SVE * Fix formatting * ggml : missing space --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-01 15:14:44 +03:00
Vineel Abhinav	1230d37bca	ggml: aarch64: Implement SVE F32 kernels for vector functions (llama/13843) * F32-Mamba-SVE * F32-Mamba-SVE * Resolve test errors-1 * Resolve test errors-2 * F32-vec-SVE * F32-vec-SVE * F32-vec-SVE	2025-06-01 15:14:44 +03:00
Johannes Gäßler	9a500394ad	CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)	2025-06-01 15:14:44 +03:00
leo-pony	0035b8527c	CANN: Add SOC TYPE printing in cmake configuration (llama/13837)	2025-06-01 15:14:44 +03:00
lhez	3623186312	opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (llama/13787) * opencl: add `argsort` * opencl: add `div` * opencl: add `add_rows` * opencl: add `sub` * opencl: add `sigmoid`, both `f16` and `f32` * opencl: add `group_norm`	2025-06-01 15:14:44 +03:00
lhez	67beac47f3	opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (llama/13790)	2025-06-01 15:14:44 +03:00
Jeff Bolz	47a19bae25	vulkan: use timestamp queries for GGML_VULKAN_PERF (llama/13817) Also change it to be controlled by an env var rather than cmake flag	2025-06-01 15:14:44 +03:00
Akarshan Biswas	3d5c7ca4bc	SYCL: add gelu_erf kernel (llama/13749) * SYCL: add gelu_erf kernel * refactor code Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com> * Use scope_op_debug_print --------- Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>	2025-06-01 15:14:44 +03:00
Xuan-Son Nguyen	4dfb2c2215	ggml : add ggml_repeat_4d (llama/13824)	2025-06-01 15:14:44 +03:00
Kai Pastor	ad433403ce	vulkan : Remove unexpected ; (ggml/1253)	2025-06-01 15:14:44 +03:00
Kai Pastor	4064dd6484	cmake : Fix broken CMake error messages (ggml/1252)	2025-06-01 15:14:44 +03:00
Radoslav Gerganov	fd75c4995b	ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247) The implementation is already deleted with commit 9d0762e. closes: #1235	2025-06-01 15:14:44 +03:00
Daniel Tang	4d18e52f55	ggml : Fix backtrace breaking Windows build (#3203)	2025-05-29 13:26:58 +03:00
Radoslav Gerganov	48dddbbac1	ggml : install dynamic backends (ggml/1240)	2025-05-29 09:56:26 +03:00
Daniel Tang	5ea2c37a4c	ggml : Print backtrace on uncaught C++ exceptions (ggml/1232) The goal is to have what users call "full logs" contain the backtrace. This is registered upon ggml_init. Also fixes a minor fd leak on Linux.	2025-05-29 09:56:26 +03:00
Simon Booth	5720426d97	whisper : install shared libs when using GGML_BACKEND_DL (#3195)	2025-05-28 10:15:04 +02:00
xctan	15ae9dc2a4	ggml : riscv: add xtheadvector support (llama/13720) * ggml : riscv: add xtheadvector support * ggml : clean up some macro usage	2025-05-27 18:03:00 +03:00
Christian Kastner	2e7a1e3e43	ggml-cpu: x86 feature detection is specific to x86 (llama/13811)	2025-05-27 18:03:00 +03:00
Diego Devesa	b75babebb2	ggml : allow CUDA graphs when using pipeline parallelism (llama/13814)	2025-05-27 18:03:00 +03:00
Georgi Gerganov	cc7a0105ef	cuda : avoid cuGetErrorString (llama/13791) ggml-ci	2025-05-27 18:03:00 +03:00
Akarshan Biswas	195fde8804	SYCL: Add non contiguous support in RMS_NORM and NORM kernels (llama/13611) * SYCL: Add non contiguous input support to norm kernel * refactor and add RMS_NORM non contiguous input support ggml-ci * restore subgroup reduction for multi-subgroup thread blocks in norm kernels * Swap grid dims of nsamples and nrows ggml-ci * Revert "Swap grid dims of nsamples and nrows" This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf. * restore not required changes ggml-ci * address review comments: change it to more like SYCL * Use a common function to calculate offset * remove wrap around logic for handling broadcasts * remove static from calculate_offset fn and use ceil_div	2025-05-27 18:03:00 +03:00
Romain Biessy	25e27904ca	sycl: Add more debug prints (llama/13640)	2025-05-27 18:03:00 +03:00
Jeff Bolz	474f7be8b6	vulkan: mark IM2COL as supporting non-contig (llama/13783)	2025-05-27 18:03:00 +03:00
Bizhao Shi	e35fecc2a1	CANN: Add the basic supports of Flash Attention kernel (llama/13627) * cann: add the basic FA support * cann: update the readme * cann: update the FlashAttention with PSEShift * cann: update the input parameters in FA * cann: update the alibi with max_bias * cann: add the constraints of softcap * cann: update the docs CANN.md * cann: update the docs CANN.md * cann: fix typo of CANN.md * cann: add some comments and update the CANN.md * cann: update the CANN.md * cann: update the inner precise for fusedInferAttention * cann: update the constraints of flash_attn_ext on ggml-cann.cpp * cann: clean the whitespace * cann: clean the whitespace * cann: add a new endline	2025-05-27 18:03:00 +03:00
Akarshan Biswas	1cd7028428	SYCL: revert "sycl: simplify bin_bcast_kernel (ggml/13383)" (llama/13752) Temporarily reverted due to failing fp16 DIV operation This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5. ggml-ci	2025-05-27 18:03:00 +03:00
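
Several SYCL entries in this log (489dc158a6, f7f92d0aab, 195fde8804) mention switching to `ceil_div` so that kernel launch sizes cover every element. The sketch below only illustrates that rounding idea; the helper's signature and the `num_groups` wrapper are assumptions made for this example, and the actual ggml helper may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (name taken from the commit messages, signature assumed):
// integer division rounded up, for non-negative n and positive d.
constexpr int64_t ceil_div(int64_t n, int64_t d) {
    assert(n >= 0 && d > 0);
    return (n + d - 1) / d;
}

// Hypothetical wrapper: number of work-groups needed so that
// num_groups * block_size >= n_elements, i.e. no trailing element is skipped.
constexpr int64_t num_groups(int64_t n_elements, int64_t block_size) {
    return ceil_div(n_elements, block_size);
}

// Plain division would give 2 groups here and leave 2 elements uncovered.
static_assert(ceil_div(10, 4) == 3, "10 elements need 3 blocks of 4");
```

With plain truncating division, a tensor whose size is not a multiple of the block size would lose its tail; rounding up trades that for a bounds check inside the kernel.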

961 Commits
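
Commit 126aeb4a49 above describes fixing POWER CPU detection by converting the reported string to uppercase before applying a case-sensitive regex. The real change lives in CMake; the following is only a C++ analogue of the same idea, with the function name `power_generation` and the pattern chosen for illustration rather than taken from the repository.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: extract the POWER generation number from a CPU
// description string. The input is upper-cased first, so "Power11" and
// "POWER11" are handled identically by the case-sensitive pattern.
std::optional<int> power_generation(std::string cpu) {
    std::transform(cpu.begin(), cpu.end(), cpu.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    static const std::regex re("POWER([0-9]+)");
    std::smatch m;
    if (std::regex_search(cpu, m, re)) {
        return std::stoi(m[1].str());
    }
    return std::nullopt;  // not a POWER CPU string
}
```

Normalizing the input once keeps the pattern simple, rather than making the regex itself case-insensitive.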