whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-05-31 14:30:42 +00:00

Author	SHA1	Message	Date
Jeff Bolz	3c26dd3353	vulkan: fix NaN issue in flash attention shader (llama/12776) Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.	2025-04-24 20:39:16 +03:00
Jeff Bolz	d792d2a2dc	vulkan: Use unclamped loads for flash attention mask (llama/12720) nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.	2025-04-24 20:39:16 +03:00
0cc4m	8add58aa5e	Vulkan: Tune Vulkan mmq int dot shader for performance (llama/12767)	2025-04-24 20:39:16 +03:00
Nicolò Scipione	8f8ede1b12	sycl: allow ggml-sycl configuration and compilation using Visual Studio project/solution (llama/12625)	2025-04-24 20:39:16 +03:00
Ronny Brendel	3a6fe8d767	cmake: fix ggml-shaders-gen compiler paths containing spaces (llama/12747) fixes error for compiler paths with spaces	2025-04-24 20:39:16 +03:00
Jeff Bolz	76231bda56	vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (llama/12630) There seems to be a bubble waking up from waitForFences, which costs a few percent performance and also increased variance in performance. This change inserts an "almost_ready" fence when the graph is about 80% complete and we waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting for the final fence to be signaled.	2025-04-24 20:39:16 +03:00
Jeff Bolz	785437c253	vulkan: set cmake minimum and project name in vulkan-shaders (llama/12744)	2025-04-24 20:39:16 +03:00
Gaurav Garg	2f0612cb1c	CUDA: Prefer vector flash decoding kernel for Gemma models (llama/12738) * Prefer vector flash decoding kernel for Gemma models Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category. Removing this limit improves e2e performance by upto 12% in gen phase throughput for Gemm models. * Update ggml/src/ggml-cuda/fattn.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-04-24 20:39:16 +03:00
Jeff Bolz	e944065d5b	vulkan: Fix missing cmake logic for dot product extension (llama/12721)	2025-04-24 20:39:16 +03:00
a3sh	ccc7b5df0b	fix MUSA compiler warning (llama/12704) * fix MUSA compiler warning * replace (void) with GGML_UNUSED	2025-04-24 20:39:16 +03:00
Chenguang Li	fbed36851e	CANN: Support operator SIN COS ARGMAX (llama/12709) * [CANN]support sin cos argmax Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]codestyle adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]Remove redundant code Signed-off-by: noemotiovon <noemotiovon@gmail.com> --------- Signed-off-by: noemotiovon <noemotiovon@gmail.com> Co-authored-by: noemotiovon <noemotiovon@gmail.com>	2025-04-24 20:39:16 +03:00
Alan Gray	d1d847f184	Simplify and improve CUDA graphs through use of indirect copy pointers (llama/9017) * CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers Previously there was complexity in the CUDA graphs implementation due frequently changing parameters to copy kernels associated with K and V cache pointers. This patch simplifies by using indirection to avoid such parameters frequently changing, avoiding the need for frequent graph updates. Fixes #12152 * Addressed comments * fix HIP builds * properly sync to stream * removed ggml_cuda_cpy_fn_ptrs * move stream sync before free * guard to only use indirection with graphs * style fixes * check for errors --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-04-24 20:39:16 +03:00
hipudding	337f91d4a6	CANN: Fix failed test cases (llama/12708) * CANN: Fix memory waste in aclnn_tensor * CANN: fix backend ops fail * CANN: fix acl_tensor memory alloc. * CANN: format * CANN: remove trailing whitespace	2025-04-24 20:39:16 +03:00
lhez	317a0031f9	opencl: use `max_alloc_size` in backend ctx instead of querying again (llama/12705)	2025-04-24 20:39:16 +03:00
Jeff Bolz	b243416918	vulkan: Implement split_k for coopmat2 flash attention. (llama/12627) When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.	2025-04-24 20:39:16 +03:00
bandoti	6e532c7187	cmake: remove caching from vulkan coopmat checks (llama/12719)	2025-04-24 20:39:16 +03:00
Jeff Bolz	2105b110d3	vulkan: Implement grouped query attention in the coopmat2 FA shader (llama/12559) When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when: dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1)) previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each. This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.	2025-04-24 20:39:16 +03:00
0cc4m	f82622180f	Vulkan: Fix mmq int dot float cache size (llama/12722)	2025-04-24 20:39:16 +03:00
Diego Devesa	a71c64512a	llama : add option to override model tensor buffers (llama/11397) * llama : add option to override tensor buffers * ggml : fix possible underflow in ggml_nbytes	2025-04-24 20:39:16 +03:00
Georgi Gerganov	1e9c2f87f1	ggml : simplify Arm fp16 CPU logic (ggml/1177) * ggml : simlpify Arm fp16 CPU logic ggml-ci * cont : bring back CUDA/MUSA checks ggml-ci	2025-04-24 20:39:16 +03:00
Sigbjørn Skjæret	06ce8f83e6	CUDA: don't convert BF16 weights to FP32 (ggml/1174) * add bf16 support * use convert_from_bf16_cuda instead of convert_unary_cuda for f32 * revert 7ec5085 * move functionality into convert_unary with constexpr	2025-04-24 20:39:16 +03:00
cmdr2	513ecf8dc0	cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167) * cpu: refactor SIMD mappings and vectorized op functions into separate files * Fix warning for ggml_float to float * Fix warnings * cpu: move all the operations (except mul_mat) to a separate c++ file * fix whitespace * Update ggml/src/ggml-cpu/vec.h Co-authored-by: Diego Devesa <slarengh@gmail.com> * Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp * Reverse the order of import for ops.h and vec.h, to match what was present in ggml-cpu.c previously --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-04-03 10:30:16 +03:00
Chenguang Li	d7a9346ab1	get_rows and dup optimization (llama/12671) * [CANN]get_rows and dup optimization. Co-authored-by: hipudding <huafengchun@gmail.com> Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]GET_ROWS and CPY/DUP optimization Co-authored-by: hipudding <huafengchun@gmail.com> Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> --------- Signed-off-by: noemotiovon <noemotiovon@gmail.com> Co-authored-by: noemotiovon <noemotiovon@gmail.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-04-02 15:51:57 +03:00
Junil Kim	b63d23f728	opencl : fix memory allocation size (llama/12649) issue: https://github.com/CodeLinaro/llama.cpp/pull/17#issuecomment-2760611283 This patch fixes the memory allocation size not exceeding the maximum size of the OpenCL device.	2025-04-02 15:51:57 +03:00
Georgi Gerganov	f6ce10e4a1	metal : use F32 prec in FA kernels (llama/12688) * metal : use F32 prec in FA kernels ggml-ci * cont : fix FA vec kernel ggml-ci	2025-04-02 15:51:57 +03:00
R0CKSTAR	6cb2b86581	Fix clang warning in gguf_check_reserved_keys (llama/12686) * Fix clang warning in gguf_check_reserved_keys Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix typo Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-04-02 15:51:57 +03:00
Wagner Bruna	801d6bd809	vulkan: fix build when glslc doesn't support coopmat (llama/12683)	2025-04-02 15:51:57 +03:00
Romain Biessy	ddf7e6a15d	SYCL: Rename oneMKL to oneMath (llama/12192) * Rename oneMKL Interface to oneMath * Use oneMath for Intel vendor * Rename occurences to mkl * clang-format * Silence verbose warnings * Set oneMath HIP_TARGETS * Fix silence warnings * Remove step to build oneMath from build instructions * Use fixed oneMath version * Remove INTEL_CPU * Fold CMake oneDNN conditions * Use Intel oneMKL for Intel devices * Improve CMake message * Link against MKL::MKL_SYCL::BLAS only * Move oneMath documentation to Nvidia and AMD sections	2025-04-02 15:51:57 +03:00
Akarshan Biswas	0d42097fd3	SYCL: switch to SYCL namespace (llama/12674)	2025-04-02 15:51:57 +03:00
a3sh	842b9c984c	ggml : faster ssm scan (llama/10558) * faster ssm_scan * delete unused commnet * clang format * add space * modify unnecessary calculations * faster ssm conv implementatioin * modify file name with dash	2025-04-02 15:51:57 +03:00
0cc4m	0810f02547	Vulkan: Add DP4A MMQ and Q8_1 quantization shader (llama/12135) * Vulkan: Add DP4A MMQ and Q8_1 quantization shader * Add q4_0 x q8_1 matrix matrix multiplication support * Vulkan: Add int8 coopmat MMQ support * Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code * Add GL_EXT_integer_dot_product check * Remove ggml changes, fix mmq pipeline picker * Remove ggml changes, restore Intel coopmat behaviour * Fix glsl compile attempt when integer vec dot is not supported * Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq * Remove redundant comment * Fix integer dot check * Fix compile issue with unsupported int dot glslc * Update Windows build Vulkan SDK version	2025-04-02 15:51:57 +03:00
Georgi Gerganov	8c13c78f9d	cmake : fix whitespace (llama/0)	2025-04-02 15:51:57 +03:00
Akarshan Biswas	2e2f0f954b	SYCL: Remove misleading ggml_sycl_op_flatten function (llama/12387) * SYCL: Remove misleading ggml_sycl_op_flatten function * remove trailing whitespace * Fix L2 norm from rebase * remove try catch block from element_wise.cpp * remove comment from common.hp * ggml-sycl.cpp: Add try catch sycl::exception block in compute_forward * norm.cpp: remove try catch exception block	2025-03-31 14:56:53 +03:00
Georgi Gerganov	93631b2be6	metal : use constexpr in FA kernels + fix typedef (llama/12659) * metal : use constexpr in FA kernels ggml-ci * cont ggml-ci * cont : fix typedef ggml-ci	2025-03-31 14:56:53 +03:00
R0CKSTAR	f9015b585b	musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (llama/12611) * musa: fix all warnings Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: update ci doc (install ccache) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * fix Windows build issue Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-31 14:56:53 +03:00
Jay	1880ffd7ff	cmake : fix ccache conflict (llama/12522) If users already set CMAKE_C_COMPILER_LAUNCHER globally, setting it in cmake again will lead to conflict and compile fail. Signed-off-by: Jay <BusyJay@users.noreply.github.com>	2025-03-31 14:56:53 +03:00
Xuan-Son Nguyen	9173932c78	cpu : rm unused variable (ggml/1166)	2025-03-31 14:56:53 +03:00
cmdr2	94c3f3877f	cpu: de-duplicate some of the operators and refactor (ggml/1144) * cpu: de-duplicate some of the operators and refactor * Fix PR comments * Fix PR comments	2025-03-31 14:56:53 +03:00
Sandro Hanea	00086469fb	cmake: improve Vulkan cooperative matrix support checks (#2966 ) Co-authored-by: Sandro Hanea <me@sandro.rocks>	2025-03-31 13:44:36 +03:00
Georgi Gerganov	27533e7f63	metal : improve FA + improve MoE (llama/12612) * ggml : FA with different K, V head sizes (CPU) ggml-ci * metal : add FA with HS=192 * metal : extend FA to support different K and V head sizes ggml-ci * metal : add FA vector kernels for heads K 192 and V 128 ggml-ci * ggml : restrict op on other backends to equal head sizes ggml-ci * metal : optimize FA-vec kernel ggml-ci * metal : FA remove mq registers * metal : improve MoE mul_mat_id condition ggml-ci * metal : fix comments + remove unnecessary addition ggml-ci * metal : avoid too much shared memory usage with mul_mat_id ggml-ci	2025-03-28 21:47:42 +02:00
Icenowy Zheng	1b81415963	vulkan: fix coopmat shader generation when cross-compiling (llama/12272) * vulkan: fix coopmat shader generation when cross-compiling Previously the status of coopmat{,2} support isn't passed to the vulkan-shaders-gen project building on the host, which leads to build failure because of the cross-compiling code expecting coopmat{,2} shaders that didn't get generated. Fix this by passing the coopmat{,2} support status to vulkan-shaders subproject. Signed-off-by: Icenowy Zheng <uwu@icenowy.me> * Only call coop-mat shaders once * Fix whitespace --------- Signed-off-by: Icenowy Zheng <uwu@icenowy.me> Co-authored-by: bandoti <141645996+bandoti@users.noreply.github.com>	2025-03-28 21:47:42 +02:00
amritahs-ibm	0001ec075f	llamafile : ppc64le GEMV forwarding for FP32. (llama/12594) This patch enables usage of MMA when one of the dimensions of the matrix(ie either M or N) is 1. This is useful in case of token generation where N < 2. The concept of 'GEMV Forwarding' is used where when one of the matrix has a single row/column, the elements are broadcasted, instead of using packing routine to prepack the matrix elements. This change results in 5% - 15% improvement in total speed(ie all tokens/total time), across various batch sizes. This is in comparision with the corresponding dot product implementation. The patch is tested with FP32 models of Meta-Lllama-3-8B, Mistral-7B, Llama-2-7B-chat-hf on a IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>	2025-03-28 21:47:42 +02:00
Radoslav Gerganov	5bad2e5099	rpc : send hash when tensor data is above some fixed threshold (llama/12496) * rpc : send hash when tensor data is above some fixed threshold ref #10095 * rpc : put cache under $HOME/.cache/llama.cpp * try to fix win32 build * another try to fix win32 build * remove llama as dependency	2025-03-28 21:47:42 +02:00
lhez	6fc0ae2f5a	opencl: add multi and vision rope, `gelu_quick` and `im2col` (llama/12600) * opencl: add `im2col` * opencl: add `gelu_quick` * opencl: add mrope * opencl: add vision rope	2025-03-28 21:47:42 +02:00
Georgi Gerganov	1fbdfb1d36	files : remove old wkv6 (#0 ) ggml-ci	2025-03-27 11:06:03 +02:00
Georgi Gerganov	8ca67df291	ggml : sync/merge cmake,riscv,powerpc, add common.cmake (ggml/0)	2025-03-27 11:06:03 +02:00
amritahs-ibm	fc6d343e76	llamafile : ppc64le MMA implementation for Q4_0. (llama/12489) This change upstreams llamafile's cpu matrix multiplication kernels for ppc64le ISA using MMA builtins. This patch handles matrix multiplication between quantised datatypes, block_q4_0 and block_q8_0. This change results in 5% - 50% improvement in total speed(ie all tokens/total time), across various batch sizes. The patch is tested with Meta-Lllama-3-8B, Mistral-7B, Llama-2-7B-chat-hf models on a IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>	2025-03-27 11:06:03 +02:00
Akarshan Biswas	3199356d3a	SYCL: implement memset ggml backend buffer interface (llama/12580) * SYCL: implement memset ggml backend buffer interface * use GGML_ABORT macro * Do not wait for all queues to finish for memset operation	2025-03-27 11:06:03 +02:00
Slobodan Josic	e0c43b0bbf	HIP: Add support for RDNA4 targets (llama/12372)	2025-03-27 11:06:03 +02:00
Georgi Gerganov	f4f619ea8e	metal : refactor mat-vec code (llama/12569) * metal : refactor mat-vec code ggml-ci * metal : rename all_sum -> sum_all ggml-ci * metal : fix comments [no ci] * metal : fix nr constant [no ci] * metal : mv q6_K support nr0 > 1 ggml-ci * metal : reduce register pressure ggml-ci * metal : fix typo [no ci] * metal : reduce register pressure ggml-ci	2025-03-27 11:06:03 +02:00

1 2 3 4 5 ...

707 Commits