whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-04-27 22:39:52 +00:00

Author	SHA1	Message	Date
Jeff Bolz	76231bda56	vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (llama/12630) There seems to be a bubble waking up from waitForFences, which costs a few percent performance and also increased variance in performance. This change inserts an "almost_ready" fence when the graph is about 80% complete and we waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting for the final fence to be signaled.	2025-04-24 20:39:16 +03:00
Jeff Bolz	b243416918	vulkan: Implement split_k for coopmat2 flash attention. (llama/12627) When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.	2025-04-24 20:39:16 +03:00
Jeff Bolz	2105b110d3	vulkan: Implement grouped query attention in the coopmat2 FA shader (llama/12559) When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when: dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1)) previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each. This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.	2025-04-24 20:39:16 +03:00
Wagner Bruna	801d6bd809	vulkan: fix build when glslc doesn't support coopmat (llama/12683)	2025-04-02 15:51:57 +03:00
0cc4m	0810f02547	Vulkan: Add DP4A MMQ and Q8_1 quantization shader (llama/12135) * Vulkan: Add DP4A MMQ and Q8_1 quantization shader * Add q4_0 x q8_1 matrix matrix multiplication support * Vulkan: Add int8 coopmat MMQ support * Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code * Add GL_EXT_integer_dot_product check * Remove ggml changes, fix mmq pipeline picker * Remove ggml changes, restore Intel coopmat behaviour * Fix glsl compile attempt when integer vec dot is not supported * Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq * Remove redundant comment * Fix integer dot check * Fix compile issue with unsupported int dot glslc * Update Windows build Vulkan SDK version	2025-04-02 15:51:57 +03:00
Georgi Gerganov	27533e7f63	metal : improve FA + improve MoE (llama/12612) * ggml : FA with different K, V head sizes (CPU) ggml-ci * metal : add FA with HS=192 * metal : extend FA to support different K and V head sizes ggml-ci * metal : add FA vector kernels for heads K 192 and V 128 ggml-ci * ggml : restrict op on other backends to equal head sizes ggml-ci * metal : optimize FA-vec kernel ggml-ci * metal : FA remove mq registers * metal : improve MoE mul_mat_id condition ggml-ci * metal : fix comments + remove unnecessary addition ggml-ci * metal : avoid too much shared memory usage with mul_mat_id ggml-ci	2025-03-28 21:47:42 +02:00
Jeff Bolz	cbb88c4050	vulkan: Optimize mul_mat_vec p021 and nc shaders (llama/12505) * tests: add mul_mat perf/functional tests for p021/nc vulkan shaders * vulkan: Optimize mul_mat_vec p021 and nc shaders. These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches). Using subgroupAdd in the p021 shader also helps, use that conditionally.	2025-03-27 11:06:03 +02:00
stduhpf	13455c0b5f	Vulkan: RTE rounding for cpy to quant (llama/12480) * Vulkan: RTE rounding for cpy to quant Co-Authored-By: Jeff Bolz <jbolz@nvidia.com> * remove trailing whitespace * avoid duplicating pipeline_cpy_f32_quant * fix copypasting issue * remove duplicated code --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-03-27 11:06:03 +02:00
Jeff Bolz	102af79f63	vulkan: Submit once enough matmul work has been recorded (llama/12406) I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.	2025-03-27 11:06:03 +02:00
0cc4m	fa72479cfb	Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (llama/12434)	2025-03-27 11:06:03 +02:00
Molly Sophia	52c4c03b0a	llama: Add support for RWKV v7 architecture (llama/12412) * ggml: Add op l2_norm Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add op rwkv_wkv7 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: Add support for RWKV7 and ARWKV7 models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix inference with RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: add more (a)rwkv7 variants in size Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code-format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * fix MUSA build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix shape error with rwkv using llama-parallel Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-03-27 11:06:03 +02:00
Jeff Bolz	b3f3779c1b	vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (llama/12312)	2025-03-27 11:06:03 +02:00
Daniele	13eeebb1b2	vulkan: subgroup size tuning (llama/12087) * vulkan: subgroup size test * Vulkan: Add device architecture enum and logic to recognize AMD generations * vulkan: use new architecture logic to specify subgroup size * Initial vulkan subgroup size tuning for RDNA3 * vulkan: commonize RDNA subgroup tuning * vulkan: override subgroup size if required_subgroup_size = 0 * vulkan: disable warp 32 for RDNA3 * vulkan: fine tuned RDNA1 subgroup sizes * vulkan: adjusted subgroup size map * vulkan: fixed RDNA2 subgroup map --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-03-27 11:06:03 +02:00
Jeff Bolz	2cd3061a23	vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (llama/12273) * vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking	2025-03-27 11:06:03 +02:00
Jeff Bolz	88d59e21b2	vulkan: Adjust coopmat2 tile sizes and selection heuristic (llama/12258)	2025-03-27 11:06:03 +02:00
Georgi Gerganov	54a54faee4	vulkan : sync (llama/0) ggml-ci	2025-03-08 15:13:01 +02:00
William Tambellini	c98681e6d5	ggml : upgrade init_tensor API to return a ggml_status (llama/11854) * Upgrade init_tensor API to return a ggml_status To prepare for an 'abort-free' ggml (ggml not to abort on OOMs but return a OOM status), as agreeed with Diego in the ggml repo, upgrade the init_tensor() and view_init() APIs to return a ggml_status. * misc fixes --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-03-08 15:13:01 +02:00
Rémy O	3bab804981	vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (llama/11595) * vulkan: implement specialized MMV kernels for IQ2 quantizations * vulkan: add MMV kernels for IQ3 quants * vulkan: Increase MMV batch size and unroll IQ LUT setup * vulkan: fix init_iq_shmem for WG sizes larger than tables * vulkan: common batch size for all I-quants	2025-03-08 15:13:01 +02:00
Jeff Bolz	a0f76b2da7	vulkan: fix assertion when qy_needs_dequant (llama/12068) Looks like a copy/paste bug from qx_needs_dequant.	2025-03-08 15:13:01 +02:00
cmdr2	6ac8e6b2ce	cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129) * cuda: restrict SILU_BACK to fp32, since fp16 exceeds the desired test threshold * vulkan: specify fp32-only support for certain ops (that are now tested for fp16 as well) * f32 sigmoid in vulkan supports op * Revert "f32 sigmoid in vulkan supports op" This reverts commit c6f04b3c19bf4504c2776149c6d8cd84e0b48acb.	2025-03-08 15:13:01 +02:00
Rémy O	37a21dd43d	vulkan: implement several ops relevant for ggml_opt (llama/11769) * vulkan: support memset_tensor * vulkan: support GGML_OP_SUM * vulkan: implement GGML_OP_ARGMAX * vulkan: implement GGML_OP_SUB * vulkan: implement GGML_OP_COUNT_EQUAL * vulkan: implement GGML_OP_OPT_STEP_ADAMW * vulkan: fix check_results RWKV_WKV6 crash and memory leaks * vulkan: implement GGML_OP_REPEAT_BACK * tests: remove invalid test-backend-ops REPEAT_BACK tests * vulkan: fix COUNT_EQUAL memset using a fillBuffer command	2025-02-27 08:55:36 +02:00
Jeff Bolz	8a22a8b17f	vulkan: support multi/vision rope, and noncontiguous rope (llama/11902)	2025-02-27 08:55:36 +02:00
Rémy O	1689aaf854	vulkan: initial support for IQ1_S and IQ1_M quantizations (llama/11528) * vulkan: initial support for IQ1_S and IQ1_M quantizations * vulkan: define MMV kernels for IQ1 quantizations * devops: increase timeout of Vulkan tests again * vulkan: simplify ifdef for init_iq_shmem	2025-02-27 08:55:36 +02:00
Eve	e22d69839d	vulkan: linux builds + small subgroup size fixes (llama/11767) * mm subgroup size * upload vulkan x86 builds	2025-02-27 08:55:36 +02:00
Danny Milosavljevic	db6e19188a	vulkan: Make Vulkan optional at runtime (ggml/11493). (llama/11494) Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-02-27 08:55:36 +02:00
Wagner Bruna	b4b063a5c9	vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (llama/11592)	2025-02-27 08:55:36 +02:00
Jeff Bolz	930b739e7a	vulkan: account for lookup tables when checking shared memory size (llama/11502)	2025-02-27 08:55:36 +02:00
Jeff Bolz	be83f342fb	vulkan: print shared memory size (llama/11719)	2025-02-27 08:55:36 +02:00
Rémy O	6f08b24146	vulkan: initial support for IQ4_XS quantization (llama/11501)	2025-02-27 08:55:36 +02:00
Jeff Bolz	7c165d7fa8	vulkan: use smaller combined allocations to avoid fragmentation (llama/11551)	2025-02-27 08:55:36 +02:00
Johannes Gäßler	bae6bbf487	CUDA: non-contiguous (RMS) norm support (llama/11659) * CUDA: non-contiguous (RMS) norm support --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-27 08:55:36 +02:00
Rémy Oudompheng	80fa576254	vulkan: implement initial support for IQ2 and IQ3 quantizations (llama/11360) * vulkan: initial support for IQ3_S * vulkan: initial support for IQ3_XXS * vulkan: initial support for IQ2_XXS * vulkan: initial support for IQ2_XS * vulkan: optimize Q3_K by removing branches * vulkan: implement dequantize variants for coopmat2 * vulkan: initial support for IQ2_S * vulkan: vertically realign code * port failing dequant callbacks from mul_mm * Fix array length mismatches * vulkan: avoid using workgroup size before it is referenced * tests: increase timeout for Vulkan llvmpipe backend --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-02-03 22:00:57 +02:00
Jeff Bolz	75e7d0585e	vulkan: Catch pipeline creation failure and print an error message (llama/11436) * vulkan: Catch pipeline creation failure and print an error message Also, fix some warnings from my on-demand compile change. * vulkan: fix pipeline creation logging	2025-02-03 22:00:57 +02:00
Jeff Bolz	7230a6e1c8	vulkan: compile shaders on-demand (llama/11406) Reduce first-run startup time and memory consumption. Should fix #11339.	2025-02-03 22:00:57 +02:00
amd-dwang	16eeb31933	Vulkan-run-test: fix mmq_wg_denoms (llama/11343) There should be a copy-and-paste error here. mmq_wg_denoms should be used together with warptile_mmq, instead of wg_denoms.	2025-02-03 22:00:57 +02:00
Jeff Bolz	3736706139	vulkan: fix diag_mask_inf (llama/11323) With robustbufferaccess disabled, this shader was showing OOB stores. There is a bounds check in the code, but the workgrouop dimensions were reversed vs CUDA and it was running the wrong number of threads. So fix the workgroup dimensions and disable robustness for this pipeline.	2025-02-03 22:00:57 +02:00
Jeff Bolz	0dcada42d4	vulkan: fix coopmat2 validation failures (llama/11284) mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert and this is not in the inner loop and is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16. coopmat2 requires SPIR-V 1.6 (related using to LocalSizeId). LocalSizeId requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.	2025-02-03 22:00:57 +02:00
Jeff Bolz	668306ff2b	vulkan: fix coopmat2 flash attention for non-contiguous inputs (llama/11281) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes #11268.	2025-02-03 22:00:57 +02:00
Jeff Bolz	7183a1eb72	vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (llama/11166) * vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl Shaders are based on cpy.cu. * vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32 * ggml: copy q->f32 assumes some contiguity in the destination	2025-02-03 22:00:57 +02:00
0cc4m	cdb8aa2f2e	Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (llama/11161) * Vulkan: Remove float16 use in shaders * Fix validation error about subgroup_size_control extension	2025-01-14 10:38:01 +02:00
Molly Sophia	06209f6683	llama: add support for QRWKV6 model architecture (llama/11001) llama: add support for QRWKV6 model architecture (llama/11001) * WIP: Add support for RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Some graph simplification Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add support for RWKV6Qwen2 with cpu and cuda GLA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix some typos Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix wkv test & add gla test Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix cuda warning Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update README.md Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update ggml/src/ggml-cuda/gla.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix fused lerp weights loading with RWKV6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * better sanity check skipping for QRWKV6 in llama-quant thanks @compilade Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: compilade <git@compilade.net> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2025-01-14 10:38:01 +02:00
Mathieu Baudier	b08c3a88c8	Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (llama/11117) * Disable GL_KHR_cooperative_matrix Vulkan extension if not available. * Perform Vulkan extensions checks in a more sensible order * Remove unnecessary #ifdef directive	2025-01-14 10:38:01 +02:00
0cc4m	5377099524	Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (llama/11074) * Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver * Add (TM) to AMD name check	2025-01-14 10:38:01 +02:00
Jeff Bolz	cea5f1c52f	vulkan: optimize mul_mat for small values of N (llama/10991) Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better. Share some code for reducing the result values to memory in mul_mat_vec_base.	2025-01-04 10:45:01 +02:00
Jeff Bolz	2112462db4	vulkan: im2col and matmul optimizations for stable diffusion (llama/10942) * tests: Add im2col perf tests * vulkan: optimize im2col, more elements per thread * vulkan: increase small tile size for NV_coopmat2 * vulkan: change im2col to 512 elements per workgroup	2025-01-04 10:45:01 +02:00
Jeff Bolz	fc84ecd445	vulkan: Use push constant offset to handle misaligned descriptors (llama/10987)	2025-01-04 10:45:01 +02:00
Eve	8de1e99907	vulkan: multi-row k quants (llama/10846) * multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default	2025-01-04 10:45:01 +02:00
Jeff Bolz	a4bb983190	vulkan: build fixes for 32b (llama/10927) * vulkan: build fixes for 32b Should fix #10923 * vulkan: initialize some buffer/offset variables	2025-01-04 10:45:01 +02:00
Eve	d420a759c5	vulkan: bugfixes for small subgroup size systems + llvmpipe test (llama/10809) * ensure mul mat shaders work on systems with subgroup size less than 32 more fixes add test * only s_warptile_mmq needs to be run with 32 threads or more	2024-12-18 12:52:16 +02:00
Zhiyuan Li	a1ab9b5e91	rwkv6: add wkv6 support for Vulkan backend (llama/10829) * rwkv_wkv6 vulkan shader * RWKV_WKV6 Vulkan op tests passed Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * add [[unroll]] and remove unnecessary conditions * add uma support * fix erros in EditorConfig Checker --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Molly Sophia <mollysophia379@gmail.com>	2024-12-18 12:52:16 +02:00

1 2

73 Commits