57 Commits

Author SHA1 Message Date
Johannes Gäßler
2ffdda99e8 CUDA: fix logic for clearing padding with -ngl 0 (llama/13320) 2025-05-07 21:00:32 +03:00
Johannes Gäßler
d052e64d42 CUDA: batched+noncont MMQ, refactor bs>1 MoE code (llama/13199) 2025-05-01 13:29:02 +03:00
Johannes Gäßler
1543a3600c CUDA: fix non-cont. inputs for batched mat mul (llama/13155) 2025-05-01 13:29:02 +03:00
Johannes Gäßler
670bf02662 CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (llama/13137) 2025-05-01 13:29:02 +03:00
Johannes Gäßler
3d54b68ea7 CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (llama/13014)
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID

* fix logic for RoPE support, CUDA graphs
2025-04-24 20:39:16 +03:00
Georgi Gerganov
36019c35a3 graph : make FA compatible with MLA + add initial Metal kernels (llama/12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
2025-04-24 20:39:16 +03:00
Alan Gray
4e936e2afa ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (llama/12970) 2025-04-24 20:39:16 +03:00
David Huang
43e3d25d93 CUDA/HIP: Share the same unified memory allocation logic. (llama/12934)
Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
2025-04-24 20:39:16 +03:00
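The commit above swaps a compile-time switch for a runtime one. Below is a minimal sketch of how such a runtime switch can look, assuming the environment variable simply selects `cudaMallocManaged` over `cudaMalloc`; the actual ggml allocation path has more logic than this.

```cpp
// Hedged sketch, not the actual ggml implementation: pick unified (managed) memory
// when GGML_CUDA_ENABLE_UNIFIED_MEMORY is set in the environment. Under HIP the
// cuda* calls map to their hip* equivalents via the usual compatibility macros.
#include <cstdlib>
#include <cuda_runtime.h>

static cudaError_t alloc_device_buffer(void ** ptr, size_t size) {
    if (std::getenv("GGML_CUDA_ENABLE_UNIFIED_MEMORY") != nullptr) {
        // Managed memory: pages migrate between host and device on demand, which is
        // what makes a single binary usable on both integrated and dedicated GPUs.
        return cudaMallocManaged(ptr, size);
    }
    return cudaMalloc(ptr, size);   // default: plain device allocation
}
```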
Alan Gray
5d33d3c929 ggml: disable CUDA graphs for unsupported DUP and CONT node types (llama/12891)
Fixes #12798
2025-04-24 20:39:16 +03:00
Sigbjørn Skjæret
79f23d9132 cuda : add f32 to bf16 copy op (llama/12806)
This allows BF16 KV-cache on CUDA.
2025-04-24 20:39:16 +03:00
Diego Devesa
b9c71fae5a ggml : add bilinear upscale support (ggml/1185) 2025-04-24 20:39:16 +03:00
Alan Gray
d1d847f184 Simplify and improve CUDA graphs through use of indirect copy pointers (llama/9017)
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers

Previously there was complexity in the CUDA graphs implementation due
to frequently changing parameters to the copy kernels associated with K and V
cache pointers. This patch simplifies things by using indirection so that
these parameters no longer change frequently, avoiding the need for frequent
graph updates.

Fixes #12152

* Addressed comments

* fix HIP builds

* properly sync to stream

* removed ggml_cuda_cpy_fn_ptrs

* move stream sync before free

* guard to only use indirection with graphs

* style fixes

* check for errors

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-04-24 20:39:16 +03:00
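A hedged illustration of the indirection idea described in the commit body above: the captured graph's copy kernel takes the address of a fixed device-side slot rather than the destination pointer itself, so swapping the K/V cache address only requires writing to that slot, not a graph update. Kernel and helper names here are illustrative, not the actual ggml code.

```cpp
#include <cuda_runtime.h>

// The kernel parameter is a pointer to a pointer; the real destination is read at
// launch time from device memory, so the captured kernel parameter never changes.
__global__ void copy_kernel_indirect(const float * src, float ** dst_slot, int n) {
    float * dst = *dst_slot;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i];
    }
}

// Before replaying the graph, update only the slot's contents; the graph itself
// (and its kernel parameters) stays valid, avoiding cudaGraphExecUpdate calls.
static void set_copy_dst(float ** dst_slot_dev, float * new_dst, cudaStream_t stream) {
    cudaMemcpyAsync(dst_slot_dev, &new_dst, sizeof(float *), cudaMemcpyHostToDevice, stream);
}
```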
Sigbjørn Skjæret
06ce8f83e6 CUDA: don't convert BF16 weights to FP32 (ggml/1174)
* add bf16 support

* use convert_from_bf16_cuda instead of convert_unary_cuda for f32

* revert 7ec5085

* move functionality into convert_unary with constexpr
2025-04-24 20:39:16 +03:00
a3sh
842b9c984c ggml : faster ssm scan (llama/10558)
* faster ssm_scan

* delete unused comment

* clang format

* add space

* modify unnecessary calculations

* faster ssm conv implementation

* modify file name with dash
2025-04-02 15:51:57 +03:00
Georgi Gerganov
27533e7f63 metal : improve FA + improve MoE (llama/12612)
* ggml : FA with different K, V head sizes (CPU)

ggml-ci

* metal : add FA with HS=192

* metal : extend FA to support different K and V head sizes

ggml-ci

* metal : add FA vector kernels for heads K 192 and V 128

ggml-ci

* ggml : restrict op on other backends to equal head sizes

ggml-ci

* metal : optimize FA-vec kernel

ggml-ci

* metal : FA remove mq registers

* metal : improve MoE mul_mat_id condition

ggml-ci

* metal : fix comments + remove unnecessary addition

ggml-ci

* metal : avoid too much shared memory usage with mul_mat_id

ggml-ci
2025-03-28 21:47:42 +02:00
Slobodan Josic
e0c43b0bbf HIP: Add support for RDNA4 targets (llama/12372) 2025-03-27 11:06:03 +02:00
R0CKSTAR
a219941812 CUDA: Fix clang warnings (llama/12540)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
R0CKSTAR
d487a28ae1 musa: refine compute capability (llama/12493)
* musa: refine compute capability

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Gaurav Garg
ae6a9bb9a5 CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
- Find out the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API, and use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-27 11:06:03 +02:00
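The occupancy-driven choice described in the bullets above can be sketched as follows; `flash_decode_vec` is an illustrative stand-in for the real kernel, and the "one full wave of resident blocks" heuristic is an assumption rather than the exact ggml logic.

```cpp
#include <cuda_runtime.h>

__global__ void flash_decode_vec() { /* illustrative placeholder kernel */ }

int pick_parallel_blocks(int device, int block_size, size_t dyn_smem_bytes) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, flash_decode_vec, block_size, dyn_smem_bytes);

    // Enough blocks to fill every SM once; with batch size 1 this is what keeps the
    // whole GPU busy instead of only a handful of SMs.
    return blocks_per_sm * prop.multiProcessorCount;
}
```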
R0CKSTAR
31b62276cf musa: override warp_size of musa device to 32 (llama/12445)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Molly Sophia
52c4c03b0a llama: Add support for RWKV v7 architecture (llama/12412)
* ggml: Add op l2_norm

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: Add op rwkv_wkv7

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: Add support for RWKV7 and ARWKV7 models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix inference with RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: add more (a)rwkv7 variants in size

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Apply code-format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* fix MUSA build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-03-27 11:06:03 +02:00
Gaurav Garg
cfc2560e41 cuda : enable CUDA Graph on CUDA Toolkit < 12.x (llama/12394)
* Enable CUDA Graph on CTK < 12.x

The `cudaGraphExecUpdate` API changed in CUDA Toolkit 12.x; for this reason CUDA graph support was disabled on older toolkits. This change enables CUDA graph support on CTK < 12.x by using the older API when building against those versions.

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA
2025-03-27 11:06:03 +02:00
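A hedged sketch of how the two `cudaGraphExecUpdate` signatures can be bridged with a `CUDART_VERSION` check, as the commit describes; the wrapper name and error handling are assumptions, only the two API signatures are taken from the CUDA runtime headers.

```cpp
#include <cuda_runtime.h>

static bool try_update_graph_exec(cudaGraphExec_t exec, cudaGraph_t graph) {
#if CUDART_VERSION >= 12000
    // CUDA 12.x signature: result information is returned in a struct.
    cudaGraphExecUpdateResultInfo info;
    return cudaGraphExecUpdate(exec, graph, &info) == cudaSuccess;
#else
    // Pre-12.0 signature: the failing node and an update-result enum are returned.
    cudaGraphNode_t error_node = nullptr;
    cudaGraphExecUpdateResult result;
    return cudaGraphExecUpdate(exec, graph, &error_node, &result) == cudaSuccess;
#endif
}
```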
uvos
b9eab73fa2 HIP/CUDA: set the parameter value in maintain_cuda_graph instead of replacing it. (llama/12209)
This avoids conflicts with the internal CUDA/HIP runtime memory management behavior.
2025-03-08 15:13:01 +02:00
William Tambellini
c98681e6d5 ggml : upgrade init_tensor API to return a ggml_status (llama/11854)
* Upgrade init_tensor API to return a ggml_status

To prepare for an 'abort-free' ggml
(ggml should not abort on OOM but instead return an OOM status),
as agreed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.

* misc fixes

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-03-08 15:13:01 +02:00
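What the change means for a caller, as a hedged sketch: tensor initialization now reports failure through `ggml_status` instead of aborting inside the library. Only the `ggml_status` values come from ggml.h; the helper below is made up for illustration.

```cpp
#include "ggml.h"
#include <cstdio>

static bool check_tensor_init(enum ggml_status st) {
    if (st != GGML_STATUS_SUCCESS) {
        // e.g. GGML_STATUS_ALLOC_FAILED when the backend ran out of memory;
        // the caller can now recover or report instead of the process aborting.
        fprintf(stderr, "tensor init failed with ggml_status %d\n", (int) st);
        return false;
    }
    return true;
}
```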
cmdr2
6ac8e6b2ce cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)
* cuda: restrict SILU_BACK to fp32, since fp16 exceeds the desired test threshold

* vulkan: specify fp32-only support for certain ops (that are now tested for fp16 as well)

* f32 sigmoid in vulkan supports op

* Revert "f32 sigmoid in vulkan supports op"

This reverts commit c6f04b3c19bf4504c2776149c6d8cd84e0b48acb.
2025-03-08 15:13:01 +02:00
cmdr2
60d2ddebdf cuda/cpu: Increase support for fp16 unary operations (ggml/1125)
* Support fp16 unary operations in the CUDA backend

* cpu: increase fp16 support for unary operators in the CPU backend

* cuda: increase fp16 support for unary operators in the CUDA backend

* Add test cases for fp16 unary operators

* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing

* metal: fix PR comments for unary op support after fp16 unary tests
2025-03-08 15:13:01 +02:00
Johannes Gäßler
38ac47cd4d CUDA: add option to compile without FlashAttention (llama/12025) 2025-02-27 08:55:36 +02:00
Gian-Carlo Pascutto
98dab49b9a cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (llama/12000) 2025-02-27 08:55:36 +02:00
Bodhi
48f5e893f5 MUSA: support ARM64 and enable dp4a, etc. (llama/11843)
* MUSA: support ARM64 and enable __dp4a, etc.

* fix cross entropy loss op for musa

* update

* add cc info log for musa

* add comment for the MUSA .cc calculation block

---------

Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>
2025-02-27 08:55:36 +02:00
R0CKSTAR
4e07957bf9 musa: bump MUSA SDK version to rc3.1.1 (llama/11822)
* musa: Update MUSA SDK version to rc3.1.1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: Remove workaround in PR #10042

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-02-27 08:55:36 +02:00
uvos
86729fcd6d HIP: Switch to std::vector in rocblas version check (llama/11820) 2025-02-27 08:55:36 +02:00
Johannes Gäßler
556f773d53 CUDA: fix CUDART_VERSION checks (llama/11821) 2025-02-27 08:55:36 +02:00
Johannes Gäßler
1b67d72f87 CUDA: use arch list for compatibility check (llama/11775)
* CUDA: use arch list for feature availability check

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-27 08:55:36 +02:00
Johannes Gäßler
01c9aafbfd CUDA: support for mat. mul. with ne03 != ne13 (llama/11656) 2025-02-27 08:55:36 +02:00
Johannes Gäßler
bae6bbf487 CUDA: non-contiguous (RMS) norm support (llama/11659)
* CUDA: non-contiguous (RMS) norm support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-27 08:55:36 +02:00
uvos
c49ee07ff4 HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing cc architectures for AMD GPUs are not supersets of each other (llama/11601)
This fixes a bug where RDNA1 GPUs other than gfx1010 were not handled correctly.
2025-02-03 22:00:57 +02:00
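An illustrative-only sketch of the kind of family-range check the commit adds: because a newer AMD architecture is not a feature superset of an older one, each family is tested with a [begin, next_family) range rather than a single >= threshold comparison. Every name and value below is an assumption, not the actual GGML_CUDA_CC_* constants.

```cpp
// Assumed, made-up encodings purely to show the shape of the range checks.
#define EX_CC_RDNA1 1010
#define EX_CC_RDNA2 1030
#define EX_CC_RDNA3 1100

#define EX_CC_IS_RDNA1(cc) ((cc) >= EX_CC_RDNA1 && (cc) < EX_CC_RDNA2)
#define EX_CC_IS_RDNA2(cc) ((cc) >= EX_CC_RDNA2 && (cc) < EX_CC_RDNA3)
```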
uvos
fc2e44490d HIP: Prepare reduction operators for wave 64 2025-02-03 22:00:57 +02:00
uvos
f41fdad200 CUDA/HIP: add warp_size to cuda_device_info 2025-02-03 22:00:57 +02:00
Nikita Sarychev
115716d109 HIP: Only call rocblas_initialize on rocblas versions with the multiple instantiation bug (llama/11080)
This disables the workaround on fixed rocBLAS versions (>= 4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all Tensile objects.
2025-02-03 22:00:57 +02:00
Haus1
028511d349 AMD: parse the architecture as supplied by gcnArchName (llama/11244)
The value provided by `minor` doesn't include the stepping for AMD; parse the value returned by gcnArchName instead to retrieve an accurate ID.
2025-02-03 22:00:57 +02:00
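A hedged sketch of reading the architecture from `gcnArchName` rather than from `prop.minor`, per the commit above; the helper is illustrative and the real ggml parsing differs in detail.

```cpp
#include <hip/hip_runtime.h>
#include <string>

// gcnArchName looks like "gfx90a:sramecc+:xnack-"; the token before the first ':'
// identifies the architecture including the stepping (e.g. gfx90a vs. gfx900),
// which prop.minor alone does not capture.
std::string amd_arch_name(int device) {
    hipDeviceProp_t prop{};
    hipGetDeviceProperties(&prop, device);
    std::string name = prop.gcnArchName;
    return name.substr(0, name.find(':'));
}
```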
uvos
a160fa0f3a HIP: disable VMM on HIP as it seems that it doesn't work in some configurations (llama/11420) 2025-02-03 22:00:57 +02:00
uvos
0282ad8fd1 hip : Add hipGraph and VMM support to ROCM (llama/11362)
* Add hipGraph support

* Enable VMM on rocm
2025-02-03 22:00:57 +02:00
Johannes Gäßler
9e467815d4 CUDA: fix FP16 cuBLAS GEMM (llama/11396) 2025-02-03 22:00:57 +02:00
uvos
727891d9bf rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (llama/11356) 2025-02-03 22:00:57 +02:00
Johannes Gäßler
c262dc80e2 CPU/CUDA: fix (GQA) mul mat back, add CUDA support (llama/11380) 2025-02-03 22:00:57 +02:00
Johannes Gäßler
de49024e49 CUDA: backwards pass for misc. ops, add tests (llama/11257)
* CUDA: backwards pass for misc. ops, add tests

* remove restrict from pointers
2025-02-03 22:00:57 +02:00
Johannes Gäßler
54a2ee648f RoPE: fix back, CUDA support for back + noncont. (llama/11240)
* RoPE: fix back, CUDA support for back + noncont.

* fix comments reg. non-cont. RoPE support [no-ci]
2025-02-03 22:00:57 +02:00
Andreas Kieslinger
2425caf4fd cuda : CUDA Graph Compute Function Refactor (precursor for performance improvements) (llama/11042)
* Refactor: Moves cuda graph executable update step to separate function.

* Refactor: Moves cuda graph update check to separate function.

* Refactor: Moves cuda graph maintenance (update or adjusting copy parameters) to separate function for improved readability.

* Fix: Adds missing reference to maintain_cuda_graph() definition.

* Refactor: Improves structure and abstractions by moving CUDA graph evaluation and capture to its own function.

* Refactor: Moves node graph checks and copy ops into individual function for improved readability.

* Refactor: Removes code permanently excluded from compilation to increase readability.

* Style: Adds missing newline

* Style: Consolidates several neighboring '#ifdef USE_CUDA_GRAPH' into a single one

* Refactor: Makes 'cuda_graph_update_required' a local variable

* remove double lines between functions

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-01-14 10:38:01 +02:00
Molly Sophia
06209f6683 llama: add support for QRWKV6 model architecture (llama/11001)
* WIP: Add support for RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Some graph simplification

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add support for RWKV6Qwen2 with cpu and cuda GLA

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix some typos

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix wkv test & add gla test

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix cuda warning

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update README.md

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update ggml/src/ggml-cuda/gla.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix fused lerp weights loading with RWKV6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* better sanity check skipping for QRWKV6 in llama-quant

thanks @compilade

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-14 10:38:01 +02:00
Johannes Gäßler
341f5c28e6 CUDA: add BF16 support (llama/11093)
* CUDA: add BF16 support
2025-01-14 10:38:01 +02:00