whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-06-22 08:30:07 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	9df53b357e	ggml : sync remnants (skip) (#0)	2024-12-08 22:48:25 +02:00
uvos	230e985633	Add some minimal optimizations for CDNA (llama/10498) * Add some minimal optimizations for CDNA * ggml_cuda: set launch bounds also for GCN as it helps there too	2024-12-08 20:14:35 +02:00
Georgi Gerganov	48f421de23	cmake : enable warnings in llama (llama/10474) * cmake : enable warnings in llama ggml-ci * cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS * cmake : get_flags -> ggml_get_flags * speculative-simple : fix warnings * cmake : reuse ggml_get_flags ggml-ci * speculative-simple : fix compile warning ggml-ci	2024-12-08 20:14:35 +02:00
Diego Devesa	77e3e4a090	ggml : add support for dynamic loading of backends (llama/10469) * ggml : add support for dynamic loading of backends --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-08 20:14:35 +02:00
Diego Devesa	2a4b5c9d7e	cuda : optimize argmax (llama/10441) * cuda : optimize argmax * remove unused parameter ggml-ci * fixup : use full warps ggml-ci * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * fix ub * ggml : check ne00 <= INT32_MAX in argmax and argsort --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-12-08 20:14:35 +02:00
mahorozte	4af9626702	CUDA: remove unnecessary warp reduce in FA (ggml/1032) * kqmax_new_j is already the same in every thread of the warp after the operation at line 199, so this reduce can be omitted * same problem in vec32 --------- Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>	2024-12-08 20:14:35 +02:00
Georgi Gerganov	f4c1d7df39	ggml : sync resolve (skip) (#0)	2024-11-20 21:00:08 +02:00
Diego Devesa	5f6d6919b4	cuda : fix CUDA_FLAGS not being applied (llama/10403)	2024-11-20 21:00:08 +02:00
Diego Devesa	7ac2f17fac	cuda : only use native when supported by cmake (llama/10389)	2024-11-20 21:00:08 +02:00
Johannes Gäßler	161b443514	CUDA: fix MMV kernel being used for FP16 src1 (llama/10357)	2024-11-20 21:00:08 +02:00
Johannes Gäßler	ef7fbe1c66	CMake: fix typo in comment [no ci] (llama/10360)	2024-11-20 21:00:08 +02:00
Johannes Gäßler	dcb2922d1d	CUDA: remove DMMV, consolidate F16 mult mat vec (llama/10318)	2024-11-20 21:00:08 +02:00
Johannes Gäßler	3c5c751174	CMake: default to -arch=native for CUDA build (llama/10320)	2024-11-20 21:00:08 +02:00
Johannes Gäßler	c9541741e6	ggml: new optimization interface (ggml/988) * ggml: new optimization interface remove test2.c, test3.c store adamw params in tensor move grads from tensor to graph * avoid segfault upon API misuse * add ggml-opt.h to public headers * remove dependence of ggml-opt.cpp on ggml-cpu.h	2024-11-20 21:00:08 +02:00
Diego Devesa	746bf2596f	ggml : build backends as libraries (llama/10256) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-20 21:00:08 +02:00
SXX	b890243690	ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213)	2024-11-15 15:21:04 +02:00
Georgi Gerganov	d0b8335789	metal : optimize FA kernels (llama/10171) * ggml : add ggml_flash_attn_ext_get_prec * metal : use F16 precision in FA kernels ggml-ci * metal : minor clean-up * metal : compile-guard bf16 FA kernels ggml-ci * build : remove obsolete compile flag [no ci] * metal : prevent int overflows [no ci] * cuda : disable BF16 FA ggml-ci * metal : fix BF16 requirement for FA kernels ggml-ci * make : clean-up [no ci]	2024-11-15 15:21:04 +02:00
Zhiyuan Li	42398f13b0	Optimize RWKV6 Operator Naming and Implement Multi-core CPU/SYCL Acceleration (llama/10133) * rwkv6: rename to wkv6 * rwkv6: support avx2 avx512 armv8 armv9 * rwkv6: update cuda file name * rwkv6: rename params * wkv on sycl * sycl: add some ops * sycl: Enhance OP support judgment * wkv6: drop armv9 and transfer to GGML style ggml-ci * sync : ggml * update the function to use appropriate types * fix define error * Update ggml/src/ggml-cpu.c * add appropriate asserts * move element-wise functions outside * put the declaration outside the loop * rewrite to be more in line with the common pattern for distributing threads * use recommended way GGML_TENSOR_LOCALS --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Plamen Minev <pacominev@gmail.com> Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com> Co-authored-by: Meng, Hengyu <airdldl@163.com>	2024-11-15 15:21:04 +02:00
Johannes Gäßler	ab0385f43b	CUDA: fix MMQ for non-contiguous src0, add tests (llama/10021) * CUDA: fix MMQ for non-contiguous src0, add tests * revise test code	2024-11-01 10:19:05 +02:00
bssrdf	10eb603a3c	increase cuda_cpy block size (ggml/996) Co-authored-by: bssrdf <bssrdf@gmail.com>	2024-11-01 10:19:05 +02:00
Johannes Gäßler	84713613be	CUDA: fix 1D im2col, add tests (ggml/993)	2024-11-01 10:19:05 +02:00
agray3	042e95d92f	Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816) * Vectorize load instructions in dmmv f16 CUDA kernel Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup. * addressed comment * Update ggml/src/ggml-cuda/dmmv.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-11-01 10:19:05 +02:00
Johannes Gäßler	936cf3beb7	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-05 15:23:51 +03:00
Johannes Gäßler	bb57ecb85e	CUDA: remove bad assert (ggml/972)	2024-10-03 12:22:17 +03:00
Ivan	2fc1d20f9e	cuda: add q8_0->f32 cpy operation (llama/9571) llama: enable K-shift for quantized KV cache It will fail on unsupported backends or quant types.	2024-09-24 19:45:08 +03:00
R0CKSTAR	13f41af43e	musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (llama/9526) * mtgpu: add mp_21 support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable unified memory Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-24 19:45:08 +03:00
Johannes Gäßler	adf2474b10	CUDA: enable Gemma FA for HIP/Pascal (llama/9581)	2024-09-24 19:45:08 +03:00
Molly Sophia	008816a257	RWKV v6: RWKV_WKV op CUDA implementation (llama/9454) * ggml: CUDA unary op EXP Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: rwkv_wkv op CUDA impl Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-24 19:45:08 +03:00
agray3	f0a7d65b3d	Update CUDA graph on scale change plus clear nodes/params (llama/9550) * Avoid using saved CUDA graph if scale changes and reset nodes/params on update Fixes https://github.com/ggerganov/llama.cpp/issues/9451 * clear before resize	2024-09-24 19:45:08 +03:00
Georgi Gerganov	d245d7aec7	ggml : fix builds (llama/0) ggml-ci	2024-09-24 19:45:08 +03:00
Johannes Gäßler	c0761c95f5	CUDA: fix sum.cu compilation for CUDA < 11.7 (llama/9562)	2024-09-24 19:45:08 +03:00
Johannes Gäßler	a53b69a003	CUDA: fix --split-mode row race condition (llama/9413)	2024-09-24 19:45:08 +03:00
R0CKSTAR	d1c9b47360	musa: remove Clang builtins mapping (llama/9421) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-24 19:45:08 +03:00
Johannes Gäßler	67725ac8f3	CUDA: fix variable name conflict for Windows build (llama/9382)	2024-09-24 19:45:08 +03:00
Georgi Gerganov	26225f1fb0	cuda : fix FA Q src index (1 -> 0) (llama/9374)	2024-09-24 19:45:08 +03:00
Johannes Gäßler	c7515b0995	ggml/examples: add backend support for numerical optimization (ggml/949) * CUDA eval works * stochastic gradient descent op * Adam except decay * CUDA CROSS_ENTROPY_LOSS_BACK * CUDA mnist-fc training works * backend CLI arg * refactor gguf load * remove sched from opt_step_adam * implement l1 regularization (weight decay) * extra call to add optimizer * initialize gradients with ggml_graph_reset * gradient accumulation * increment iter per eval instead of epoch * adjust backend interfaces * fix ggml_graph_reset without backend * fix ggml graph export/import * fixup * rename * revert ggml_opt changes * more general CUDA repeat_back * update documentation, fix CNN * validation split * add clarifying comment * optimize PyTorch training * adjust buffer size, thread count * fix 0.0f validation split * Update examples/mnist/mnist-common.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix gradient accumulation * tensor flag for accumulators -> tensor hash set * Update include/ggml.h Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * fix test prints * Update src/ggml-backend.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * better CUDA support for noncontiguous out_prod * add comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-09-24 19:45:08 +03:00
slaren	709a22b92d	cuda : fix defrag with quantized KV (llama/9319)	2024-09-24 19:45:08 +03:00
Johannes Gäßler	5d6dc19f04	tests: add gradient tests for all backends (ggml/932) * tests: add gradient checking to test-backend-ops * remove old comment * reorder includes * adjust SIN/COS parameters * add documentation, use supports_op if possible	2024-09-24 19:45:08 +03:00
Johannes Gäßler	24d8534bd8	CPU/CUDA: Gemma 2 FlashAttention support (llama/8542) * CPU/CUDA: Gemma 2 FlashAttention support * apply logit_softcap to scale in kernel * disable logit softcapping tests on Metal * remove metal check	2024-08-28 13:22:20 +03:00
Daniel Bevenius	60098d6204	ggml : move rope type enum to ggml.h (llama/8949) * ggml : move rope type enum to ggml.h This commit moves the `llama_rope_type` enum from `llama.h` to `ggml.h` and changes its name to `ggml_rope_type`. The motivation for this change is to address the TODO in `llama.h` and use the enum in ggml. Note: This commit does not change the `mode` parameter to be of type `enum ggml_rope_type`. The name `mode` and its usage suggest that it might be more generic and possibly used as a bit field for multiple flags. Further investigation/discussion may be needed to determine if `mode` should be restricted to RoPE types. * squash! ggml : move rope type enum to ggml.h This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from ggml.h, and back the llama_rope_type enum. I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is safe to remove it yet. * squash! ggml : move rope type enum to ggml.h This commit removes the enum ggml_rope_type from ggml.h and replaces it with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has been updated to reflect this change. * squash! ggml : move rope type enum to ggml.h This commit contains a suggestion to enable the GGML_ROPE_TYPE_NEOX macro/define to be passed to the shader compiler. * squash! ggml : move rope type enum to ggml.h This commit fixes the editorconfig-checker warnings. * squash! ggml : move rope type enum to ggml.h Update comment for ggml_rope function. * Revert "squash! ggml : move rope type enum to ggml.h" This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6. * squash! ggml : move rope type enum to ggml.h Add GGML_ROPE_TYPE_NEOX to rope_common.comp. * remove extra line --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-28 13:22:20 +03:00
Johannes Gäßler	8954769aa2	feat: ref. cross entropy, add CUDA, fix grad test (ggml/929)	2024-08-28 13:22:20 +03:00
Radoslav Gerganov	b6c05ce82f	yolo : add backend support (ggml/924) * yolo : add backend support * metal : add sub and sqrt kernels --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-21 11:07:13 +03:00
Ronsor	3643120690	feat: add new `sin` and `cos` operators (ggml/919) * ggml : add sin/cos operators * ggml-cuda : add sin/cos operators * ggml : add corresponding tests for sin/cos * ggml : add backward computation for sin/cos operators * ggml-vulkan : add sin/cos operators * ggml-vulkan : add sin/cos shader source * metal : add sin, cos --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-21 11:07:13 +03:00
Molly Sophia	4160b930f1	ggml : add epsilon as a parameter for group_norm (llama/8818) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-08-08 22:48:46 +03:00
slaren	5218ea21b8	cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (llama/8800) cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X update asserts * only use dmmv for supported types * add test	2024-08-08 22:48:46 +03:00
R0CKSTAR	49ac8872b4	cuda : organize vendor-specific headers into vendors directory (llama/8746) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-08-08 22:48:46 +03:00
R0CKSTAR	e471adcfa5	feat: Support Moore Threads GPU (llama/8383) * Update doc for MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in Makefile Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in CMake Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * CUDA => MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * MUSA adds support for __vsubss4 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix CI build failure Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-08-08 22:48:46 +03:00
slaren	dd916a2852	ggml : reduce hash table reset cost (llama/8698) * ggml : reduce hash table reset cost * fix unreachable code warnings after GGML_ASSERT(false) * GGML_ASSERT(false) -> GGML_ABORT("fatal error") * GGML_ABORT use format string	2024-08-08 22:48:46 +03:00
Jeroen Mostert	86506b0c5c	Allow all RDNA2 archs to use sdot4 intrinsic (llama/8629) The check gating the use of `__builtin_amdgcn_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.	2024-08-08 22:48:46 +03:00
Johannes Gäßler	8c4f30497a	CUDA: MMQ code deduplication + iquant support (llama/8495) * CUDA: MMQ code deduplication + iquant support * 1 less parallel job for CI build	2024-08-08 22:48:46 +03:00

61 Commits