whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-06-20 23:55:04 +00:00

Author	SHA1	Message	Date
pengxin99	dc01aadb18	fix softmax r2r result wrong issue (llama/7811)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	e08c62149b	CUDA: refactor mmq, dmmv, mmvq (llama/7716) * CUDA: refactor mmq, dmmv, mmvq * fix out-of-bounds write * struct for qk, qr, qi * fix cmake build * mmq_type_traits	2024-06-16 18:19:48 +03:00
Georgi Gerganov	abab4500fa	ggml : refactor rope norm/neox (llama/7634) * ggml : unify rope norm/neox (CPU) * ggml : fix compile warning * ggml : remove GLM rope mode ggml-ci * metal : better rope implementation ggml-ci * cuda : better rope implementation ggml-ci * naming : n_orig_ctx -> n_ctx_orig ggml-ci * dev : add reminders to update backends ggml-ci * vulkan : fix ggml_rope_ext() usage * cuda : fix array size + indents ggml-ci	2024-06-16 18:19:48 +03:00
agray3	e666315fa8	Allow number of nodes in CUDA graph to change (llama/7738) Previously the code would have failed to cope in the case that the number of nodes changes in an existing CUDA graph. This fixes the issue by removing an unnecessary conditional.	2024-06-16 18:19:48 +03:00
Georgi Gerganov	3f869af14c	ggml : remove OpenCL (llama/7735) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	cbacb7634c	ggml : prevent builds with -ffinite-math-only (llama/7726) This enforces a check that -fno-finite-math-only was set and that the operating compiling mode is not in finite maths mode. This is because during rewriting of silu and softmax for cpu #7154 there emerged an issue where the result that was observed when >1 slot was nondeterministic as found by @JohannesGaessler. @LostRuins narrowed the problem down to -ffinite-math-only which was theorised to be due to SiLU, instead of flushing small values to 0, returns NaN or some other garbage. @jart proposed a fix that @ggerganov then implemented in this fix ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	6cc3b022ee	llama : offload to RPC in addition to other backends (llama/7640) * llama : offload to RPC in addition to other backends * - fix copy_tensor being called on the src buffer instead of the dst buffer - always initialize views in the view_src buffer - add RPC backend to Makefile build - add endpoint to all RPC object names * add rpc-server to Makefile * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
Masaya, Kato	e5e38d4920	ggml : use OpenMP as a thread pool (llama/7606) * ggml: Added OpenMP for multi-threads processing * ggml : Limit the number of threads used to avoid deadlock * update shared state n_threads in parallel region * clear numa affinity for main thread even with openmp * enable openmp by default * fix msvc build * disable openmp on macos * ci : disable openmp with thread sanitizer * Update ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
0cc4m	2a6bab5655	Vulkan Mixture of Experts (MoE) support (llama/7628) * Finish Vulkan mul_mat_id implementation * Add Vulkan sum_rows and div ops * Fix MUL_MAT_ID matrix matrix shader * Fix MUL_MAT_ID matrix vector shader dispatch size * Fix MUL_MAT_ID matrix vector shader and dispatch code * Update Vulkan CPU offload for MUL_MAT_ID * Fix crash when using split mode none and setting a main GPU	2024-06-16 18:19:48 +03:00
woachk	8c01c9b85c	kompute : implement op_getrows_f32 (llama/6403) op_getrows_f32 is required since https://github.com/ggerganov/llama.cpp/pull/6122 for the Vulkan w/ Kompute backend to be functional. As such, implement this op to make this backend functional again.	2024-06-16 18:19:48 +03:00
Dave Airlie	d1123d795e	fix bug introduced in using calloc (llama/7701) compilade pointed this out on the previous MR	2024-06-16 18:19:48 +03:00
Johannes Gäßler	9b3d784020	Fix FlashAttention debug test, FP32 assert (llama/7684)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	a16137d13d	CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	5582039d0a	CUDA: quantized KV support for FA vec (llama/7527) * CUDA: quantized KV support for FA vec * try CI fix * fix commented-out kernel variants * add q8_0 q4_0 tests * fix nwarps > batch size * split fattn compile via extern templates * fix flake8 * fix metal tests * fix cmake * make generate_cu_files.py executable * add autogenerated .cu files * fix AMD * error if type_v != FP16 and not flash_attn * remove obsolete code	2024-06-16 18:19:48 +03:00
Georgi Gerganov	9a16c643e2	ggml : fix loongson compile warnings (llama/7537) * ggml : fix loongson compile warnings ggml-ci * Fix loongarch quantize test fail. Fix unexpected error introduced during rebase code. * tests : disable json test due to lack of python on the CI node ggml-ci --------- Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>	2024-06-16 18:19:48 +03:00
Chris Elrod	10a8a23100	faster avx512 exp implementation (llama/7551) * faster avx512 exp implementation * x->r * improve accuracy, handle special cases * remove `e`	2024-06-16 18:19:48 +03:00
junchao-loongson	29cfeef77f	ggml : fix loongarch build (O2 issue) (llama/7636)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	e66e9ea25b	metal : remove invalid asserts (llama/7617)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	276779a849	metal : add missing asserts (llama/7617)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	1f35ce61c1	ggml : fix YARN + add tests + add asserts (llama/7617) * tests : add rope tests ggml-ci * ggml : fixes (hopefully) ggml-ci * tests : add non-cont tests ggml-ci * cuda : add asserts for rope/norm + fix DS2 ggml-ci * ggml : assert contiguousness * tests : reduce RoPE tests ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	4b19cc3ed4	cuda : non-cont concat support (llama/7610) * tests : add non-cont concat tests * cuda : non-cont concat support ggml-ci	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	a535d348dd	llama-bench : add support for the RPC backend (llama/7435)	2024-06-16 18:19:48 +03:00
slaren	8f5dc729d9	ggml : use atomic_flag for critical section (llama/7598) * ggml : use atomic_flag for critical section * add windows shims	2024-06-16 18:19:48 +03:00
Georgi Gerganov	02fc147a0b	examples : adapt to new ggml_concat (ggml/0)	2024-06-16 18:19:48 +03:00
zhouwg	109148ac84	ggml : fix typo in ggml.c (llama/7603)	2024-06-16 18:19:48 +03:00
Meng, Hengyu	3563473d2c	Align GEMM dispatch (llama/7566) * align GEMM dispatch	2024-06-16 18:19:48 +03:00
Georgi Gerganov	046834198d	sycl : fix assert (llama/7563)	2024-06-16 18:19:48 +03:00
k.h.lai	0a2ad9de06	vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE (llama/7552)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	39b0640b09	rpc : resource management rework (llama/7562) * rpc : resource management rework * address review comments	2024-06-16 18:19:48 +03:00
Neo Zhang	8dca71de64	fix ggml_sycl_mul_mat_id() to match the change of api (llama/7436) * fix mul_mat_id to match the change of api * rm comment * rm unused or duplicated code, rename as review comment	2024-06-16 18:19:48 +03:00
Georgi Gerganov	812787cbc5	ggml : generalize GGML_OP_CONCAT (llama/7563) * ggml : generalize GGML_OP_CONCAT (WIP) ggml-ci * tests : add dim != 2 tests * metal : generalize concat kernel * tests : naming * cuda : generalize concat kernel ggml-ci * sycl : add warning and assert * ggml : fix op params handling * metal : bugfix kernel ggml-ci * ggml : reimplement CPU and Metal * cuda : add asserts ggml-ci * ggml : fix ptrs ggml-ci	2024-06-16 18:19:48 +03:00
Djip007	68ef10805e	update HIP_UMA #7399 (llama/7414) * update HIP_UMA #7399 add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled. - get x2 on prompt eval and x1.5 on token gen with rocm6.0 on ryzen 7940HX iGPU (780M/gfx1103) * simplify code, more consistent style --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
agray3	96fdb90f5f	Allow multiple copy function pointers for CUDA graph kernel param updates (llama/7565) CUDA graphs require parameter updates to kernels associated with GGML_OP_CPY nodes. Previously the implementation only checked for a single CUDA kernel in such nodes, but this caused a bug in cases where 2 such kernels exist. This fixes the issue by using a vector to allow multiple function pointers to be stored and checked against. Fixes #7942	2024-06-16 18:19:48 +03:00
AidanBeltonS	e98f9ac554	Fix q_xxs using mul_mat_q (llama/7459)	2024-06-16 18:19:48 +03:00
AidanBeltonS	02d481595b	Add freq factors (llama/7495)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	7091c7ab5a	metal : add GGML_OP_REPEAT kernels (llama/7557) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	d70ccb75f5	metal : disable FA kernel for HS=256 (llama/7556) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	5ee048eb67	ggml : restore ggml_rope_xpos_inplace (ggml/0) ggml-ci	2024-06-16 18:19:48 +03:00
Masaya, Kato	37ed71c964	ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (llama/7433) * Add SVE support for q4_0_q8_0 q8_0_q8_0 * remove ifdef	2024-06-16 18:19:48 +03:00
Georgi Gerganov	8cd7a3df37	ggml : silence UB sanitizer error during iq2_xxs quantization (llama/0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	04a3279320	ggml : remove ggml_flash_attn and ggml_flash_ff (llama/7463) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	45ddda8e0c	ggml : drop support for QK_K=64 (llama/7473) * ggml : drop support for QK_K=64 ggml-ci * opencl : restore QK_K=256 define	2024-06-16 18:19:48 +03:00
0cc4m	c41317fd66	Update vulkan rope implementation to support frequency factors (llama/7475)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	96b8419b27	CUDA: fix FA out-of-bounds reads (llama/7479)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	3c63f4cf35	CUDA: fix FA out-of-bounds writes (llama/7465)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	5848dfd9c8	cuda : fix compile warning (llama/7454)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	29ab5d0326	CUDA: remove incorrect precision check (llama/7454)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	c4d6958b3e	cuda : fix rope + add tests (llama/7452) * cuda : fix rope pos data ggml-ci * ggml : drop mode & 1 == 1 support for ggml_rope ggml-ci * ggml : support freq_factors for f16 rope (CPU) ggml-ci * tests : add rope tests using frequency factors ggml-ci	2024-06-16 18:19:48 +03:00
liuwei-git	c9dcb75118	llama : add phi3 128K model support (llama/7225) * add phi3 128k support in convert-hf-to-gguf * add phi3 128k support in cuda * address build warnings on llama.cpp * adjust index value in cuda long rope freq factors * add long rope support in ggml cpu backend * make freq factors only depend on ctx size * remove unused rope scaling type 'su' from gguf converter * fix flint warnings on convert-hf-to-gguf.py * set to the short freq factor when context size is smaller than trained context size * add one line of comments * metal : support rope freq_factors * ggml : update ggml_rope_ext API to support freq. factors * backends : add dev messages to support rope freq. factors * minor : style * tests : update to use new rope API * backends : fix pragma semicolons * minor : cleanup * llama : move rope factors from KV header to tensors * llama : remove tmp assert * cuda : fix compile warning * convert : read/write n_head_kv * llama : fix uninitialized tensors --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
Georgi Gerganov	bbdbc3fc62	metal : handle F16 inf values, fix FA partial offload (llama/7434) ggml-ci	2024-06-16 18:19:48 +03:00

1586 Commits