whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-05-28 04:54:13 +00:00

Author	SHA1	Message	Date
Nicolò Scipione	d507b4cebe	SYCL: Introducing memory host pool (llama/11251) * Implement host pool for matrix_info Creating a new memory pool on the host to store memory location for matrix_info needed to launch gemm_batch from oneMKL/oneMath. Removing complex support in gemm_batch since it is not used in llama.cpp * Remove unnecessary headers and cast * Reorder member variable to avoid warning on initialization * Formatting * Remove unused variable * Address PR review feedback - remove warning --------- Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-02-03 22:00:57 +02:00
Georgi Gerganov	90171055f3	cmake : add sanitizer flags for llama.cpp (llama/11279) * cmake : add sanitizer flags for llama.cpp ggml-ci * tests : fix compile warnings ggml-ci * cmake : move sanitizer flags to llama_add_compile_flags ggml-ci * cmake : move llama.cpp compile flags to top level lists ggml-ci * cmake : apply only sanitizer flags at top level ggml-ci * tests : fix gguf context use in same_tensor_data * gguf-test: tensor data comparison * dummy : trigger ggml-ci * unicode : silence gcc warnings ggml-ci * ci : use sanitizer builds only in Debug mode ggml-ci * cmake : add status messages [no ci] --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-02-03 22:00:57 +02:00
Jeff Bolz	668306ff2b	vulkan: fix coopmat2 flash attention for non-contiguous inputs (llama/11281) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes #11268.	2025-02-03 22:00:57 +02:00
Radoslav Gerganov	fdc21fc87b	rpc : early register backend devices (llama/11262) Early register RPC devices and do not propagate RPC specifics in the llama model structures. ref: #10609	2025-02-03 22:00:57 +02:00
Jeff Bolz	7183a1eb72	vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (llama/11166) * vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl Shaders are based on cpy.cu. * vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32 * ggml: copy q->f32 assumes some contiguity in the destination	2025-02-03 22:00:57 +02:00
Jeff Bolz	09f3c66648	vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (llama/11206) Do masking on whole dwords, fetch all scales at once.	2025-02-03 22:00:57 +02:00
Jeff Bolz	62e2414620	vulkan: optimize coopmat2 q2_k dequant function (llama/11130)	2025-02-03 22:00:57 +02:00
Johannes Gäßler	de49024e49	CUDA: backwards pass for misc. ops, add tests (llama/11257) * CUDA: backwards pass for misc. ops, add tests * remove restrict from pointers	2025-02-03 22:00:57 +02:00
fj-y-saito	db6383094c	ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot (llama/11227) * Add SVE support for q4_K_q8_K * Update ggml/src/ggml-cpu/ggml-cpu-quants.c change to use K_SCALE_SIZE Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-03 22:00:57 +02:00
Eve	164f13c6a9	vulkan: scale caching for k quants + misc fixes (llama/11081) * q6_k scale caching * 16 bit unpack * q4_k test (slow) * revert it * q3_k * q2_k * little stuff * try precalculating products of a and q2_k scales * Revert "try precalculating products of a and q2_k scales" This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b. * unpack should be u16, add vim swap to gitignore (about time) * better q4_k scales * q5_k * better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations * q2_k better dequant * q3_k optimizations * q3_k use hmask simd from cpu avx version * make the caches happy * q3_k separate out calculation * q2_k separate out * little stuff * use calc_superblock everywhere * q2_k optimize scale calculation * more barriers	2025-02-03 22:00:57 +02:00
Junil Kim	02aa86230a	fix: ggml: fix vulkan-shaders-gen build (llama/10448) * fix: ggml: fix vulkan-shaders-gen build The vulkan-shaders-gen target was not being built correctly in case of cross-compilation. Other outputs need to be built for the cross compile target, but vulkan-shaders-gen needs to be built for the host. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup - Add GGML_SHADERS_GEN_TOOLCHAIN CMake option. - Auto-detect host toolchain if not set. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup Use configure_file to generate host_toolchain.cmake from template * fix: ggml: Fix compile error Fix compile error not finding vulkan-shaders-gen * fix: vulkan-shaders-gen build and path handling Fix build issues with vulkan-shaders-gen: - Add target dependency for correct build order - Use CMAKE_HOST_SYSTEM_NAME for executable suffix - Fix MSVC output directory in host toolchain - Normalize path handling for cross-compilation * fix: improve host compiler detection in vulkan shader build Improve host compiler detection for vulkan shader generation: - Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches - Consolidate compiler detection logic - Fix Windows-specific MSVC detection - Ensure correct compiler search in cross-compilation * refactor: Simplify CMake function for detecting host compiler Simplified the CMake function to improve the process of detecting the host compiler. * fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt Since `vulkan-shader-gen.cpp` only requires the `glslc` executable and not the Vulkan headers or libraries, CMakeLists.txt needs to be corrected. (See: ecc93d0558fc3ecb8a5af69d2ece02fae4710ade) * refactor: Rename host_toolchain.cmake.in - Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in * refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN	2025-02-03 22:00:57 +02:00
Johannes Gäßler	54a2ee648f	RoPE: fix back, CUDA support for back + noncont. (llama/11240) * RoPE: fix back, CUDA support for back + noncont. * fix comments reg. non-cont. RoPE support [no-ci]	2025-02-03 22:00:57 +02:00
Akarshan Biswas	9700cfb0a3	SYCL: Add gated linear attention kernel (llama/11175) * SYCL: Add Gated Linear attention kernel * glahpp: add a space at the end of file * gla: Put the barrier inside the main logic loop	2025-02-03 22:00:57 +02:00
William Tambellini	8e0143e205	ggml : add option to not print stack on abort (ggml/1081) * Add option to not print stack on abort Add option/envvar to disable stack printing on abort. Also link some unittests with Threads to fix link errors on ubuntu/g++11. * Update ggml/src/ggml.c --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-03 22:00:57 +02:00
issixx	f12559d590	ggml-cpu : fix ggml_graph_compute_thread did not terminate on abort. (ggml/1065) some threads kept looping and failed to terminate properly after an abort during CPU execution. Co-authored-by: issi <issi@gmail.com>	2025-02-03 22:00:57 +02:00
Johannes Gäßler	d5ef1737d8	GGUF: C++ refactor, backend support, misc fixes (skip) (llama/11030) ggml-ci	2025-01-14 10:38:01 +02:00
lhez	1deb41f0e7	ggml : add opencl backend (skip) (llama/10693) --------- Co-authored-by: Skyler Szot <quic_sszot@quicinc.com> Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com> Co-authored-by: Alexander Angus <quic_aangus@quicinc.com> Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com> Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2025-01-14 10:38:01 +02:00
Andreas Kieslinger	2425caf4fd	cuda : CUDA Graph Compute Function Refactor (precursor for performance improvements) (llama/11042) * Refactor: Moves cuda graph executable update step to separate function. * Refactor: Moves cuda graph update check to separate function. * Refactor: Moves cuda graph maintenance (update or adjusting copy parameters) to separate function for improved readability. * Fix: Adds missing reference to maintain_cuda_graph() definition. * Refactor: Improves structure and abstractions by moving CUDA graph evaluation and capture to its own function. * Refactor: Moves node graph checks and copy ops into individual function for improved readability. * Refactor: Removes code permanently excluded from compilation to increase readability. * Style: Adds missing newline * Style: Consolidates several neighboring '#ifdef USE_CUDA_GRAPH' into a single one * Refactor: Makes 'cuda_graph_update_required' a local variable * remove double lines between functions --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-01-14 10:38:01 +02:00
Radoslav Gerganov	a4b00bcaaf	ggml : do not define GGML_USE_CUDA when building with GGML_BACKEND_DL (llama/11211) Build fails when using HIP and GGML_BACKEND_DL: ``` /usr/bin/ld: ../ggml/src/libggml.so: undefined reference to `ggml_backend_cuda_reg' collect2: error: ld returned 1 exit status ``` This patch fixes this.	2025-01-14 10:38:01 +02:00
0cc4m	cdb8aa2f2e	Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (llama/11161) * Vulkan: Remove float16 use in shaders * Fix validation error about subgroup_size_control extension	2025-01-14 10:38:01 +02:00
Molly Sophia	06209f6683	llama: add support for QRWKV6 model architecture (llama/11001) * WIP: Add support for RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Some graph simplification Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add support for RWKV6Qwen2 with cpu and cuda GLA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix some typos Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix wkv test & add gla test Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix cuda warning Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update README.md Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update ggml/src/ggml-cuda/gla.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix fused lerp weights loading with RWKV6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * better sanity check skipping for QRWKV6 in llama-quant thanks @compilade Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: compilade <git@compilade.net> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2025-01-14 10:38:01 +02:00
Akarshan Biswas	c3235bd81e	SYCL: Refactor ggml_sycl_compute_forward (llama/11121) * SYCL: refactor ggml_sycl_compute_forward * SYCL: add back GGML_USED(dst) to ggml_sycl_cpy * SYCL: add function name to noop debug * SYCL: Some device info print refactoring and add details of XMX availability	2025-01-14 10:38:01 +02:00
hydai	262d0abc87	fix: add missing msg in static_assert (llama/11143) Signed-off-by: hydai <z54981220@gmail.com>	2025-01-14 10:38:01 +02:00
amritahs-ibm	124eec1664	llamafile : ppc64le MMA INT8 implementation (llama/10912) This change upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for the quantised int8 datatype. This change results in a 10% - 70% improvement in total speed (i.e. all tokens/total time) across various batch sizes. The patch was tested with the Meta-Llama-3-8B, Mistral-7B, and Llama-2-7B-chat-hf models on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>	2025-01-14 10:38:01 +02:00
Mathieu Baudier	b08c3a88c8	Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (llama/11117) * Disable GL_KHR_cooperative_matrix Vulkan extension if not available. * Perform Vulkan extensions checks in a more sensible order * Remove unnecessary #ifdef directive	2025-01-14 10:38:01 +02:00
ag2s20150909	0afce25a69	fix: Vulkan shader gen binary path when Cross-compiling (llama/11096) * fix: Vulkan shader gen binary path when cross compiling	2025-01-14 10:38:01 +02:00
Johannes Gäßler	acdbe58631	GGUF: C++ refactor, backend support, misc fixes (llama/11030) * GGUF: C++ refactor, backend support, misc fixes remove ggml_tensor.backend update CODEOWNERS [no ci] remove gguf_get_data from API revise GGUF API data types	2025-01-14 10:38:01 +02:00
Diego Devesa	09fabffdf5	ggml-backend : only offload from host buffers (fix) (llama/11124)	2025-01-14 10:38:01 +02:00
Diego Devesa	3988d6396b	ggml-backend : only offload from host buffers (llama/11120)	2025-01-14 10:38:01 +02:00
Radoslav Gerganov	c8c63eeec0	rpc : code cleanup (llama/11107) Remove duplicated macros, use GGML_LOG_ERROR for errors	2025-01-14 10:38:01 +02:00
Akarshan Biswas	abf7f24410	SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6 (llama/11087) * SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6 * Revert "SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6" This reverts commit f62dc45f318e48d375e7734b34cbddee81deed52. * Reland: Use get_multi_ptr instead of deprecated get_pointer in wkv6	2025-01-14 10:38:01 +02:00
Johannes Gäßler	341f5c28e6	CUDA: add BF16 support (llama/11093) * CUDA: add BF16 support	2025-01-14 10:38:01 +02:00
0cc4m	5377099524	Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (llama/11074) * Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver * Add (TM) to AMD name check	2025-01-14 10:38:01 +02:00
matt23654	dcbb375779	Support for models with non-512-aligned tensors over RPC. (llama/11047) * Added init tensor calling code * Added get_alloc_size forwarding * Cleaned up and improved type/error handling. * fix: remove trailing whitespaces. * Cleanup and use GGML error logging functions. * Handle potentially dangerous edge cases. * Apply suggestions from code review Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-01-14 10:38:01 +02:00
Gilad S.	4334c71aed	fix: Vulkan shader gen binary path (llama/11037)	2025-01-14 10:38:01 +02:00
Radoslav Gerganov	e875a82473	ggml : allow loading backend with env variable (ggml/1059) ref: #1058	2025-01-14 10:38:01 +02:00
Georgi Gerganov	2e93cb6a2f	ggml : do not install metal source when embed library (ggml/1054)	2025-01-04 10:45:01 +02:00
Georgi Gerganov	de5cd60d1c	metal : avoid uint (llama/11019)	2025-01-04 10:45:01 +02:00
Srihari-mcw	3fcba3e58b	ggml : fixes for AVXVNNI instruction set with MSVC and Clang (llama/11027) * Fixes for clang AVX VNNI * enable AVX VNNI and alder lake build for MSVC * Apply suggestions from code review --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-01-04 10:45:01 +02:00
Jeff Bolz	cea5f1c52f	vulkan: optimize mul_mat for small values of N (llama/10991) Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better. Share some code for reducing the result values to memory in mul_mat_vec_base.	2025-01-04 10:45:01 +02:00
Jeff Bolz	2112462db4	vulkan: im2col and matmul optimizations for stable diffusion (llama/10942) * tests: Add im2col perf tests * vulkan: optimize im2col, more elements per thread * vulkan: increase small tile size for NV_coopmat2 * vulkan: change im2col to 512 elements per workgroup	2025-01-04 10:45:01 +02:00
Jeff Bolz	fc84ecd445	vulkan: Use push constant offset to handle misaligned descriptors (llama/10987)	2025-01-04 10:45:01 +02:00
Eve	8de1e99907	vulkan: multi-row k quants (llama/10846) * multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default	2025-01-04 10:45:01 +02:00
Peter	499af9294a	examples, ggml : fix GCC compiler warnings (llama/10983) Warning types fixed (observed under MSYS2 GCC 14.2.0): * format '%ld' expects argument of type 'long int', but argument has type 'size_t' * llama.cpp/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first)	2025-01-04 10:45:01 +02:00
Djip007	bcf937c216	ggml : more perf with llamafile tinyblas on x86_64 (llama/10714) * more perf with llamafile tinyblas on x86_64. - add bf16 support - change dispatch strategy (thanks: https://github.com/ikawrakow/ik_llama.cpp/pull/71 ) - reduce memory bandwidth simple tinyblas dispatch and more cache friendly * tinyblas dynamic dispatching * sgemm: add M blocks. * - git 2.47 use short id of len 9. - show-progress is not part of GNU Wget2 * remove not stable test	2025-01-04 10:45:01 +02:00
Diego Devesa	b8d90953d7	ggml : use wstring for backend search paths (llama/10960) ggml-ci	2025-01-04 10:45:01 +02:00
Diego Devesa	60a422147b	ggml : fix arm enabled features check (llama/10961)	2025-01-04 10:45:01 +02:00
Diego Devesa	3387415bad	ggml : fix const usage in SSE path (llama/10962)	2025-01-04 10:45:01 +02:00
yuri@FreeBSD	536ca3ec89	ggml : fix run-time on FreeBSD in get_executable_path() (llama/10948)	2025-01-04 10:45:01 +02:00
Jeff Bolz	a4bb983190	vulkan: build fixes for 32b (llama/10927) * vulkan: build fixes for 32b Should fix #10923 * vulkan: initialize some buffer/offset variables	2025-01-04 10:45:01 +02:00


476 Commits