whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-05-02 08:43:02 +00:00

Author	SHA1	Message	Date
Sacha Arbonel	88d13a17a7	feat: add health check endpoint to server (#2968 ) Some checks failed CI / ubuntu-22-clang (linux/amd64, Release) (push) Has been cancelled Details CI / ubuntu-22-clang (linux/arm64, Debug) (push) Has been cancelled Details CI / ubuntu-22-clang (linux/arm64, Release) (push) Has been cancelled Details CI / ubuntu-22-clang (linux/ppc64le, Debug) (push) Has been cancelled Details CI / ubuntu-22-clang (linux/ppc64le, Release) (push) Has been cancelled Details CI / ubuntu-22-gcc-sanitized (linux/amd64, ADDRESS) (push) Has been cancelled Details CI / ubuntu-22-gcc-sanitized (linux/amd64, THREAD) (push) Has been cancelled Details CI / ubuntu-22-gcc-sanitized (linux/amd64, UNDEFINED) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl (linux/amd64, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl (linux/arm/v7, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl (linux/arm64, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl (linux/ppc64le, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl-fp16 (linux/amd64, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm/v7, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm64, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl-fp16 (linux/ppc64le, icx, icpx, ON) (push) Has been cancelled Details CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Has been cancelled Details CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Has been cancelled Details CI / windows (Win32, Release, win32-x86, x86, 2.28.5, ON) (push) Has been cancelled Details CI / windows (x64, Release, win32-x86-64, x64, 2.28.5, ON) (push) Has been cancelled Details CI / windows-blas (Win32, ON, Release, x86, 2.28.5, ON) (push) Has been cancelled Details CI / windows-blas (x64, ON, Release, x64, 2.28.5, ON) (push) Has been cancelled Details CI / windows-cublas (x64, Release, ON, 11.8.0, ON, 2.28.5) (push) Has been cancelled Details CI / windows-cublas (x64, Release, ON, 12.2.0, ON, 2.28.5) (push) Has been cancelled Details CI / emscripten (Release) (push) Has been cancelled Details CI / android (push) Has been cancelled Details CI / quantize (push) Has been cancelled Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/main.Dockerfile platform:linux/amd64 tag:main]) (push) Has been cancelled Details CI / ios-xcode-build (Release) (push) Has been cancelled Details CI / release (push) Has been cancelled Details	2025-03-31 11:03:41 +03:00
Daniel Bevenius	f92bd59951	whisper : remove unnecessary GGML_UNUSED macro (#2960 ) Some checks failed CI / ubuntu-22-clang (linux/arm64, Debug) (push) Waiting to run Details CI / ubuntu-22-clang (linux/arm64, Release) (push) Waiting to run Details CI / ubuntu-22-clang (linux/ppc64le, Debug) (push) Waiting to run Details CI / ubuntu-22-clang (linux/ppc64le, Release) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, ADDRESS) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, THREAD) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, UNDEFINED) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/amd64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/arm/v7, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/arm64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/ppc64le, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/amd64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm/v7, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/ppc64le, icx, icpx, ON) (push) Waiting to run Details CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run Details CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run Details CI / windows (Win32, Release, win32-x86, x86, 2.28.5, ON) (push) Waiting to run Details CI / windows (x64, Release, win32-x86-64, x64, 2.28.5, ON) (push) Waiting to run Details CI / windows-blas (Win32, ON, Release, x86, 2.28.5, ON) (push) Waiting to run Details CI / windows-blas (x64, ON, Release, x64, 2.28.5, ON) (push) Waiting to run Details CI / windows-cublas (x64, Release, ON, 11.8.0, ON, 2.28.5) (push) Waiting to run Details CI / windows-cublas (x64, Release, ON, 12.2.0, ON, 2.28.5) (push) Waiting to run Details CI / emscripten (Release) (push) Waiting to run Details CI / ios-xcode-build (Release) (push) Blocked by required conditions Details CI / android (push) Waiting to run Details CI / quantize (push) Waiting to run Details CI / release (push) Blocked by required conditions Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/main.Dockerfile platform:linux/amd64 tag:main]) (push) Waiting to run Details Bindings Tests (Ruby) / ubuntu-22 (push) Has been cancelled Details	2025-03-30 05:56:10 +02:00
Georgi Gerganov	6e7629b146	sync : ggml Some checks failed CI / ubuntu-22-clang (linux/amd64, Release) (push) Has been cancelled Details CI / ubuntu-22-clang (linux/arm64, Debug) (push) Has been cancelled Details CI / ubuntu-22-clang (linux/arm64, Release) (push) Has been cancelled Details CI / ubuntu-22-clang (linux/ppc64le, Debug) (push) Has been cancelled Details CI / ubuntu-22-clang (linux/ppc64le, Release) (push) Has been cancelled Details CI / ubuntu-22-gcc-sanitized (linux/amd64, ADDRESS) (push) Has been cancelled Details CI / ubuntu-22-gcc-sanitized (linux/amd64, THREAD) (push) Has been cancelled Details CI / ubuntu-22-gcc-sanitized (linux/amd64, UNDEFINED) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl (linux/amd64, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl (linux/arm/v7, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl (linux/arm64, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl (linux/ppc64le, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl-fp16 (linux/amd64, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm/v7, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm64, icx, icpx, ON) (push) Has been cancelled Details CI / ubuntu-22-cmake-sycl-fp16 (linux/ppc64le, icx, icpx, ON) (push) Has been cancelled Details CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Has been cancelled Details CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Has been cancelled Details CI / windows (Win32, Release, win32-x86, x86, 2.28.5, ON) (push) Has been cancelled Details CI / windows (x64, Release, win32-x86-64, x64, 2.28.5, ON) (push) Has been cancelled Details CI / windows-blas (Win32, ON, Release, x86, 2.28.5, ON) (push) Has been cancelled Details CI / windows-blas (x64, ON, Release, x64, 2.28.5, ON) (push) Has been cancelled Details CI / windows-cublas (x64, Release, ON, 11.8.0, ON, 2.28.5) (push) Has been cancelled Details CI / windows-cublas (x64, Release, ON, 12.2.0, ON, 2.28.5) (push) Has been cancelled Details CI / emscripten (Release) (push) Has been cancelled Details CI / android (push) Has been cancelled Details CI / quantize (push) Has been cancelled Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/main.Dockerfile platform:linux/amd64 tag:main]) (push) Has been cancelled Details CI / ios-xcode-build (Release) (push) Has been cancelled Details CI / release (push) Has been cancelled Details ggml-ci	2025-03-28 21:47:42 +02:00
Georgi Gerganov	27533e7f63	metal : improve FA + improve MoE (llama/12612) * ggml : FA with different K, V head sizes (CPU) ggml-ci * metal : add FA with HS=192 * metal : extend FA to support different K and V head sizes ggml-ci * metal : add FA vector kernels for heads K 192 and V 128 ggml-ci * ggml : restrict op on other backends to equal head sizes ggml-ci * metal : optimize FA-vec kernel ggml-ci * metal : FA remove mq registers * metal : improve MoE mul_mat_id condition ggml-ci * metal : fix comments + remove unnecessary addition ggml-ci * metal : avoid too much shared memory usage with mul_mat_id ggml-ci	2025-03-28 21:47:42 +02:00
Icenowy Zheng	1b81415963	vulkan: fix coopmat shader generation when cross-compiling (llama/12272) * vulkan: fix coopmat shader generation when cross-compiling Previously the status of coopmat{,2} support isn't passed to the vulkan-shaders-gen project building on the host, which leads to build failure because of the cross-compiling code expecting coopmat{,2} shaders that didn't get generated. Fix this by passing the coopmat{,2} support status to vulkan-shaders subproject. Signed-off-by: Icenowy Zheng <uwu@icenowy.me> * Only call coop-mat shaders once * Fix whitespace --------- Signed-off-by: Icenowy Zheng <uwu@icenowy.me> Co-authored-by: bandoti <141645996+bandoti@users.noreply.github.com>	2025-03-28 21:47:42 +02:00
amritahs-ibm	0001ec075f	llamafile : ppc64le GEMV forwarding for FP32. (llama/12594) This patch enables usage of MMA when one of the dimensions of the matrix(ie either M or N) is 1. This is useful in case of token generation where N < 2. The concept of 'GEMV Forwarding' is used where when one of the matrix has a single row/column, the elements are broadcasted, instead of using packing routine to prepack the matrix elements. This change results in 5% - 15% improvement in total speed(ie all tokens/total time), across various batch sizes. This is in comparision with the corresponding dot product implementation. The patch is tested with FP32 models of Meta-Lllama-3-8B, Mistral-7B, Llama-2-7B-chat-hf on a IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>	2025-03-28 21:47:42 +02:00
Radoslav Gerganov	5bad2e5099	rpc : send hash when tensor data is above some fixed threshold (llama/12496) * rpc : send hash when tensor data is above some fixed threshold ref #10095 * rpc : put cache under $HOME/.cache/llama.cpp * try to fix win32 build * another try to fix win32 build * remove llama as dependency	2025-03-28 21:47:42 +02:00
lhez	6fc0ae2f5a	opencl: add multi and vision rope, `gelu_quick` and `im2col` (llama/12600) * opencl: add `im2col` * opencl: add `gelu_quick` * opencl: add mrope * opencl: add vision rope	2025-03-28 21:47:42 +02:00
Amanda Der Bedrosian	de6b38c6d9	bindings.go : add DetectedLanguage to go bindings (#2947 ) Some checks failed CI / ubuntu-22-clang (linux/arm64, Debug) (push) Waiting to run Details CI / ubuntu-22-clang (linux/arm64, Release) (push) Waiting to run Details CI / ubuntu-22-clang (linux/ppc64le, Debug) (push) Waiting to run Details CI / ubuntu-22-clang (linux/ppc64le, Release) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, ADDRESS) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, THREAD) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, UNDEFINED) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/amd64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/arm/v7, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/arm64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/ppc64le, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/amd64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm/v7, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/ppc64le, icx, icpx, ON) (push) Waiting to run Details CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run Details CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run Details CI / windows (Win32, Release, win32-x86, x86, 2.28.5, ON) (push) Waiting to run Details CI / windows (x64, Release, win32-x86-64, x64, 2.28.5, ON) (push) Waiting to run Details CI / windows-blas (Win32, ON, Release, x86, 2.28.5, ON) (push) Waiting to run Details CI / windows-blas (x64, ON, Release, x64, 2.28.5, ON) (push) Waiting to run Details CI / windows-cublas (x64, Release, ON, 11.8.0, ON, 2.28.5) (push) Waiting to run Details CI / windows-cublas (x64, Release, ON, 12.2.0, ON, 2.28.5) (push) Waiting to run Details CI / emscripten (Release) (push) Waiting to run Details CI / ios-xcode-build (Release) (push) Blocked by required conditions Details CI / android (push) Waiting to run Details CI / quantize (push) Waiting to run Details CI / release (push) Blocked by required conditions Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/main.Dockerfile platform:linux/amd64 tag:main]) (push) Waiting to run Details Bindings Tests (Go) / ubuntu-22 (push) Has been cancelled Details Adding in DetectedLanguage(), a function to retrieve the detected language that's populated by processing audio. Also adding in a unit test to test the success. Co-authored-by: Amanda Der Bedrosian <aderbedrosian@sdl.com>	2025-03-28 12:26:22 +01:00
Daniel Bevenius	46d6e0abc1	ruby : fix test failures in test_whisper (#2955 ) * bindings.ruby : fix test failures in test_whisper This commit updates the parallel tests to use 2 processors instead of the number of processors on the system. It also comments out the setting of the log callback to an empty lambda as this causes a segfault when enabled. The motivation for the change to the number of processors is that if one has a large number of processors, for example I have 16 on the machine I used to test this, this would cause the following warning to be printed: ```console whisper_full_with_state: input is too short - 680 ms < 1000 ms. consider padding the input audio with silence ``` This is logged from: ```c++ int whisper_full_with_state( struct whisper_context * ctx, struct whisper_state * state, struct whisper_full_params params, const float * samples, int n_samples) { ... if (seek_end < seek_start + 100) { WHISPER_LOG_WARN("%s: input is too short - %d ms < 1000 ms. consider padding the input audio with silence\n", __func__, (seek_end - seek_start)10); return 0; } ``` This will return early and there will be segment callbacks to be invoked which in turn will cause the tests to fail. bindings.ruby : fix warnings in tests This commit fixes the following warnings in the Ruby tests: ```console /whisper/bindings/ruby/tests/test_segment.rb:52: warning: ambiguity between regexp and two divisions: wrap regexp in parentheses or add a space after `/' operator ``` And also adds a '_' prefix to some unused variables to avoid warnings. * bindings.ruby : enable Wisper.log_set in tests The commit reverts the commenting out of the Whisper.log_set call in the test_whisper.rb tests. I'm no longer getting segfaults when running the tests with this which was the case earlier. One theory could be that I rebased this to include the latest ggml sync to master to make sure things still worked. With the latest changes in ggml, I can't reproduce the segfaults.	2025-03-28 17:29:56 +09:00
Lin Xiaodong	1279f0d0bc	examples : support progress_callback API for addon.node (#2941 ) Some checks failed CI / ubuntu-22-clang (linux/arm64, Release) (push) Waiting to run Details CI / ubuntu-22-clang (linux/ppc64le, Debug) (push) Waiting to run Details CI / ubuntu-22-clang (linux/ppc64le, Release) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, ADDRESS) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, THREAD) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, UNDEFINED) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/amd64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/arm/v7, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/arm64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/ppc64le, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/amd64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm/v7, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/ppc64le, icx, icpx, ON) (push) Waiting to run Details CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run Details CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run Details CI / windows (Win32, Release, win32-x86, x86, 2.28.5, ON) (push) Waiting to run Details CI / windows (x64, Release, win32-x86-64, x64, 2.28.5, ON) (push) Waiting to run Details CI / windows-blas (Win32, ON, Release, x86, 2.28.5, ON) (push) Waiting to run Details CI / windows-blas (x64, ON, Release, x64, 2.28.5, ON) (push) Waiting to run Details CI / windows-cublas (x64, Release, ON, 11.8.0, ON, 2.28.5) (push) Waiting to run Details CI / windows-cublas (x64, Release, ON, 12.2.0, ON, 2.28.5) (push) Waiting to run Details CI / emscripten (Release) (push) Waiting to run Details CI / ios-xcode-build (Release) (push) Blocked by required conditions Details CI / android (push) Waiting to run Details CI / quantize (push) Waiting to run Details CI / release (push) Blocked by required conditions Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/main.Dockerfile platform:linux/amd64 tag:main]) (push) Waiting to run Details Examples Tests / addon_node-ubuntu-22 (16.x) (push) Has been cancelled Details Examples Tests / addon_node-ubuntu-22 (18.x) (push) Has been cancelled Details * feat: progress supported * fix: missing params * style: Format the code to improve readability Unified code indentation ensures consistent coding style, enhancing code readability and maintainability. * feat: support prompt api --------- Co-authored-by: linxiaodong <calm.lin@wukongsch.com>	2025-03-28 06:34:26 +01:00
Georgi Gerganov	f28bf5d186	xcf : fix visionOS build Some checks are pending CI / ubuntu-22-clang (linux/amd64, Release) (push) Waiting to run Details CI / ubuntu-22-clang (linux/arm64, Debug) (push) Waiting to run Details CI / ubuntu-22-clang (linux/arm64, Release) (push) Waiting to run Details CI / ubuntu-22-clang (linux/ppc64le, Debug) (push) Waiting to run Details CI / ubuntu-22-clang (linux/ppc64le, Release) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, ADDRESS) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, THREAD) (push) Waiting to run Details CI / ubuntu-22-gcc-sanitized (linux/amd64, UNDEFINED) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/amd64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/arm/v7, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/arm64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl (linux/ppc64le, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/amd64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm/v7, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/arm64, icx, icpx, ON) (push) Waiting to run Details CI / ubuntu-22-cmake-sycl-fp16 (linux/ppc64le, icx, icpx, ON) (push) Waiting to run Details CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run Details CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run Details CI / windows (Win32, Release, win32-x86, x86, 2.28.5, ON) (push) Waiting to run Details CI / windows (x64, Release, win32-x86-64, x64, 2.28.5, ON) (push) Waiting to run Details CI / windows-blas (Win32, ON, Release, x86, 2.28.5, ON) (push) Waiting to run Details CI / windows-blas (x64, ON, Release, x64, 2.28.5, ON) (push) Waiting to run Details CI / windows-cublas (x64, Release, ON, 11.8.0, ON, 2.28.5) (push) Waiting to run Details CI / windows-cublas (x64, Release, ON, 12.2.0, ON, 2.28.5) (push) Waiting to run Details CI / emscripten (Release) (push) Waiting to run Details CI / ios-xcode-build (Release) (push) Blocked by required conditions Details CI / android (push) Waiting to run Details CI / quantize (push) Waiting to run Details CI / release (push) Blocked by required conditions Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/main.Dockerfile platform:linux/amd64 tag:main]) (push) Waiting to run Details ref: https://github.com/ggml-org/llama.cpp/pull/12415 ggml-ci	2025-03-27 11:06:03 +02:00
Georgi Gerganov	1fbdfb1d36	files : remove old wkv6 (#0 ) ggml-ci	2025-03-27 11:06:03 +02:00
Georgi Gerganov	ee5581633b	sync : ggml ggml-ci	2025-03-27 11:06:03 +02:00
Georgi Gerganov	8ca67df291	ggml : sync/merge cmake,riscv,powerpc, add common.cmake (ggml/0)	2025-03-27 11:06:03 +02:00
amritahs-ibm	fc6d343e76	llamafile : ppc64le MMA implementation for Q4_0. (llama/12489) This change upstreams llamafile's cpu matrix multiplication kernels for ppc64le ISA using MMA builtins. This patch handles matrix multiplication between quantised datatypes, block_q4_0 and block_q8_0. This change results in 5% - 50% improvement in total speed(ie all tokens/total time), across various batch sizes. The patch is tested with Meta-Lllama-3-8B, Mistral-7B, Llama-2-7B-chat-hf models on a IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>	2025-03-27 11:06:03 +02:00
Akarshan Biswas	3199356d3a	SYCL: implement memset ggml backend buffer interface (llama/12580) * SYCL: implement memset ggml backend buffer interface * use GGML_ABORT macro * Do not wait for all queues to finish for memset operation	2025-03-27 11:06:03 +02:00
Slobodan Josic	e0c43b0bbf	HIP: Add support for RDNA4 targets (llama/12372)	2025-03-27 11:06:03 +02:00
Georgi Gerganov	f4f619ea8e	metal : refactor mat-vec code (llama/12569) * metal : refactor mat-vec code ggml-ci * metal : rename all_sum -> sum_all ggml-ci * metal : fix comments [no ci] * metal : fix nr constant [no ci] * metal : mv q6_K support nr0 > 1 ggml-ci * metal : reduce register pressure ggml-ci * metal : fix typo [no ci] * metal : reduce register pressure ggml-ci	2025-03-27 11:06:03 +02:00
Georgi Gerganov	3c4d363872	ggml : fix MUL_MAT_ID repack with Q8_K (llama/12544) * ggml : fix MUL_MAT_ID repack with Q8_K ggml-ci * ggml : improve repack templates ggml-ci	2025-03-27 11:06:03 +02:00
Dan Johansson	15aa189329	ggml-cpu : update KleidiAI to v1.5.0 (llama/12568) ggml-cpu : bug fix related to KleidiAI LHS packing Signed-off-by: Dan Johansson <dan.johansson@arm.com>	2025-03-27 11:06:03 +02:00
Akarshan Biswas	c53d5c9e85	SYCL: disable Q4_0 reorder optimization (llama/12560) ggml-ci	2025-03-27 11:06:03 +02:00
lhez	ba6f584f30	opencl: simplify kernel embedding logic in cmakefile (llama/12503) Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2025-03-27 11:06:03 +02:00
R0CKSTAR	a219941812	CUDA: Fix clang warnings (llama/12540) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-27 11:06:03 +02:00
Jeff Bolz	a2cc8c2666	vulkan: fix mul_mat_vec failure in backend tests (llama/12529) The OOB calculation could be wrong if the last iteration was during one of the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple new backend tests that hit this failure on NVIDIA GPUs.	2025-03-27 11:06:03 +02:00
Georgi Gerganov	388ed98220	ggml : fix quantized cpy op (llama/12310) * ggml : fix quantized cpy op ggml-ci * tests : add cpy tests for all types ggml-ci * tests : add BF16 copy tests ggml-ci * tests : fix loop for same-type copy ggml-ci * tests : add option to permute the dst tensor ggml-ci	2025-03-27 11:06:03 +02:00
R0CKSTAR	d487a28ae1	musa: refine compute capability (llama/12493) * musa: refine compute capability Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-27 11:06:03 +02:00
Jeff Bolz	cbb88c4050	vulkan: Optimize mul_mat_vec p021 and nc shaders (llama/12505) * tests: add mul_mat perf/functional tests for p021/nc vulkan shaders * vulkan: Optimize mul_mat_vec p021 and nc shaders. These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches). Using subgroupAdd in the p021 shader also helps, use that conditionally.	2025-03-27 11:06:03 +02:00
stduhpf	13455c0b5f	Vulkan: RTE rounding for cpy to quant (llama/12480) * Vulkan: RTE rounding for cpy to quant Co-Authored-By: Jeff Bolz <jbolz@nvidia.com> * remove trailing whitespace * avoid duplicating pipeline_cpy_f32_quant * fix copypasting issue * remove duplicated code --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-03-27 11:06:03 +02:00
Eve	2f77a9e9bd	vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (llama/12472)	2025-03-27 11:06:03 +02:00
蕭澧邦	fa2b5249ff	Fix build on Windows when ccache enabled (ggml/9954) (llama/9976) * [SYCL] Fix build on Windows when ccache enabled (llama/9954) * take effect only on windows and force it to icl --------- Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>	2025-03-27 11:06:03 +02:00
Svetlozar Georgiev	5b854ebba5	sycl: cleanup oneDNN related code (llama/12097)	2025-03-27 11:06:03 +02:00
Srihari-mcw	8058f19d0b	ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (llama/12332) * Add block interleaving support for Q4_K quantization * Remove whitespaces and fix CI/CD issues * Update pointer of bsums from int16_t to const int16_t * Add vector version of quantize_q8_K_4x8 function * Update code formatting based on review comments	2025-03-27 11:06:03 +02:00
Gaurav Garg	ae6a9bb9a5	CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183) - Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value. - Prefer vector flash attention kernels over MMA kernel for BS=1 Fixes Issue: #12182 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-03-27 11:06:03 +02:00
Jeff Bolz	24faba9e9b	vulkan: optimize iq1 coopmat2 dequant functions (llama/12427)	2025-03-27 11:06:03 +02:00
Guus Waals	c722ff84d3	Fix visionOS build and add CI (llama/12415) * ci: add visionOS build workflow Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode. * ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs * ci: remove define hacks for u_xxx system types --------- Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>	2025-03-27 11:06:03 +02:00
Jeff Bolz	102af79f63	vulkan: Submit once enough matmul work has been recorded (llama/12406) I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.	2025-03-27 11:06:03 +02:00
lhez	03c364557d	opencl: improve profiling (llama/12442) * opencl: more profiling timing * opencl: generate trace for profiling * opencl: reduce profiling overhead * Populate profiling timing info at the end rather than after each kernel run * opencl: fix for chrome tracing	2025-03-27 11:06:03 +02:00
R0CKSTAR	31b62276cf	musa: override warp_size of musa device to 32 (llama/12445) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-27 11:06:03 +02:00
Łukasz Ślusarczyk	97b5a3055d	SYCL: using graphs is configurable by environment variable and compile option (llama/12371) * alberto changes * enable sycl graphs by env variable * fixed compilation warnings in ggml-sycl.cpp * renamed graph variables * fix markdown in docs/backend/SYCL.md Co-authored-by: Romain Biessy <romain.biessy@codeplay.com> * fix markdown in docs/backend/SYCL.md again * compiling graphs by default, renamed graph_enable to graph_disable --------- Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>	2025-03-27 11:06:03 +02:00
fj-y-saito	9993c3f703	ggml : add SVE support for q6_K_q8_K (llama/12361)	2025-03-27 11:06:03 +02:00
0cc4m	fa72479cfb	Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (llama/12434)	2025-03-27 11:06:03 +02:00
Łukasz Ślusarczyk	6c15539c54	fixed compilation warnings in ggml-sycl (llama/12424)	2025-03-27 11:06:03 +02:00
Molly Sophia	52c4c03b0a	llama: Add support for RWKV v7 architecture (llama/12412) * ggml: Add op l2_norm Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add op rwkv_wkv7 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: Add support for RWKV7 and ARWKV7 models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix inference with RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: add more (a)rwkv7 variants in size Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code-format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * fix MUSA build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix shape error with rwkv using llama-parallel Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-03-27 11:06:03 +02:00
Gaurav Garg	cfc2560e41	cuda : enable CUDA Graph on CUDA Toolkit < 12.x (llama/12394) * Enable CUDA Graph on CTK < 12.x `cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA support in CTK version < 12.x by using older API if CTK < 12.x. * Fix compilation errors with MUSA * Disable CUDA Graph for MUSA	2025-03-27 11:06:03 +02:00
Guus Waals	db6e8056b5	ggml-vulkan: remove unused find_program(glslc) (llama/12416) It's already found by FindVulkan.cmake in the parent CMakeLists	2025-03-27 11:06:03 +02:00
Jeff Bolz	b3f3779c1b	vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (llama/12312)	2025-03-27 11:06:03 +02:00
Daniele	13eeebb1b2	vulkan: subgroup size tuning (llama/12087) * vulkan: subgroup size test * Vulkan: Add device architecture enum and logic to recognize AMD generations * vulkan: use new architecture logic to specify subgroup size * Initial vulkan subgroup size tuning for RDNA3 * vulkan: commonize RDNA subgroup tuning * vulkan: override subgroup size if required_subgroup_size = 0 * vulkan: disable warp 32 for RDNA3 * vulkan: fine tuned RDNA1 subgroup sizes * vulkan: adjusted subgroup size map * vulkan: fixed RDNA2 subgroup map --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-03-27 11:06:03 +02:00
Jeff Bolz	905b834af1	vulkan: use fp32 in coopmat2 q4_k dequant function (llama/12309)	2025-03-27 11:06:03 +02:00
Jeff Bolz	2cd3061a23	vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (llama/12273) * vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking	2025-03-27 11:06:03 +02:00

1 2 3 4 5 ...

2350 Commits