whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-05-28 04:54:13 +00:00

Author	SHA1	Message	Date
Rémy O	3bab804981	vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (llama/11595) * vulkan: implement specialized MMV kernels for IQ2 quantizations * vulkan: add MMV kernels for IQ3 quants * vulkan: Increase MMV batch size and unroll IQ LUT setup * vulkan: fix init_iq_shmem for WG sizes larger than tables * vulkan: common batch size for all I-quants	2025-03-08 15:13:01 +02:00
Eve	1fbb119b1e	vulkan: matmul dequantization improvements (llama/12015) * faster dequant for old quants * don't use unpack for iq4_nl * vec2 unpack for q8	2025-03-08 15:13:01 +02:00
Daniele	40dea850fd	vulkan: improve im2col (llama/11826) * vulkan: improve im2col performance	2025-03-08 15:13:01 +02:00
Rémy O	37a21dd43d	vulkan: implement several ops relevant for ggml_opt (llama/11769) * vulkan: support memset_tensor * vulkan: support GGML_OP_SUM * vulkan: implement GGML_OP_ARGMAX * vulkan: implement GGML_OP_SUB * vulkan: implement GGML_OP_COUNT_EQUAL * vulkan: implement GGML_OP_OPT_STEP_ADAMW * vulkan: fix check_results RWKV_WKV6 crash and memory leaks * vulkan: implement GGML_OP_REPEAT_BACK * tests: remove invalid test-backend-ops REPEAT_BACK tests * vulkan: fix COUNT_EQUAL memset using a fillBuffer command	2025-02-27 08:55:36 +02:00
Jeff Bolz	8a22a8b17f	vulkan: support multi/vision rope, and noncontiguous rope (llama/11902)	2025-02-27 08:55:36 +02:00
Rémy O	1689aaf854	vulkan: initial support for IQ1_S and IQ1_M quantizations (llama/11528) * vulkan: initial support for IQ1_S and IQ1_M quantizations * vulkan: define MMV kernels for IQ1 quantizations * devops: increase timeout of Vulkan tests again * vulkan: simplify ifdef for init_iq_shmem	2025-02-27 08:55:36 +02:00
Jeff Bolz	ef51b4cba4	vulkan: optimize coopmat2 iq2/iq3 callbacks (llama/11521) * vulkan: optimize coopmat2 iq2/iq3 callbacks * build: trigger CI on GLSL compute shader changes	2025-02-27 08:55:36 +02:00
Rémy O	6f08b24146	vulkan: initial support for IQ4_XS quantization (llama/11501)	2025-02-27 08:55:36 +02:00
Rémy Oudompheng	80fa576254	vulkan: implement initial support for IQ2 and IQ3 quantizations (llama/11360) * vulkan: initial support for IQ3_S * vulkan: initial support for IQ3_XXS * vulkan: initial support for IQ2_XXS * vulkan: initial support for IQ2_XS * vulkan: optimize Q3_K by removing branches * vulkan: implement dequantize variants for coopmat2 * vulkan: initial support for IQ2_S * vulkan: vertically realign code * port failing dequant callbacks from mul_mm * Fix array length mismatches * vulkan: avoid using workgroup size before it is referenced * tests: increase timeout for Vulkan llvmpipe backend --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-02-03 22:00:57 +02:00
Jeff Bolz	ba523d5e22	vulkan: sort shaders for more deterministic binary (llama/11315) Fixes #11306.	2025-02-03 22:00:57 +02:00
Jeff Bolz	3736706139	vulkan: fix diag_mask_inf (llama/11323) With robustbufferaccess disabled, this shader was showing OOB stores. There is a bounds check in the code, but the workgroup dimensions were reversed vs CUDA and it was running the wrong number of threads. So fix the workgroup dimensions and disable robustness for this pipeline.	2025-02-03 22:00:57 +02:00
Jeff Bolz	0dcada42d4	vulkan: fix coopmat2 validation failures (llama/11284) mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert; this is not in the inner loop and is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16. coopmat2 requires SPIR-V 1.6 (related to using LocalSizeId). LocalSizeId requires maintenance4 to be enabled, and SPIR-V 1.6 requires Vulkan 1.3.	2025-02-03 22:00:57 +02:00
Jeff Bolz	668306ff2b	vulkan: fix coopmat2 flash attention for non-contiguous inputs (llama/11281) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes #11268.	2025-02-03 22:00:57 +02:00
Jeff Bolz	7183a1eb72	vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (llama/11166) * vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl Shaders are based on cpy.cu. * vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32 * ggml: copy q->f32 assumes some contiguity in the destination	2025-02-03 22:00:57 +02:00
Jeff Bolz	09f3c66648	vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (llama/11206) Do masking on whole dwords, fetch all scales at once.	2025-02-03 22:00:57 +02:00
Jeff Bolz	62e2414620	vulkan: optimize coopmat2 q2_k dequant function (llama/11130)	2025-02-03 22:00:57 +02:00
Eve	164f13c6a9	vulkan: scale caching for k quants + misc fixes (llama/11081) * q6_k scale caching * 16 bit unpack * q4_k test (slow) * revert it * q3_k * q2_k * little stuff * try precalculating products of a and q2_k scales * Revert "try precalculating products of a and q2_k scales" This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b. * unpack should be u16, add vim swap to gitignore (about time) * better q4_k scales * q5_k * better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations * q2_k better dequant * q3_k optimizations * q3_k use hmask simd from cpu avx version * make the caches happy * q3_k separate out calculation * q2_k separate out * little stuff * use calc_superblock everywhere * q2_k optimize scale calculation * more barriers	2025-02-03 22:00:57 +02:00
Junil Kim	02aa86230a	fix: ggml: fix vulkan-shaders-gen build (llama/10448) * fix: ggml: fix vulkan-shaders-gen build The vulkan-shaders-gen target was not being built correctly in case of cross-compilation. Other outputs need to be built for the cross compile target, but vulkan-shaders-gen needs to be built for the host. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup - Add GGML_SHADERS_GEN_TOOLCHAIN CMake option. - Auto-detect host toolchain if not set. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup Use configure_file to generate host_toolchain.cmake from template * fix: ggml: Fix compile error Fix compile error not finding vulkan-shaders-gen * fix: vulkan-shaders-gen build and path handling Fix build issues with vulkan-shaders-gen: - Add target dependency for correct build order - Use CMAKE_HOST_SYSTEM_NAME for executable suffix - Fix MSVC output directory in host toolchain - Normalize path handling for cross-compilation * fix: improve host compiler detection in vulkan shader build Improve host compiler detection for vulkan shader generation: - Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches - Consolidate compiler detection logic - Fix Windows-specific MSVC detection - Ensure correct compiler search in cross-compilation * refactor: Simplify CMake function for detecting host compiler Simplified the CMake function to improve the process of detecting the host compiler. * fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt Since `vulkan-shader-gen.cpp` only requires the `glslc` executable and not the Vulkan headers or libraries, CMakeLists.txt needs to be corrected. (See: ecc93d0558fc3ecb8a5af69d2ece02fae4710ade) * refactor: Rename host_toolchain.cmake.in - Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in * refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN	2025-02-03 22:00:57 +02:00
0cc4m	cdb8aa2f2e	Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (llama/11161) * Vulkan: Remove float16 use in shaders * Fix validation error about subgroup_size_control extension	2025-01-14 10:38:01 +02:00
Mathieu Baudier	b08c3a88c8	Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (llama/11117) * Disable GL_KHR_cooperative_matrix Vulkan extension if not available. * Perform Vulkan extensions checks in a more sensible order * Remove unnecessary #ifdef directive	2025-01-14 10:38:01 +02:00
Jeff Bolz	cea5f1c52f	vulkan: optimize mul_mat for small values of N (llama/10991) Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better. Share some code for reducing the result values to memory in mul_mat_vec_base.	2025-01-04 10:45:01 +02:00
Jeff Bolz	2112462db4	vulkan: im2col and matmul optimizations for stable diffusion (llama/10942) * tests: Add im2col perf tests * vulkan: optimize im2col, more elements per thread * vulkan: increase small tile size for NV_coopmat2 * vulkan: change im2col to 512 elements per workgroup	2025-01-04 10:45:01 +02:00
Jeff Bolz	fc84ecd445	vulkan: Use push constant offset to handle misaligned descriptors (llama/10987)	2025-01-04 10:45:01 +02:00
Eve	8de1e99907	vulkan: multi-row k quants (llama/10846) * multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default	2025-01-04 10:45:01 +02:00
Peter	499af9294a	examples, ggml : fix GCC compiler warnings (llama/10983) Warning types fixed (observed under MSYS2 GCC 14.2.0): * format '%ld' expects argument of type 'long int', but argument has type 'size_t' * llama.cpp/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first)	2025-01-04 10:45:01 +02:00
Jeff Bolz	39c205f555	vulkan: optimize coopmat2 dequant functions (llama/10855) Change the code to do 16b loads when possible and extract the appropriate component late, so the code is effectively decoding a pair of elements and then selecting one. This can allow more commoning to happen in the compiler when neighboring elements are loaded.	2025-01-04 10:45:01 +02:00
Zhiyuan Li	a1ab9b5e91	rwkv6: add wkv6 support for Vulkan backend (llama/10829) * rwkv_wkv6 vulkan shader * RWKV_WKV6 Vulkan op tests passed Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * add [[unroll]] and remove unnecessary conditions * add uma support * fix errors in EditorConfig Checker --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Molly Sophia <mollysophia379@gmail.com>	2024-12-18 12:52:16 +02:00
Eve	c21fb10b28	vulkan: small mul_mat_vec optimizations (llama/10665) * double the number of rows per workgroup * Update ggml-vulkan.cpp * Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats * only increase the number of rows for amd and subgroup size 64 * fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested * use subgroup min and max to check for gcn (requires https://github.com/ggerganov/llama.cpp/pull/10721) * manual merge ggml-vulkan.cpp * set min and max subgroup size in any case * Also double the number of rows for Intel GPUs	2024-12-18 12:52:16 +02:00
0cc4m	e5e951672e	Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders (llama/10798)	2024-12-18 12:52:16 +02:00
Jeff Bolz	479bd77169	vulkan: request round-to-even for fp16 in im2col/rope_head (llama/10767) Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls feature allows rounding mode to be requested if the implementation supports it.	2024-12-18 12:52:16 +02:00
Eve	d8bf63a41b	vulkan: dynamic subgroup size for the remaining k quants (llama/10745) * q5_k q4_k q3_k q2_k q6_k multi row example * revert as multi row isn't faster for k quants	2024-12-18 12:52:16 +02:00
Jeff Bolz	86346f811e	vulkan: disable spirv-opt for coopmat shaders (llama/10763) There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly necessary anyway. Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it changes. Fix coopmat support reporting when glslc doesn't support NV_coopmat2.	2024-12-18 12:52:16 +02:00
stduhpf	9ffbd3d969	Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (llama/10723) * Vulkan: fix NaN in tanh.comp * Faster NaN-free tanh	2024-12-18 12:52:16 +02:00
Jeff Bolz	6585a890b4	vulkan: compile a test shader in cmake to check for coopmat2 support (llama/10713)	2024-12-18 12:52:16 +02:00
0cc4m	4a6d52efe6	Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processing (llama/10597) * Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader * Improve performance with better q4_k and q5_k dequant and store unrolling * Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection * Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device * Vulkan: Implement accumulator switch for specific mul mat mat shaders * Vulkan: Unroll more loops for more mul mat mat performance * Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic * Disable coopmat support on AMD proprietary driver * Remove redundant checks * Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support * Fix rebase typo * Fix coopmat2 MUL_MAT_ID pipeline selection	2024-12-18 12:52:16 +02:00
Jeff Bolz	b74b68212a	vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (llama/10206)	2024-12-18 12:52:16 +02:00
gn64	c4aed6831e	vulkan : fix soft_max.comp division by zero (#2633) This change prevents a division by zero error when p.KY is 0.	2024-12-16 12:34:38 +02:00
Jeff Bolz	491ec076b4	vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (llama/10642)	2024-12-08 20:14:35 +02:00
Jeff Bolz	015ecd0001	vulkan: optimize and reenable split_k (llama/10637) Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the shaders.	2024-12-08 20:14:35 +02:00
Diego Devesa	3daeacad24	ggml : move AMX to the CPU backend (llama/10570) ggml : automatic selection of best CPU backend (llama/10606)	2024-12-08 20:14:35 +02:00
Eve	30e35d7271	vulkan: Dynamic subgroup size support for Q6_K mat_vec (llama/10536) * subgroup 64 version with subgroup add. 15% faster scalable version tested for subgroup sizes 16-128 * check for subgroup multiple of 16 and greater than 16 * subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45) * force 16 sequential threads per block * make 16 subgroup size a constant	2024-12-08 20:14:35 +02:00
Jeff Bolz	6463e36369	vulkan: define all quant data structures in types.comp (llama/10440)	2024-12-08 20:14:35 +02:00
Jeff Bolz	ab5d4d93ec	vulkan: further optimize q5_k mul_mat_vec (llama/10479)	2024-12-08 20:14:35 +02:00
Jeff Bolz	2d6e9dd723	vulkan: skip integer div/mod in get_offsets for batch_idx==0 (llama/10506)	2024-12-08 20:14:35 +02:00
Jeff Bolz	2f16e51553	vulkan: optimize Q2_K and Q3_K mul_mat_vec (llama/10459)	2024-12-08 20:14:35 +02:00
Jeff Bolz	5e1fcc1780	vulkan: fix group_norm (llama/10496) Fix bad calculation of the end of the range. Add a backend test that covers the bad case (taken from stable diffusion). Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439.	2024-12-08 20:14:35 +02:00
Junil Kim	78dfec6bc5	vulkan: Fix a vulkan-shaders-gen argument parsing error (llama/10484) The vulkan-shaders-gen was not parsing the --no-clean argument correctly. The previous code only parsed arguments that take a value, and --no-clean does not take one, so it was not parsed correctly. This commit correctly parses arguments that don't have values.	2024-12-08 20:14:35 +02:00
Jeff Bolz	04662748aa	vulkan: predicate max operation in soft_max shaders/soft_max (llama/10437) Fixes #10434	2024-12-08 20:14:35 +02:00
Jeff Bolz	a117279e13	vulkan: copy iq4_nl LUT into shared memory (llama/10409)	2024-12-08 20:14:35 +02:00
Jeff Bolz	bbb292ed38	vulkan: further optimize mul_mat_vec using larger loads (llama/10387) * vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec. Add some early returns for nonexistent rows in mul_mat_vec shaders. These can only be hit when dispatching a 2D grid of workgroups. Fix the logic for the 2D grid of workgroups to round up. Enable the pipeline robustness extension if it's available, and use it to disable robustness for these pipelines. The instructions to do the bounds checking contend for the same ALU resources as the bit twiddling dequant instructions. * vulkan: Add GLSL structure aliases for quant types to allow larger loads In Vulkan it's not possible to cast pointer types, so instead you have to declare an aliased binding for the memory with a different type. This commit adds aliases for the quant formats using 16b ints, and in a few places where the struct size is a multiple of 4 also using 32b ints. Currently only q4_k's aliases are used, but others will be used in subsequent commits. * vulkan: use larger loads in q5_k and q6_k shaders. Similar to the optimization I did in q4_k recently, this vectorizes some loads and reduces the number of bit twiddling instructions. * vulkan: use larger K step per iteration in mul_mat_vec. Add vec4 dequantization functions, and use them to do K=8 per iteration in mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B which helps reduce the load on the memory system. The K_PER_ITER==2 logic is still there, just for F16/F32, and really only because they support unaligned sizes. Tweak the num_iters/unrolling logic to be simpler and catch a couple missed unrolling opportunities.	2024-12-08 20:14:35 +02:00
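Commit 491ec076b4 describes a "fast divide" (mul+shift) for unary ops like copy. Those ops must split a flat element index into tensor coordinates, which means an integer divide and modulo by the same tensor extents for every element. The C sketch below illustrates the general technique under stated assumptions: it uses the round-up method from Granlund and Montgomery's "Division by Invariant Integers using Multiplication", and it is not the ggml-vulkan source. `fastdiv_params`, `fastdiv_init`, `fastdiv` and `fastmod` are hypothetical names. On a GPU, the high half of the product would come from something like GLSL's `umulExtended`; here a 64-bit multiply stands in for it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: precomputed constants for dividing 32-bit values by a fixed d. */
typedef struct {
    uint32_t mp; /* magic multiplier m' */
    uint32_t L;  /* shift amount, ceil(log2(d)) */
} fastdiv_params;

/* Computed once on the host, e.g. when filling push constants (assumption). */
static fastdiv_params fastdiv_init(uint32_t d) {
    assert(d != 0);
    fastdiv_params f;
    uint32_t L = 0;
    while (L < 32 && ((uint64_t)1 << L) < d) {
        L++; /* smallest L with 2^L >= d */
    }
    f.L = L;
    /* m' = floor(2^32 * (2^L - d) / d) + 1; the result always fits in 32 bits */
    f.mp = (uint32_t)((((uint64_t)1 << 32) * (((uint64_t)1 << L) - d)) / d + 1);
    return f;
}

/* Per element: one high multiply, one add and one shift instead of a divide. */
static uint32_t fastdiv(uint32_t n, fastdiv_params f) {
    uint64_t hi = ((uint64_t)n * f.mp) >> 32; /* high 32 bits of n * m' */
    return (uint32_t)((hi + n) >> f.L);        /* 64-bit sum cannot overflow */
}

/* n % d, recovered from the quotient without a second division. */
static uint32_t fastmod(uint32_t n, uint32_t d, fastdiv_params f) {
    return n - fastdiv(n, f) * d;
}
```

For example, `fastdiv_init(7)` yields `L == 3`, after which `fastdiv(100u, f)` returns 14 and `fastmod(100u, 7u, f)` returns 2. Computing `hi + n` in 64 bits keeps the sketch exact for every 32-bit `n`. A shader that stays in 32-bit arithmetic must either bound `n` or use the overflow-safe variant from the same paper.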

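Commit 78dfec6bc5 reports that vulkan-shaders-gen only handled options followed by a value, so a bare switch such as `--no-clean` was mis-parsed. The C sketch below is a hypothetical illustration of the corrected idea, not the actual tool's code (vulkan-shaders-gen is C++). An option consumes the next token as its value only when that token exists and is not itself an option; otherwise it is recorded as a flag with no value. `struct opt`, `parse_opts` and `find_opt` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical option record: name plus optional value (NULL for bare flags). */
struct opt {
    const char *name;
    const char *value;
};

/*
 * Collect "--name value" pairs and bare "--flag" switches from argv.
 * Returns the number of options stored (at most max).
 */
static int parse_opts(int argc, char **argv, struct opt *out, int max) {
    assert(argc >= 0 && (argv != NULL || argc == 0));
    int n = 0;
    for (int i = 1; i < argc && n < max; ++i) {
        if (strncmp(argv[i], "--", 2) != 0) {
            continue; /* positional tokens are ignored in this sketch */
        }
        out[n].name = argv[i];
        if (i + 1 < argc && strncmp(argv[i + 1], "--", 2) != 0) {
            out[n].value = argv[++i]; /* option with a value */
        } else {
            out[n].value = NULL;      /* bare flag such as --no-clean */
        }
        ++n;
    }
    return n;
}

/* Return the matching entry, or NULL if the option was not given. */
static const struct opt *find_opt(const struct opt *opts, int n, const char *name) {
    for (int i = 0; i < n; ++i) {
        if (strcmp(opts[i].name, name) == 0) {
            return &opts[i];
        }
    }
    return NULL;
}
```

With `argv = {"prog", "--glslc", "glslc", "--no-clean", "--output-dir", "out"}`, `parse_opts` records three options, and `--no-clean` gets a NULL value. Only `--no-clean` comes from the commit message; the other option names are illustrative. This heuristic cannot pass a value that itself begins with `--`. A stricter parser would consult a table listing which options take values.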

54 Commits