38 Commits

Author SHA1 Message Date
Eve
164f13c6a9 vulkan: scale caching for k quants + misc fixes (llama/11081)
* q6_k scale caching

* 16 bit unpack

* q4_k test (slow)

* revert it

* q3_k

* q2_k

* little stuff

* try precalculating products of a and q2_k scales

* Revert "try precalculating products of a and q2_k scales"

This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b.

* unpack should be u16, add vim swap to gitignore (about time)

* better q4_k scales

* q5_k

* better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations

* q2_k better dequant

* q3_k optimizations

* q3_k use hmask simd from cpu avx version

* make the caches happy

* q3_k separate out calculation

* q2_k separate out

* little stuff

* use calc_superblock everywhere

* q2_k optimize scale calculation

* more barriers
2025-02-03 22:00:57 +02:00
Junil Kim
02aa86230a fix: ggml: fix vulkan-shaders-gen build (llama/10448)
* fix: ggml: fix vulkan-shaders-gen build

The vulkan-shaders-gen target was not being built correctly
in case of cross-compilation.
Other outputs need to be built for the cross compile target,
but vulkan-shaders-gen needs to be built for the host.

* refactor: ggml: Improve vulkan-shaders-gen toolchain setup

- Add GGML_SHADERS_GEN_TOOLCHAIN CMake option.
- Auto-detect host toolchain if not set.

* refactor: ggml: Improve vulkan-shaders-gen toolchain setup

Use configure_file to generate host_toolchain.cmake from template

* fix: ggml: Fix compile error

Fix compile error not finding vulkan-shaders-gen

* fix: vulkan-shaders-gen build and path handling

Fix build issues with vulkan-shaders-gen:
- Add target dependency for correct build order
- Use CMAKE_HOST_SYSTEM_NAME for executable suffix
- Fix MSVC output directory in host toolchain
- Normalize path handling for cross-compilation

* fix: improve host compiler detection in vulkan shader build

Improve host compiler detection for vulkan shader generation:
- Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches
- Consolidate compiler detection logic
- Fix Windows-specific MSVC detection
- Ensure correct compiler search in cross-compilation

* refactor: Simplify CMake function for detecting host compiler

Simplified the CMake function to improve the process of detecting the host compiler.

* fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt

Since `vulkan-shader-gen.cpp` only requires the `glslc` executable
and not the Vulkan headers or libraries, CMakeLists.txt needs to
be corrected.
(See: ecc93d0558fc3ecb8a5af69d2ece02fae4710ade)

* refactor: Rename host_toolchain.cmake.in

- Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in

* refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN

Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN
2025-02-03 22:00:57 +02:00
0cc4m
cdb8aa2f2e Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (llama/11161)
* Vulkan: Remove float16 use in shaders

* Fix validation error about subgroup_size_control extension
2025-01-14 10:38:01 +02:00
Mathieu Baudier
b08c3a88c8 Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (llama/11117)
* Disable GL_KHR_cooperative_matrix Vulkan extension if not available.

* Perform Vulkan extensions checks in a more sensible order

* Remove unnecessary #ifdef directive
2025-01-14 10:38:01 +02:00
Jeff Bolz
cea5f1c52f vulkan: optimize mul_mat for small values of N (llama/10991)
Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where
the batch_strides are overloaded to hold the row strides. Put the loads from the
B matrix in the innermost loop because it should cache better.

Share some code for reducing the result values to memory in mul_mat_vec_base.
2025-01-04 10:45:01 +02:00
Jeff Bolz
2112462db4 vulkan: im2col and matmul optimizations for stable diffusion (llama/10942)
* tests: Add im2col perf tests

* vulkan: optimize im2col, more elements per thread

* vulkan: increase small tile size for NV_coopmat2

* vulkan: change im2col to 512 elements per workgroup
2025-01-04 10:45:01 +02:00
Jeff Bolz
fc84ecd445 vulkan: Use push constant offset to handle misaligned descriptors (llama/10987) 2025-01-04 10:45:01 +02:00
Eve
8de1e99907 vulkan: multi-row k quants (llama/10846)
* multi row k quant shaders!

* better row selection

* more row choices

* readjust row selection

* rm_kq=2 by default
2025-01-04 10:45:01 +02:00
Peter
499af9294a examples, ggml : fix GCC compiler warnings (llama/10983)
Warning types fixed (observed under MSYS2 GCC 14.2.0):
* format '%ld' expects argument of type 'long int', but argument has type 'size_t'
* llama.cpp/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers]  (emitted for all struct field except first)
2025-01-04 10:45:01 +02:00
Jeff Bolz
39c205f555 vulkan: optimize coopmat2 dequant functions (llama/10855)
Change the code to do 16b loads when possible and extract the appropriate
component late, so the code is effectively decoding a pair of elements and
then selecting one. This can allow more commoning to happen in the compiler
when neighboring elements are loaded.
2025-01-04 10:45:01 +02:00
Zhiyuan Li
a1ab9b5e91 rwkv6: add wkv6 support for Vulkan backend (llama/10829)
* rwkv_wkv6 vulkan shader

* RWKV_WKV6 Vulkan op tests passed

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Apply code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* add [[unroll]] and remove unnecessary conditions

* add uma support

* fix erros in EditorConfig Checker

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Molly Sophia <mollysophia379@gmail.com>
2024-12-18 12:52:16 +02:00
Eve
c21fb10b28 vulkan: small mul_mat_vec optimizations (llama/10665)
* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for amd and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for gcn (requires https://github.com/ggerganov/llama.cpp/pull/10721)

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs
2024-12-18 12:52:16 +02:00
0cc4m
e5e951672e Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders (llama/10798) 2024-12-18 12:52:16 +02:00
Jeff Bolz
479bd77169 vulkan: request round-to-even for fp16 in im2col/rope_head (llama/10767)
Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls
feature allows rounding mode to be requested if the implementation supports it.
2024-12-18 12:52:16 +02:00
Eve
d8bf63a41b vulkan: dynamic subgroup size for the remaining k quants (llama/10745)
* q5_k

q4_k

q3_k

q2_k

q6_k multi row example

* revert as multi row isnt faster for k quants
2024-12-18 12:52:16 +02:00
Jeff Bolz
86346f811e vulkan: disable spirv-opt for coopmat shaders (llama/10763)
There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly
necessary anyway.

Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it
changes.

Fix coopmat support reporting when glslc doesn't support NV_coopmat2.
2024-12-18 12:52:16 +02:00
stduhpf
9ffbd3d969 Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (llama/10723)
* Vulkan: fix NaN in tanh.comp

* Faster NaN-free tanh
2024-12-18 12:52:16 +02:00
Jeff Bolz
6585a890b4 vulkan: compile a test shader in cmake to check for coopmat2 support (llama/10713) 2024-12-18 12:52:16 +02:00
0cc4m
4a6d52efe6 Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processing (llama/10597)
* Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader

* Improve performance with better q4_k and q5_k dequant and store unrolling

* Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection

* Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device

* Vulkan: Implement accumulator switch for specific mul mat mat shaders

* Vulkan: Unroll more loops for more mul mat mat performance

* Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic

* Disable coopmat support on AMD proprietary driver

* Remove redundant checks

* Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support

* Fix rebase typo

* Fix coopmat2 MUL_MAT_ID pipeline selection
2024-12-18 12:52:16 +02:00
Jeff Bolz
b74b68212a vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (llama/10206) 2024-12-18 12:52:16 +02:00
gn64
c4aed6831e
vulkan : fix soft_max.comp division by zero (#2633)
Some checks failed
CI / ubuntu-latest-gcc (linux/ppc64le, Debug) (push) Has been cancelled
CI / ubuntu-latest-gcc (linux/ppc64le, Release) (push) Has been cancelled
CI / ubuntu-latest-clang (linux/amd64, Debug) (push) Has been cancelled
CI / ubuntu-latest-clang (linux/amd64, Release) (push) Has been cancelled
CI / ubuntu-latest-clang (linux/arm64, Debug) (push) Has been cancelled
CI / ubuntu-latest-clang (linux/arm64, Release) (push) Has been cancelled
CI / ubuntu-latest-clang (linux/ppc64le, Debug) (push) Has been cancelled
CI / ubuntu-latest-clang (linux/ppc64le, Release) (push) Has been cancelled
CI / ubuntu-latest-gcc-sanitized (linux/amd64, ADDRESS) (push) Has been cancelled
CI / ubuntu-latest-gcc-sanitized (linux/amd64, THREAD) (push) Has been cancelled
CI / ubuntu-latest-gcc-sanitized (linux/amd64, UNDEFINED) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl (linux/amd64, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl (linux/arm/v7, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl (linux/arm64, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl (linux/ppc64le, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl-fp16 (linux/amd64, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl-fp16 (linux/arm/v7, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl-fp16 (linux/arm64, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl-fp16 (linux/ppc64le, icx, icpx, ON) (push) Has been cancelled
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Has been cancelled
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Has been cancelled
CI / windows (Win32, Release, win32-x86, x86, 2.28.5, ON) (push) Has been cancelled
CI / windows (x64, Release, win32-x86-64, x64, 2.28.5, ON) (push) Has been cancelled
CI / windows-blas (Win32, ON, Release, x86, 2.28.5, ON) (push) Has been cancelled
CI / windows-blas (x64, ON, Release, x64, 2.28.5, ON) (push) Has been cancelled
CI / emscripten (Release) (push) Has been cancelled
CI / ios-xcode-build (Release) (push) Has been cancelled
CI / android (push) Has been cancelled
CI / quantize (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/main.Dockerfile platform:linux/amd64,linux/arm64 tag:main]) (push) Has been cancelled
This change prevents a division by zero error when p.KY is 0.
2024-12-16 12:34:38 +02:00
Jeff Bolz
491ec076b4 vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (llama/10642) 2024-12-08 20:14:35 +02:00
Jeff Bolz
015ecd0001 vulkan: optimize and reenable split_k (llama/10637)
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k
when there aren't enough workgroups to fill the shaders.
2024-12-08 20:14:35 +02:00
Diego Devesa
3daeacad24 ggml : move AMX to the CPU backend (llama/10570)
ggml : automatic selection of best CPU backend (llama/10606)
2024-12-08 20:14:35 +02:00
Eve
30e35d7271 vulkan: Dynamic subgroup size support for Q6_K mat_vec (llama/10536)
* subgroup 64 version with subgroup add. 15% faster

scalable version

tested for subgroup sizes 16-128

* check for subgroup multiple of 16 and greater than 16

* subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45)

* force 16 sequential threads per block

* make 16 subgroup size a constant
2024-12-08 20:14:35 +02:00
Jeff Bolz
6463e36369 vulkan: define all quant data structures in types.comp (llama/10440) 2024-12-08 20:14:35 +02:00
Jeff Bolz
ab5d4d93ec vulkan: further optimize q5_k mul_mat_vec (llama/10479) 2024-12-08 20:14:35 +02:00
Jeff Bolz
2d6e9dd723 vulkan: skip integer div/mod in get_offsets for batch_idx==0 (llama/10506) 2024-12-08 20:14:35 +02:00
Jeff Bolz
2f16e51553 vulkan: optimize Q2_K and Q3_K mul_mat_vec (llama/10459) 2024-12-08 20:14:35 +02:00
Jeff Bolz
5e1fcc1780 vulkan: fix group_norm (llama/10496)
Fix bad calculation of the end of the range. Add a backend test that
covers the bad case (taken from stable diffusion).

Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439.
2024-12-08 20:14:35 +02:00
Junil Kim
78dfec6bc5 vulkan: Fix a vulkan-shaders-gen arugment parsing error (llama/10484)
The vulkan-shaders-gen was not parsing the --no-clean argument correctly.
Because the previous code was parsing the arguments which have a value only
and the --no-clean argument does not have a value, it was not being parsed
correctly. This commit can now correctly parse arguments that don't have values.
2024-12-08 20:14:35 +02:00
Jeff Bolz
04662748aa vulkan: predicate max operation in soft_max shaders/soft_max (llama/10437)
Fixes #10434
2024-12-08 20:14:35 +02:00
Jeff Bolz
a117279e13 vulkan: copy iq4_nl LUT into shared memory (llama/10409) 2024-12-08 20:14:35 +02:00
Jeff Bolz
bbb292ed38 vulkan: further optimize mul_mat_vec using larger loads (llama/10387)
* vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec.

Add some early returns for nonexistent rows in mul_mat_vec shaders. These
can only be hit when dispatching a 2D grid of workgroups. Fix the logic
for the 2D grid of workgroups to round up.

Enable the pipeline robustness extension if it's available, and use it to
disable robustness for these pipelines. The instructions to do the bounds
checking contend for the same ALU resources as the bit twiddling dequant
instructions.

* vulkan: Add GLSL structure aliases for quant types to allow larger loads

In Vulkan it's not possible to cast pointer types, so instead you have to
declare an aliased binding for the memory with a different type. This
commit adds aliases for the quant formats using 16b ints, and in a few
places where the struct size is a multiple of 4 also using 32b ints.
Currently only q4_k's aliases are used, but others will be used in
subsequent commits.

* vulkan: use larger loads in q5_k and q6_k shaders.

Similar to the optimization I did in q4_k recently, this vectorizes some loads
and reduces the number of bit twiddling instructions.

* vulkan: use larger K step per iteration in mul_mat_vec.

Add vec4 dequantization functions, and use them to do K=8 per iteration in
mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B
which helps reduce the load on the memory system.

The K_PER_ITER==2 logic is still there, just for F16/F32, and really only
because they support unaligned sizes.

Tweak the num_iters/unrolling logic to be simpler and catch a couple missed
unrolling opportunities.
2024-12-08 20:14:35 +02:00
Jeff Bolz
45f1f9144f vulkan: Optimize soft_max (llama/10301)
* vulkan: Optimize soft_max

Large soft_max could already saturate memory, but small/medium sizes were
pretty slow. The bulk of the gains for them comes from using a smaller
workgroup size, and making the workgroup size match the subgroup size also
makes the barriers much cheaper.

Cache some values in locals to avoid refetching/recomputing. And stamp
out a few "template instantiations" so smaller cases will fully unroll.

Add a missing early return for OOB rows. This happens when there are more
than 512 rows and the dispatch is 512 x H.

* vulkan: Further soft_max optimizations

Restore the workgroup size of 512 case, use it for >1024.

Use unrollable loops for more iteration counts.
2024-11-20 21:00:08 +02:00
Jeff Bolz
48862c7b27 vulkan: remove use of null initializer (llama/10372)
Seems like this isn't working for vulkan-over-metal when the array is sized
by a spec constant. Maybe a spirv-cross limitation?
2024-11-20 21:00:08 +02:00
Jeff Bolz
1bebb1a116 vulkan: Optimize some mat-vec mul quant shaders (llama/10296)
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.
2024-11-20 21:00:08 +02:00
Diego Devesa
746bf2596f ggml : build backends as libraries (llama/10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-20 21:00:08 +02:00