* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
* vulkan: Optimize mul_mat_vec p021 and nc shaders.
These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).
Using subgroupAdd in the p021 shader also helps, use that conditionally.
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
* Upgrade init_tensor API to return a ggml_status
To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return a OOM status),
as agreeed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.
* misc fixes
---------
Co-authored-by: slaren <slarengh@gmail.com>
* vulkan: implement specialized MMV kernels for IQ2 quantizations
* vulkan: add MMV kernels for IQ3 quants
* vulkan: Increase MMV batch size and unroll IQ LUT setup
* vulkan: fix init_iq_shmem for WG sizes larger than tables
* vulkan: common batch size for all I-quants
* cuda: restrict SILU_BACK to fp32, since fp16 exceeds the desired test threshold
* vulkan: specify fp32-only support for certain ops (that are now tested for fp16 as well)
* f32 sigmoid in vulkan supports op
* Revert "f32 sigmoid in vulkan supports op"
This reverts commit c6f04b3c19bf4504c2776149c6d8cd84e0b48acb.
* vulkan: initial support for IQ1_S and IQ1_M quantizations
* vulkan: define MMV kernels for IQ1 quantizations
* devops: increase timeout of Vulkan tests again
* vulkan: simplify ifdef for init_iq_shmem
* vulkan: initial support for IQ3_S
* vulkan: initial support for IQ3_XXS
* vulkan: initial support for IQ2_XXS
* vulkan: initial support for IQ2_XS
* vulkan: optimize Q3_K by removing branches
* vulkan: implement dequantize variants for coopmat2
* vulkan: initial support for IQ2_S
* vulkan: vertically realign code
* port failing dequant callbacks from mul_mm
* Fix array length mismatches
* vulkan: avoid using workgroup size before it is referenced
* tests: increase timeout for Vulkan llvmpipe backend
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
With robustbufferaccess disabled, this shader was showing OOB stores. There
is a bounds check in the code, but the workgrouop dimensions were reversed vs
CUDA and it was running the wrong number of threads. So fix the workgroup
dimensions and disable robustness for this pipeline.
mul mat and flash attention shaders were loading f32 types directly into
A/B matrices, which happens to work but is technically invalid usage.
For FA, we can load it as an Accumulator matrix and convert and this
is not in the inner loop and is cheap enough. For mul mat, it's more
efficient to do this conversion in a separate pass and have the input(s)
be f16.
coopmat2 requires SPIR-V 1.6 (related using to LocalSizeId). LocalSizeId
requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.
Add code similar to mul_mm_cm2 to force alignment of strides, to avoid
a performance regression.
Add noncontiguous FA tests in test-backend-ops.
Fixes#11268.
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl
Shaders are based on cpy.cu.
* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
* ggml: copy q->f32 assumes some contiguity in the destination
* q6_k scale caching
* 16 bit unpack
* q4_k test (slow)
* revert it
* q3_k
* q2_k
* little stuff
* try precalculating products of a and q2_k scales
* Revert "try precalculating products of a and q2_k scales"
This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b.
* unpack should be u16, add vim swap to gitignore (about time)
* better q4_k scales
* q5_k
* better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations
* q2_k better dequant
* q3_k optimizations
* q3_k use hmask simd from cpu avx version
* make the caches happy
* q3_k separate out calculation
* q2_k separate out
* little stuff
* use calc_superblock everywhere
* q2_k optimize scale calculation
* more barriers
* fix: ggml: fix vulkan-shaders-gen build
The vulkan-shaders-gen target was not being built correctly
in case of cross-compilation.
Other outputs need to be built for the cross compile target,
but vulkan-shaders-gen needs to be built for the host.
* refactor: ggml: Improve vulkan-shaders-gen toolchain setup
- Add GGML_SHADERS_GEN_TOOLCHAIN CMake option.
- Auto-detect host toolchain if not set.
* refactor: ggml: Improve vulkan-shaders-gen toolchain setup
Use configure_file to generate host_toolchain.cmake from template
* fix: ggml: Fix compile error
Fix compile error not finding vulkan-shaders-gen
* fix: vulkan-shaders-gen build and path handling
Fix build issues with vulkan-shaders-gen:
- Add target dependency for correct build order
- Use CMAKE_HOST_SYSTEM_NAME for executable suffix
- Fix MSVC output directory in host toolchain
- Normalize path handling for cross-compilation
* fix: improve host compiler detection in vulkan shader build
Improve host compiler detection for vulkan shader generation:
- Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches
- Consolidate compiler detection logic
- Fix Windows-specific MSVC detection
- Ensure correct compiler search in cross-compilation
* refactor: Simplify CMake function for detecting host compiler
Simplified the CMake function to improve the process of detecting the host compiler.
* fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt
Since `vulkan-shader-gen.cpp` only requires the `glslc` executable
and not the Vulkan headers or libraries, CMakeLists.txt needs to
be corrected.
(See: ecc93d0558fc3ecb8a5af69d2ece02fae4710ade)
* refactor: Rename host_toolchain.cmake.in
- Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in
* refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN
Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN
* Disable GL_KHR_cooperative_matrix Vulkan extension if not available.
* Perform Vulkan extensions checks in a more sensible order
* Remove unnecessary #ifdef directive