82 Commits

Author SHA1 Message Date
Acly
87b88ed01c
vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)
* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)

* review: remove src_x/y < 0 checks; add performance tests
2025-05-07 13:17:39 +03:00
Eve
cf3eb291ab vulkan: matmul gcn tuning (llama/13016)
* tune matmul for gcn

* this one is more power efficient

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m <picard12@live.de>

* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-04-24 20:39:16 +03:00
Jeff Bolz
27a56e7243 vulkan: support noncontiguous rms_norm (llama/13031) 2025-04-24 20:39:16 +03:00
Georgi Gerganov
36019c35a3 graph : make FA compatible with MLA + add initial Metal kernels (llama/12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
2025-04-24 20:39:16 +03:00
Jeff Bolz
7db8f278f0 vulkan: enable coopmat2 FA gqa and split_k optimizations more often (llama/12931)
The grouped query attention optmization doesn't require a power of two ratio,
the only thing relying on it was the modulo operation written as bitwise &.

split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicates a larger
FA operation that wouldn't need splitting.
2025-04-24 20:39:16 +03:00
Jeff Bolz
b9bfe0c693 vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (llama/12833)
q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
2025-04-24 20:39:16 +03:00
Diego Devesa
b9c71fae5a ggml : add bilinear upscale support (ggml/1185) 2025-04-24 20:39:16 +03:00
Jeff Bolz
d792d2a2dc vulkan: Use unclamped loads for flash attention mask (llama/12720)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
2025-04-24 20:39:16 +03:00
0cc4m
8add58aa5e Vulkan: Tune Vulkan mmq int dot shader for performance (llama/12767) 2025-04-24 20:39:16 +03:00
Jeff Bolz
76231bda56 vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (llama/12630)
There seems to be a bubble waking up from waitForFences, which costs a few
percent performance and also increased variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete and we
waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting
for the final fence to be signaled.
2025-04-24 20:39:16 +03:00
Jeff Bolz
b243416918 vulkan: Implement split_k for coopmat2 flash attention. (llama/12627)
When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
2025-04-24 20:39:16 +03:00
Jeff Bolz
2105b110d3 vulkan: Implement grouped query attention in the coopmat2 FA shader (llama/12559)
When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each, now we will
run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.
2025-04-24 20:39:16 +03:00
Wagner Bruna
801d6bd809 vulkan: fix build when glslc doesn't support coopmat (llama/12683) 2025-04-02 15:51:57 +03:00
0cc4m
0810f02547 Vulkan: Add DP4A MMQ and Q8_1 quantization shader (llama/12135)
* Vulkan: Add DP4A MMQ and Q8_1 quantization shader

* Add q4_0 x q8_1 matrix matrix multiplication support

* Vulkan: Add int8 coopmat MMQ support

* Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code

* Add GL_EXT_integer_dot_product check

* Remove ggml changes, fix mmq pipeline picker

* Remove ggml changes, restore Intel coopmat behaviour

* Fix glsl compile attempt when integer vec dot is not supported

* Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq

* Remove redundant comment

* Fix integer dot check

* Fix compile issue with unsupported int dot glslc

* Update Windows build Vulkan SDK version
2025-04-02 15:51:57 +03:00
Georgi Gerganov
27533e7f63 metal : improve FA + improve MoE (llama/12612)
* ggml : FA with different K, V head sizes (CPU)

ggml-ci

* metal : add FA with HS=192

* metal : extend FA to support different K and V head sizes

ggml-ci

* metal : add FA vector kernels for heads K 192 and V 128

ggml-ci

* ggml : restrict op on other backends to equal head sizes

ggml-ci

* metal : optimize FA-vec kernel

ggml-ci

* metal : FA remove mq registers

* metal : improve MoE mul_mat_id condition

ggml-ci

* metal : fix comments + remove unnecessary addition

ggml-ci

* metal : avoid too much shared memory usage with mul_mat_id

ggml-ci
2025-03-28 21:47:42 +02:00
Jeff Bolz
cbb88c4050 vulkan: Optimize mul_mat_vec p021 and nc shaders (llama/12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.
2025-03-27 11:06:03 +02:00
stduhpf
13455c0b5f Vulkan: RTE rounding for cpy to quant (llama/12480)
* Vulkan: RTE rounding for cpy to quant

Co-Authored-By: Jeff Bolz <jbolz@nvidia.com>

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
102af79f63 vulkan: Submit once enough matmul work has been recorded (llama/12406)
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
2025-03-27 11:06:03 +02:00
0cc4m
fa72479cfb Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (llama/12434) 2025-03-27 11:06:03 +02:00
Molly Sophia
52c4c03b0a llama: Add support for RWKV v7 architecture (llama/12412)
* ggml: Add op l2_norm

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: Add op rwkv_wkv7

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: Add support for RWKV7 and ARWKV7 models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix inference with RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: add more (a)rwkv7 variants in size

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Apply code-format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* fix MUSA build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
b3f3779c1b vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (llama/12312) 2025-03-27 11:06:03 +02:00
Daniele
13eeebb1b2 vulkan: subgroup size tuning (llama/12087)
* vulkan: subgroup size test

* Vulkan: Add device architecture enum and logic to recognize AMD generations

* vulkan: use new architecture logic to specify subgroup size

* Initial vulkan subgroup size tuning for RDNA3

* vulkan: commonize RDNA subgroup tuning

* vulkan: override subgroup size if required_subgroup_size = 0

* vulkan: disable warp 32 for RDNA3

* vulkan: fine tuned RDNA1 subgroup sizes

* vulkan: adjusted subgroup size map

* vulkan: fixed RDNA2 subgroup map

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-03-27 11:06:03 +02:00
Jeff Bolz
2cd3061a23 vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (llama/12273)
* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking
2025-03-27 11:06:03 +02:00
Jeff Bolz
88d59e21b2 vulkan: Adjust coopmat2 tile sizes and selection heuristic (llama/12258) 2025-03-27 11:06:03 +02:00
Georgi Gerganov
54a54faee4 vulkan : sync (llama/0)
ggml-ci
2025-03-08 15:13:01 +02:00
William Tambellini
c98681e6d5 ggml : upgrade init_tensor API to return a ggml_status (llama/11854)
* Upgrade init_tensor API to return a ggml_status

To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return a OOM status),
as agreeed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.

* misc fixes

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-03-08 15:13:01 +02:00
Rémy O
3bab804981 vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (llama/11595)
* vulkan: implement specialized MMV kernels for IQ2 quantizations

* vulkan: add MMV kernels for IQ3 quants

* vulkan: Increase MMV batch size and unroll IQ LUT setup

* vulkan: fix init_iq_shmem for WG sizes larger than tables

* vulkan: common batch size for all I-quants
2025-03-08 15:13:01 +02:00
Jeff Bolz
a0f76b2da7 vulkan: fix assertion when qy_needs_dequant (llama/12068)
Looks like a copy/paste bug from qx_needs_dequant.
2025-03-08 15:13:01 +02:00
cmdr2
6ac8e6b2ce cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)
* cuda: restrict SILU_BACK to fp32, since fp16 exceeds the desired test threshold

* vulkan: specify fp32-only support for certain ops (that are now tested for fp16 as well)

* f32 sigmoid in vulkan supports op

* Revert "f32 sigmoid in vulkan supports op"

This reverts commit c6f04b3c19bf4504c2776149c6d8cd84e0b48acb.
2025-03-08 15:13:01 +02:00
Rémy O
37a21dd43d vulkan: implement several ops relevant for ggml_opt (llama/11769)
* vulkan: support memset_tensor

* vulkan: support GGML_OP_SUM

* vulkan: implement GGML_OP_ARGMAX

* vulkan: implement GGML_OP_SUB

* vulkan: implement GGML_OP_COUNT_EQUAL

* vulkan: implement GGML_OP_OPT_STEP_ADAMW

* vulkan: fix check_results RWKV_WKV6 crash and memory leaks

* vulkan: implement GGML_OP_REPEAT_BACK

* tests: remove invalid test-backend-ops REPEAT_BACK tests

* vulkan: fix COUNT_EQUAL memset using a fillBuffer command
2025-02-27 08:55:36 +02:00
Jeff Bolz
8a22a8b17f vulkan: support multi/vision rope, and noncontiguous rope (llama/11902) 2025-02-27 08:55:36 +02:00
Rémy O
1689aaf854 vulkan: initial support for IQ1_S and IQ1_M quantizations (llama/11528)
* vulkan: initial support for IQ1_S and IQ1_M quantizations

* vulkan: define MMV kernels for IQ1 quantizations

* devops: increase timeout of Vulkan tests again

* vulkan: simplify ifdef for init_iq_shmem
2025-02-27 08:55:36 +02:00
Eve
e22d69839d vulkan: linux builds + small subgroup size fixes (llama/11767)
* mm subgroup size

* upload vulkan x86 builds
2025-02-27 08:55:36 +02:00
Danny Milosavljevic
db6e19188a vulkan: Make Vulkan optional at runtime (ggml/11493). (llama/11494)
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-27 08:55:36 +02:00
Wagner Bruna
b4b063a5c9 vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (llama/11592) 2025-02-27 08:55:36 +02:00
Jeff Bolz
930b739e7a vulkan: account for lookup tables when checking shared memory size (llama/11502) 2025-02-27 08:55:36 +02:00
Jeff Bolz
be83f342fb vulkan: print shared memory size (llama/11719) 2025-02-27 08:55:36 +02:00
Rémy O
6f08b24146 vulkan: initial support for IQ4_XS quantization (llama/11501) 2025-02-27 08:55:36 +02:00
Jeff Bolz
7c165d7fa8 vulkan: use smaller combined allocations to avoid fragmentation (llama/11551) 2025-02-27 08:55:36 +02:00
Johannes Gäßler
bae6bbf487 CUDA: non-contiguous (RMS) norm support (llama/11659)
* CUDA: non-contiguous (RMS) norm support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-27 08:55:36 +02:00
Rémy Oudompheng
80fa576254 vulkan: implement initial support for IQ2 and IQ3 quantizations (llama/11360)
* vulkan: initial support for IQ3_S

* vulkan: initial support for IQ3_XXS

* vulkan: initial support for IQ2_XXS

* vulkan: initial support for IQ2_XS

* vulkan: optimize Q3_K by removing branches

* vulkan: implement dequantize variants for coopmat2

* vulkan: initial support for IQ2_S

* vulkan: vertically realign code

* port failing dequant callbacks from mul_mm

* Fix array length mismatches

* vulkan: avoid using workgroup size before it is referenced

* tests: increase timeout for Vulkan llvmpipe backend

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-03 22:00:57 +02:00
Jeff Bolz
75e7d0585e vulkan: Catch pipeline creation failure and print an error message (llama/11436)
* vulkan: Catch pipeline creation failure and print an error message

Also, fix some warnings from my on-demand compile change.

* vulkan: fix pipeline creation logging
2025-02-03 22:00:57 +02:00
Jeff Bolz
7230a6e1c8 vulkan: compile shaders on-demand (llama/11406)
Reduce first-run startup time and memory consumption.

Should fix #11339.
2025-02-03 22:00:57 +02:00
amd-dwang
16eeb31933 Vulkan-run-test: fix mmq_wg_denoms (llama/11343)
There should be a copy-and-paste error here.

*mmq_wg_denoms should be used together with *warptile_mmq, instead of
wg_denoms.
2025-02-03 22:00:57 +02:00
Jeff Bolz
3736706139 vulkan: fix diag_mask_inf (llama/11323)
With robustbufferaccess disabled, this shader was showing OOB stores. There
is a bounds check in the code, but the workgrouop dimensions were reversed vs
CUDA and it was running the wrong number of threads. So fix the workgroup
dimensions and disable robustness for this pipeline.
2025-02-03 22:00:57 +02:00
Jeff Bolz
0dcada42d4 vulkan: fix coopmat2 validation failures (llama/11284)
mul mat and flash attention shaders were loading f32 types directly into
A/B matrices, which happens to work but is technically invalid usage.
For FA, we can load it as an Accumulator matrix and convert and this
is not in the inner loop and is cheap enough. For mul mat, it's more
efficient to do this conversion in a separate pass and have the input(s)
be f16.

coopmat2 requires SPIR-V 1.6 (related using to LocalSizeId). LocalSizeId
requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.
2025-02-03 22:00:57 +02:00
Jeff Bolz
668306ff2b vulkan: fix coopmat2 flash attention for non-contiguous inputs (llama/11281)
Add code similar to mul_mm_cm2 to force alignment of strides, to avoid
a performance regression.

Add noncontiguous FA tests in test-backend-ops.

Fixes #11268.
2025-02-03 22:00:57 +02:00
Jeff Bolz
7183a1eb72 vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (llama/11166)
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl

Shaders are based on cpy.cu.

* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32

* ggml: copy q->f32 assumes some contiguity in the destination
2025-02-03 22:00:57 +02:00
0cc4m
cdb8aa2f2e Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (llama/11161)
* Vulkan: Remove float16 use in shaders

* Fix validation error about subgroup_size_control extension
2025-01-14 10:38:01 +02:00
Molly Sophia
06209f6683 llama: add support for QRWKV6 model architecture (llama/11001)
llama: add support for QRWKV6 model architecture (llama/11001)

* WIP: Add support for RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Some graph simplification

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add support for RWKV6Qwen2 with cpu and cuda GLA

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix some typos

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix wkv test & add gla test

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix cuda warning

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update README.md

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update ggml/src/ggml-cuda/gla.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix fused lerp weights loading with RWKV6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* better sanity check skipping for QRWKV6 in llama-quant

thanks @compilade

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-14 10:38:01 +02:00