2331 Commits

Author SHA1 Message Date
Georgi Gerganov
3c4d363872 ggml : fix MUL_MAT_ID repack with Q8_K (llama/12544)
* ggml : fix MUL_MAT_ID repack with Q8_K

ggml-ci

* ggml : improve repack templates

ggml-ci
2025-03-27 11:06:03 +02:00
Dan Johansson
15aa189329 ggml-cpu : update KleidiAI to v1.5.0 (llama/12568)
ggml-cpu : bug fix related to KleidiAI LHS packing

Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-03-27 11:06:03 +02:00
Akarshan Biswas
c53d5c9e85 SYCL: disable Q4_0 reorder optimization (llama/12560)
ggml-ci
2025-03-27 11:06:03 +02:00
lhez
ba6f584f30 opencl: simplify kernel embedding logic in cmakefile (llama/12503)
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
2025-03-27 11:06:03 +02:00
R0CKSTAR
a219941812 CUDA: Fix clang warnings (llama/12540)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
a2cc8c2666 vulkan: fix mul_mat_vec failure in backend tests (llama/12529)
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
new backend tests that hit this failure on NVIDIA GPUs.
2025-03-27 11:06:03 +02:00
Georgi Gerganov
388ed98220 ggml : fix quantized cpy op (llama/12310)
* ggml : fix quantized cpy op

ggml-ci

* tests : add cpy tests for all types

ggml-ci

* tests : add BF16 copy tests

ggml-ci

* tests : fix loop for same-type copy

ggml-ci

* tests : add option to permute the dst tensor

ggml-ci
2025-03-27 11:06:03 +02:00
R0CKSTAR
d487a28ae1 musa: refine compute capability (llama/12493)
* musa: refine compute capability

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
cbb88c4050 vulkan: Optimize mul_mat_vec p021 and nc shaders (llama/12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.
2025-03-27 11:06:03 +02:00
stduhpf
13455c0b5f Vulkan: RTE rounding for cpy to quant (llama/12480)
* Vulkan: RTE rounding for cpy to quant

Co-Authored-By: Jeff Bolz <jbolz@nvidia.com>

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-03-27 11:06:03 +02:00
Eve
2f77a9e9bd vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (llama/12472) 2025-03-27 11:06:03 +02:00
蕭澧邦
fa2b5249ff Fix build on Windows when ccache enabled (ggml/9954) (llama/9976)
* [SYCL] Fix build on Windows when ccache enabled (llama/9954)

* take effect only on windows and force it to icl

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-27 11:06:03 +02:00
Svetlozar Georgiev
5b854ebba5 sycl: cleanup oneDNN related code (llama/12097) 2025-03-27 11:06:03 +02:00
Srihari-mcw
8058f19d0b ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (llama/12332)
* Add block interleaving support for Q4_K quantization

* Remove whitespaces and fix CI/CD issues

* Update pointer of bsums from int16_t to const int16_t

* Add vector version of quantize_q8_K_4x8 function

* Update code formatting based on review comments
2025-03-27 11:06:03 +02:00
Gaurav Garg
ae6a9bb9a5 CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-27 11:06:03 +02:00
Jeff Bolz
24faba9e9b vulkan: optimize iq1 coopmat2 dequant functions (llama/12427) 2025-03-27 11:06:03 +02:00
Guus Waals
c722ff84d3 Fix visionOS build and add CI (llama/12415)
* ci: add visionOS build workflow

Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.

* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs

* ci: remove define hacks for u_xxx system types

---------

Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
102af79f63 vulkan: Submit once enough matmul work has been recorded (llama/12406)
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
2025-03-27 11:06:03 +02:00
lhez
03c364557d opencl: improve profiling (llama/12442)
* opencl: more profiling timing

* opencl: generate trace for profiling

* opencl: reduce profiling overhead

* Populate profiling timing info at the end rather than after each
  kernel run

* opencl: fix for chrome tracing
2025-03-27 11:06:03 +02:00
R0CKSTAR
31b62276cf musa: override warp_size of musa device to 32 (llama/12445)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Łukasz Ślusarczyk
97b5a3055d SYCL: using graphs is configurable by environment variable and compile option (llama/12371)
* alberto changes

* enable sycl graphs by env variable

* fixed compilation warnings in ggml-sycl.cpp

* renamed graph variables

* fix markdown in docs/backend/SYCL.md

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>

* fix markdown in docs/backend/SYCL.md again

* compiling graphs by default, renamed graph_enable to graph_disable

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-27 11:06:03 +02:00
fj-y-saito
9993c3f703 ggml : add SVE support for q6_K_q8_K (llama/12361) 2025-03-27 11:06:03 +02:00
0cc4m
fa72479cfb Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (llama/12434) 2025-03-27 11:06:03 +02:00
Łukasz Ślusarczyk
6c15539c54 fixed compilation warnings in ggml-sycl (llama/12424) 2025-03-27 11:06:03 +02:00
Molly Sophia
52c4c03b0a llama: Add support for RWKV v7 architecture (llama/12412)
* ggml: Add op l2_norm

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: Add op rwkv_wkv7

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: Add support for RWKV7 and ARWKV7 models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix inference with RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: add more (a)rwkv7 variants in size

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Apply code-format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* fix MUSA build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-03-27 11:06:03 +02:00
Gaurav Garg
cfc2560e41 cuda : enable CUDA Graph on CUDA Toolkit < 12.x (llama/12394)
* Enable CUDA Graph on CTK < 12.x

`cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA support in CTK version < 12.x by using older API if CTK < 12.x.

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA
2025-03-27 11:06:03 +02:00
Guus Waals
db6e8056b5 ggml-vulkan: remove unused find_program(glslc) (llama/12416)
It's already found by FindVulkan.cmake in the parent CMakeLists
2025-03-27 11:06:03 +02:00
Jeff Bolz
b3f3779c1b vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (llama/12312) 2025-03-27 11:06:03 +02:00
Daniele
13eeebb1b2 vulkan: subgroup size tuning (llama/12087)
* vulkan: subgroup size test

* Vulkan: Add device architecture enum and logic to recognize AMD generations

* vulkan: use new architecture logic to specify subgroup size

* Initial vulkan subgroup size tuning for RDNA3

* vulkan: commonize RDNA subgroup tuning

* vulkan: override subgroup size if required_subgroup_size = 0

* vulkan: disable warp 32 for RDNA3

* vulkan: fine tuned RDNA1 subgroup sizes

* vulkan: adjusted subgroup size map

* vulkan: fixed RDNA2 subgroup map

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-03-27 11:06:03 +02:00
Jeff Bolz
905b834af1 vulkan: use fp32 in coopmat2 q4_k dequant function (llama/12309) 2025-03-27 11:06:03 +02:00
Jeff Bolz
2cd3061a23 vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (llama/12273)
* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking
2025-03-27 11:06:03 +02:00
Jeff Bolz
88d59e21b2 vulkan: Adjust coopmat2 tile sizes and selection heuristic (llama/12258) 2025-03-27 11:06:03 +02:00
Christian Kastner
4917f122d4 cmake : enable building llama.cpp using system libggml (llama/12321)
* cmake: Factor out compiler flag function from ggml

llama.cpps's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).

* cmake: Enable building against system ggml

This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
2025-03-27 11:06:03 +02:00
Akarshan Biswas
16a1b77249 SYCL: set extras only on GGML_TYPE_Q4_0 (llama/12366)
* SYCL: set extras only on GGML_TYPE_Q4_0

* release tensor_extras in reset buffer interface
2025-03-27 11:06:03 +02:00
aubreyli
51d1398a0a SYCL: Delete redundant plus sign and space (llama/12391) 2025-03-27 11:06:03 +02:00
fairydreaming
3499dd83c0 SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (llama/12399)
* sycl : support non-contiguous tensors in binary ops

* sycl : silence unused variable warning

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-27 11:06:03 +02:00
Chenguang Li
7b7d9ae35e MUL_MAT optimization (llama/12382) 2025-03-27 11:06:03 +02:00
Alberto Cabrera Pérez
2dcb7181ff sycl : variable sg_size support for mmvq kernels (llama/12336) 2025-03-27 11:06:03 +02:00
uvos
96ab3b2465 CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (llama/12315)
When fattn-wmma was ported over to warp64 various bits that also touch fattn-vec where converted to
selectable warp size, however the fattn-vec kernels dont work with 64 wide warps for now, so we need
to avoid launching them with parameters for warp64
2025-03-27 11:06:03 +02:00
Jeff Bolz
08f32992d0 vulkan: fix bug in coopmat1 mul_mat_id (llama/12316)
* tests: run mul_mat_id with a larger N

* vulkan: fix bug in coopmat1 mul_mat_id
2025-03-27 11:06:03 +02:00
uvos
394fae57c3 CUDA/HIP: refractor mmqv to unify the calculation of nwarps and rows per block between host and device code. (llama/12177)
refactor mmqv to unify the calculation of nwarps and rows per block between host and device code.

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-27 11:06:03 +02:00
jklincn
0708835301 ggml-backend : fix backend search path (llama/12330)
* Fix backend search path

* replace .native() with '/'

* reverted .native()
2025-03-27 11:06:03 +02:00
BB-fat
774c519433 metal : Cache the Metal library at the device context level (llama/12265) 2025-03-27 11:06:03 +02:00
Eve
776cdceb9e mat vec double buffer (llama/12188) 2025-03-27 11:06:03 +02:00
R0CKSTAR
03d050481e musa: support new arch mp_31 and update doc (llama/12296)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Henry Linjamäki
3d60219622 opencl: use OpenCL C standard supported by the device (llama/12221)
This patch nudges the llama.cpp a bit to be supported on PoCL which
doesn't support OpenCL C CL2.0. The issue is solved by querying the
device for the supported OpenCL C versions and using the highest one
available.
2025-03-27 11:06:03 +02:00
Jason C.H
521d72d76e ggml-backend : make path_str compatible with C++20 (llama/12269) 2025-03-27 11:06:03 +02:00
Daniel Bevenius
9fb9025a40 ggml : skip intermediate .air file when compiling .metallib (llama/12247)
This commit updates the compilation of default.metallib to skip the
intermediate .air (Apple Intermediate Representation) file.

The motivation for this change is to simplify the custom command a
little and avoid generating and then removing the .air file.
2025-03-27 11:06:03 +02:00
Christian Kastner
3c2abb01e8 cmake: Enable specifying exact PowerPC CPU architecture (ggml/1138)
In the process, guard automatic CPU detection with GGML_NATIVE.

https://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html#index-mcpu-10
2025-03-27 11:06:03 +02:00
Christian Kastner
efd9407e22 cmake: Comment out GGML_BIN_DIR for now (ggml/1139)
Nothing installs to it yet, so when attempting to use the cmake package,
set_and_check() triggers an error if the directory doesn't already exist
for other reasons.
2025-03-27 11:06:03 +02:00