99596d6031
ggml-cpu : set openmp wait time if not set (llama/13758)
2025-05-27 18:03:00 +03:00
2d6c6862f7
ggml : add ggml_gelu_erf() CUDA kernel (llama/13719)
...
* ggml : add ggml_gelu_erf() CUDA kernel
* missing semicolon
2025-05-27 18:03:00 +03:00
f1576b2659
CUDA: fix race condition in FA vector kernels (llama/13742)
2025-05-27 18:03:00 +03:00
994b4f86ab
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705)
...
* [CANN]Support MUL_MAT_ID Q8 && Q4
Signed-off-by: noemotiovon <757486878@qq.com>
* codestyle adjustment
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-27 18:03:00 +03:00
3e7eaccf55
ggml : fix the order of ggml_unary_op (llama/13718)
2025-05-27 18:03:00 +03:00
191f040414
vulkan: support CPY from any type to itself (llama/13695)
...
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
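A minimal sketch of that idea, with illustrative helper names rather than the backend's actual ones:
```c
// Illustrative sketch, not the Vulkan backend's actual code: for a copy where
// src and dst share the same (possibly quantized) type, the payload is just
// raw bytes, so the existing f32 (4-byte) or f16 (2-byte) copy shader can be
// reused by scaling the dispatched element count to the tensor's byte size.
#include "ggml.h"

static uint64_t cpy_same_type_elem_count(const struct ggml_tensor * src, size_t * elem_size) {
    const size_t nbytes = ggml_nbytes(src);      // includes quantized block overhead
    *elem_size = (nbytes % 4 == 0) ? 4 : 2;      // prefer the f32 shader, fall back to f16
    return (uint64_t)(nbytes / *elem_size);
}
```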
2025-05-27 18:03:00 +03:00
2d49d4a9b5
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama/13696)
2025-05-27 18:03:00 +03:00
000d65befb
use LOG_WARN to replace std::cerr (llama/13657)
2025-05-27 18:03:00 +03:00
f0803e6646
sycl : Remove waits from function calls (llama/13702)
...
* removes the waits in async memcpy functions
2025-05-27 18:03:00 +03:00
730a00be8a
SYCL: Avoid using SYCL-Graph for unsupported nodes (llama/13587)
...
Currently, when running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0`
on a CUDA backend for SYCL, two operations throw an exception from the blocking
waits during queue recording.
* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074
We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](39e73ae0d6/ggml/src/ggml-cuda/ggml-cuda.cu (L2458-L2458))
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
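A minimal sketch of such a check, with illustrative names rather than the ones actually added in the PR:
```c
// Illustrative sketch, not the PR's actual code: before recording a SYCL-Graph,
// walk the ggml graph and reject ops whose SYCL implementation performs
// blocking waits on the queue being recorded. Assumes backend-internal access
// to ggml_cgraph (as in ggml-impl.h).
#include "ggml.h"
#include "ggml-impl.h"

static bool sycl_graph_compute_supported(const struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        switch (cgraph->nodes[i]->op) {
            case GGML_OP_CONCAT:     // blocking waits during queue recording
            case GGML_OP_MUL_MAT_ID: // blocking wait on a copy to host memory
                return false;        // fall back to regular (non-graph) execution
            default:
                break;
        }
    }
    return true;
}
```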
2025-05-27 18:03:00 +03:00
316600e8ee
opencl: Add support for multiple devices (llama/12622)
...
* opencl: Add support for multiple devices
... but limited to one platform. A platform with a GPU will be preferred.
Additionally:
* Filter out devices that lack capabilities needed by the backend
implementation (half support, OpenCL 2.0+, etc.); a capability-check sketch follows this list.
* Make ggml_backend_opencl_reg() thread-safe.
* fixup: fix an error in sync_with_other_backends
... when there is only one OpenCL device available.
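A rough capability-check sketch (illustrative only; the actual filtering logic in the backend may differ):
```c
// Illustrative sketch: reject OpenCL devices that lack what the backend needs,
// e.g. fp16 support and a recent enough OpenCL version. Real code should query
// the string sizes first instead of using fixed buffers.
#include <CL/cl.h>
#include <stdbool.h>
#include <string.h>

static bool opencl_device_is_usable(cl_device_id dev) {
    char version[256]     = {0};
    char extensions[4096] = {0};

    clGetDeviceInfo(dev, CL_DEVICE_VERSION,    sizeof(version),    version,    NULL);
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);

    // CL_DEVICE_VERSION has the form "OpenCL <major>.<minor> <vendor-specific>"
    const bool new_enough = strncmp(version, "OpenCL 2.", 9) == 0 ||
                            strncmp(version, "OpenCL 3.", 9) == 0;
    const bool has_fp16   = strstr(extensions, "cl_khr_fp16") != NULL;

    return new_enough && has_fp16;
}
```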
2025-05-27 18:03:00 +03:00
42f2b3bb65
opencl: fix couple crashes (llama/12795)
...
* opencl: fix couple crashes
* fix kernel launches that failed on devices which do not support
non-uniform work-groups. When non-uniform work-groups are not
supported, set `local_work_size` to NULL (i.e. let the driver choose the
work-group sizes); see the sketch after this list. This patch does not
cover everything, just the cases tested by test-backend-ops.
* fix sub-buffer creation that failed because `cl_buffer_region::origin` was
not aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
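A minimal sketch of the NULL `local_work_size` fallback (illustrative names, not the backend's):
```c
// Illustrative sketch: when the device does not support non-uniform
// work-groups, pass NULL as local_work_size so the driver picks a work-group
// size that evenly divides the global size.
#include <stddef.h>
#include <CL/cl.h>

static cl_int enqueue_1d(cl_command_queue queue, cl_kernel kernel,
                         size_t global_size, size_t preferred_local_size,
                         cl_bool non_uniform_wg_supported) {
    const size_t global[1] = { global_size };
    const size_t local[1]  = { preferred_local_size };

    return clEnqueueNDRangeKernel(queue, kernel, /*work_dim=*/1, /*offset=*/NULL,
                                  global,
                                  non_uniform_wg_supported ? local : NULL,
                                  0, NULL, NULL);
}
```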
2025-05-27 18:03:00 +03:00
dd6ef64060
ggml : add ggml_gelu_erf() (llama/13667)
...
* ggml : add ggml_gelu_na (not approximated)
* fix naming order
* rename na --> erf
* apply review suggestions
* revert naming order
2025-05-27 18:03:00 +03:00
131ee546ca
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (llama/13647)
...
* musa: fix build warning (unused parameter)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: upgrade MUSA SDK version to rc4.0.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Update ggml/src/ggml-cuda/cpy.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-05-27 18:03:00 +03:00
4712f7b663
vulkan: fix warnings (llama/13626)
...
* small fixes
* remove ifdef
2025-05-27 18:03:00 +03:00
926fe234e9
CUDA: skip fully masked-out KV in FA vec kernel (llama/13584)
...
* CUDA: skip fully masked-out KV in FA vec kernel
2025-05-27 18:03:00 +03:00
f44b53480f
sycl: disable reorder for sycl mulmat (llama/13536)
2025-05-27 18:03:00 +03:00
e04e8f1c79
metal : fix typo in FA kernel comments (llama/13651)
2025-05-27 18:03:00 +03:00
ee3f177cba
sycl : Remove workaround for mmap() allocation on Windows (llama/13482)
...
* Remove mmap workaround on Windows
After some testing I found that mmap is supported on Windows and for
many GPUs on Linux, so the workaround for Windows is removed since it is
no longer necessary.
* Update llama-bench README
The SYCL backend introduced a workaround that allows llama-bench to be run
without specifying the `--mmp 0` flag.
2025-05-27 18:03:00 +03:00
0b69f74e15
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607)
2025-05-27 18:03:00 +03:00
9da3fc27be
CANN: Support MOE Model MUL_MAT_ID (llama/13042)
...
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:58:39 +03:00
2c13651e08
cmake: use the current build config for vulkan-shaders-gen (llama/13595)
...
* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config`
2025-05-19 14:58:39 +03:00
13dca86c56
vulkan: move common FA code to flash_attn_base.comp (llama/13556)
...
* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix
2025-05-19 14:58:39 +03:00
6d61a09bc4
vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)
2025-05-19 14:58:39 +03:00
4fedad988b
metal : add FA-vec kernel for head size 64 (llama/13583)
...
ggml-ci
2025-05-19 14:58:39 +03:00
a8e17a244d
sycl : fixed compilation warnings (llama/13582)
2025-05-19 14:58:39 +03:00
0c76acd08a
gguf : use ggml log system (llama/13571)
...
* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages
2025-05-19 14:58:39 +03:00
27964db1be
sycl: simplify bin_bcast_kernel (llama/13383)
2025-05-19 14:58:39 +03:00
8081e7a23d
sycl: reordered Q4_K MMVQ (llama/13109)
2025-05-19 14:58:39 +03:00
d807c497a4
sycl: use oneDNN for matrices multiplication (llama/12972)
2025-05-19 14:58:39 +03:00
8e9bf548f4
arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)
...
This PR improves the q6_k_q8_k GEMM kernel with the arm64 i8mm instruction.
Tested on neoverse-n2 with a llama3 8b q6_k quantized model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity does not change with this PR. A minimal sketch of the i8mm
building block follows the benchmark below.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
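As referenced above, a hedged sketch of the i8mm building block (not the actual q6_k_q8_k kernel code):
```c
// Illustrative sketch: one SMMLA step. vmmlaq_s32 treats 'a_rows' and 'b_rows'
// as 2x8 int8 matrices and accumulates the 2x2 int32 product a * b^T into
// 'acc' (lanes: a0·b0, a0·b1, a1·b0, a1·b1). Requires a compiler target with
// the i8mm feature, e.g. -march=armv8.2-a+i8mm.
#include <arm_neon.h>

static inline int32x4_t mmla_step(int32x4_t acc, int8x16_t a_rows, int8x16_t b_rows) {
    return vmmlaq_s32(acc, a_rows, b_rows);
}
```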
2025-05-19 14:58:39 +03:00
0dda27bc0b
CUDA: fix crash on large batch size for quant. MoE (llama/13537)
2025-05-19 14:58:39 +03:00
ffa4720f25
CUDA: faster Deepseek FA, add Turing support (llama/13435)
2025-05-19 14:58:39 +03:00
9b8eea28b5
cmake: simplify vulkan shader test logic (llama/13263)
2025-05-19 14:58:39 +03:00
162bbe8220
vulkan: KHR_coopmat flash attention (llama/13506)
...
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
2025-05-19 14:58:39 +03:00
a221288dc6
vulkan: workaround FA compile failures on macos (llama/13517)
2025-05-19 14:58:39 +03:00
08436716ae
metal : use FA-vec kernel up to batch size 20 (llama/13496)
...
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci
* metal : use FA-vec kernel up to batch size 20
ggml-ci
2025-05-19 14:58:39 +03:00
e11fc21e6c
metal : optimize multi-sequence FA vec kernel (llama/13493)
...
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci
2025-05-19 14:58:39 +03:00
a77a924b20
ggml-cpu: Update KleidiAI to v1.6 and fix include directives (llama/13509)
...
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-05-19 14:58:39 +03:00
405b9c77ad
mnist: fix segmentation fault (ggml/1227)
2025-05-19 14:58:39 +03:00
9c3bfc1499
ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)
2025-05-19 14:58:39 +03:00
5b7797f674
ggml : Fix missing backtrace on Linux (ggml/1228)
...
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1, which
prevents the forked debugger child from attaching to its parent process
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols (a minimal
sketch of the ptrace_scope handling follows)
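A hedged sketch of how the yama ptrace_scope restriction can be handled (illustrative, not necessarily the exact ggml implementation):
```c
// Illustrative sketch: with /proc/sys/kernel/yama/ptrace_scope == 1 a child
// may not ptrace its parent unless the parent opts in via PR_SET_PTRACER.
// Here the parent opts in, and the forked child simply prints its own copy of
// the call stack with backtrace_symbols_fd as a debugger-free fallback.
#include <execinfo.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

static void print_backtrace_symbols(void) {
    void * frames[64];
    int    n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
}

static void print_backtrace(void) {
    pid_t child = fork();
    if (child == 0) {
        // child: its stack mirrors the parent's call chain up to fork()
        print_backtrace_symbols();
        _exit(0);
    }
    if (child > 0) {
#ifdef PR_SET_PTRACER
        // opt in so the child could ptrace/attach to us under ptrace_scope == 1
        // (not needed by the fallback above, but required if the child exec'd
        // gdb or lldb); harmless where yama is not enabled
        prctl(PR_SET_PTRACER, child, 0, 0, 0);
#endif
        waitpid(child, NULL, 0);
    }
}
```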
2025-05-19 14:58:39 +03:00
75e9a840c5
ggml : add mrope kernel for metal (llama/13457)
2025-05-13 13:59:21 +03:00
41ed62bdbc
metal : optimize MoE for large batches (llama/13388)
2025-05-13 13:59:21 +03:00
029c8837f8
opencl: remove unnecessary assert for add (llama/13257)
2025-05-13 13:59:21 +03:00
5d8b068249
llama/ggml: add LLM training support (llama/10544)
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-13 13:59:21 +03:00
93ef22657e
ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (llama/13053)
...
* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
* code review fixes
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
* adds a comment that clarifies barrier usage
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
---------
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
Co-authored-by: Charles Xu <charles.xu@arm.com>
2025-05-13 13:59:21 +03:00
866f685bbc
CUDA: fix misaligned synchronization in FA (llama/13469)
2025-05-13 13:59:21 +03:00
250bcc041a
enable dpcpp nightly builds with libraries (llama/13406)
2025-05-13 13:59:21 +03:00
90b17a99bf
CUDA: fix crash with partial offloading of MoE (llama/13439)
2025-05-13 13:59:21 +03:00