Johannes Gäßler
6db0e01db6
CUDA: fix race conditions in FlashAttention kernels (llama/13438)
2025-05-13 13:59:21 +03:00
Johannes Gäßler
2d436bfbfb
CUDA: FA support for Deepseek (Ampere or newer) (llama/13306)
* CUDA: FA support for Deepseek (Ampere or newer)
* do loop unrolling via C++ template
2025-05-13 13:59:21 +03:00
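A minimal sketch of the template-based loop unrolling mentioned above, using a toy kernel (scale_rows and its names are illustrative, not the actual FlashAttention code): the column count is a template parameter, so the trip count is known at compile time and the compiler can fully unroll the loop.

    #include <cuda_runtime.h>

    // Illustrative only: with ncols as a template parameter the compiler can
    // fully unroll the inner loop instead of emitting a runtime loop.
    template <int ncols>
    __global__ void scale_rows(float * dst, const float * src, const float scale, const int nrows) {
        const int row = blockIdx.x*blockDim.x + threadIdx.x;
        if (row >= nrows) {
            return;
        }
    #pragma unroll
        for (int j = 0; j < ncols; ++j) {
            dst[row*ncols + j] = scale*src[row*ncols + j];
        }
    }

    // Host side: dispatch to the instantiation matching the runtime column count.
    void scale_rows_cuda(float * dst, const float * src, float scale, int nrows, int ncols, cudaStream_t stream) {
        const dim3 grid((nrows + 255)/256);
        switch (ncols) {
            case  64: scale_rows< 64><<<grid, 256, 0, stream>>>(dst, src, scale, nrows); break;
            case 128: scale_rows<128><<<grid, 256, 0, stream>>>(dst, src, scale, nrows); break;
            default:  break; // a non-unrolled fallback would go here
        }
    }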
R0CKSTAR
f9015b585b
musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (llama/12611)
* musa: fix all warnings
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: update ci doc (install ccache)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* fix Windows build issue
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-31 14:56:53 +03:00
Gaurav Garg
ae6a9bb9a5
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
- Determine the number of active blocks per SM with the cudaOccupancyMaxActiveBlocksPerMultiprocessor API and use that value to pick the optimal parallel_blocks value.
- Prefer the vector flash attention kernels over the MMA kernel for BS=1
Fixes Issue: #12182
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-27 11:06:03 +02:00
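The occupancy-based tuning described in this commit can be sketched roughly as follows (function and parameter names are illustrative, not the actual ggml-cuda code): query how many blocks of the kernel fit on one SM, multiply by the SM count, and grow parallel_blocks until a batch-size-1 launch fills roughly that many resident block slots.

    #include <algorithm>
    #include <cuda_runtime.h>

    // Placeholder that stands in for the real flash decoding kernel.
    __global__ void flash_decode_kernel() {}

    // Illustrative only: pick how many blocks to split the KV sequence over
    // so that a BS=1 launch still keeps every SM busy.
    int pick_parallel_blocks(int block_size, size_t smem_bytes, int blocks_in_grid) {
        int device = 0, nsm = 0, blocks_per_sm = 0;
        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&nsm, cudaDevAttrMultiProcessorCount, device);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, flash_decode_kernel, block_size, smem_bytes);

        // blocks_per_sm*nsm is how many blocks the GPU can run concurrently;
        // split the work until the grid roughly reaches that number.
        const int max_resident = blocks_per_sm*nsm;
        return std::max(1, max_resident / std::max(1, blocks_in_grid));
    }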
Johannes Gäßler
38ac47cd4d
CUDA: add option to compile without FlashAttention (llama/12025)
2025-02-27 08:55:36 +02:00
Johannes Gäßler
2d70cd36d7
CUDA: optimize FA for GQA + large batches (llama/12014)
2025-02-27 08:55:36 +02:00
Johannes Gäßler
f8a831779e
CUDA: use mma PTX instructions for FlashAttention (llama/11583)
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-03 22:00:57 +02:00
mahorozte
4af9626702
CUDA: remove unnecessary warp reduce in FA (ggml/1032)
* kqmax_new_j is the same in every thread within a warp after the operation at line 199, so this warp reduce can be omitted
* the same redundancy exists in the vec32 kernel
---------
Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>
2024-12-08 20:14:35 +02:00
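For context, a standard warp-wide max reduction in CUDA looks like the sketch below; when every lane of the warp already holds the same value (as the commit observes for kqmax_new_j after the preceding step), the shuffle loop returns its input unchanged, which is why the reduction could be removed.

    // Illustrative only: warp max reduction via XOR shuffles. With identical
    // values in all 32 lanes, fmaxf(x, x) == x on every iteration, so the
    // whole loop is a no-op and can be omitted.
    __device__ float warp_reduce_max(float x) {
    #pragma unroll
        for (int offset = 16; offset > 0; offset >>= 1) {
            x = fmaxf(x, __shfl_xor_sync(0xFFFFFFFFu, x, offset, 32));
        }
        return x;
    }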
Diego Devesa
746bf2596f
ggml : build backends as libraries (llama/10256)
* ggml : build backends as libraries
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-20 21:00:08 +02:00
Johannes Gäßler
936cf3beb7
ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)
2024-10-05 15:23:51 +03:00
Johannes Gäßler
24d8534bd8
CPU/CUDA: Gemma 2 FlashAttention support (llama/8542)
* CPU/CUDA: Gemma 2 FlashAttention support
* apply logit_softcap to scale in kernel
* disable logit softcapping tests on Metal
* remove metal check
2024-08-28 13:22:20 +03:00
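Gemma 2's logit softcapping squashes each attention logit through a scaled tanh; a minimal device-side sketch is below (the helper name is illustrative). The "apply logit_softcap to scale" bullet above likely refers to folding the 1/softcap factor into the existing Q*K scale so the kernel avoids an extra division per element.

    // Illustrative only: softcapping bounds the logit to (-softcap, softcap).
    __device__ __forceinline__ float apply_logit_softcap(float logit, float softcap) {
        return softcap * tanhf(logit / softcap);
    }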
Georgi Gerganov
e30c679928
whisper : reorganize source code + improve CMake (#2256)
* scripts : update sync [no ci]
* files : reorganize [no ci]
* sync : llama.cpp
* cmake : link math library
* cmake : build normal ggml library
* files : move headers to include
* objc : fix path to ggml-metal.h
* ci : fix WHISPER_CUDA -> GGML_CUDA
* scripts : sync LICENSE [no ci]
2024-06-26 19:34:09 +03:00