38 Commits

Author SHA1 Message Date
Daniel Bevenius
09846f4e12
whisper: remove MSVC warnings pragmas (#3090)
Some checks failed
CI / ubuntu-22-gcc-sanitized (linux/amd64, ADDRESS) (push) Has been cancelled
CI / ubuntu-22-gcc-sanitized (linux/amd64, THREAD) (push) Has been cancelled
CI / ubuntu-22-gcc-sanitized (linux/amd64, UNDEFINED) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl (linux/amd64, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl (linux/arm/v7, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl (linux/arm64, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl (linux/ppc64le, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl-fp16 (linux/amd64, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl-fp16 (linux/arm/v7, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl-fp16 (linux/arm64, icx, icpx, ON) (push) Has been cancelled
CI / ubuntu-22-cmake-sycl-fp16 (linux/ppc64le, icx, icpx, ON) (push) Has been cancelled
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Has been cancelled
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Has been cancelled
CI / windows (Win32, Release, win32-x86, x86, 2.28.5, ON) (push) Has been cancelled
CI / windows (x64, Release, win32-x86-64, x64, 2.28.5, ON) (push) Has been cancelled
CI / windows-blas (Win32, ON, Release, x86, 2.28.5, ON) (push) Has been cancelled
CI / windows-blas (x64, ON, Release, x64, 2.28.5, ON) (push) Has been cancelled
CI / windows-cublas (x64, Release, ON, 11.8.0, ON, 2.28.5) (push) Has been cancelled
CI / windows-cublas (x64, Release, ON, 12.2.0, ON, 2.28.5) (push) Has been cancelled
CI / emscripten (Release) (push) Has been cancelled
CI / android (push) Has been cancelled
CI / android_java (push) Has been cancelled
CI / quantize (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/main-musa.Dockerfile platform:linux/amd64 tag:main-musa]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/main.Dockerfile platform:linux/amd64 tag:main]) (push) Has been cancelled
Examples WASM / deploy-wasm-github-pages (push) Has been cancelled
CI / ios-xcode-build (Release) (push) Has been cancelled
CI / bindings-java (push) Has been cancelled
CI / release (push) Has been cancelled
CI / coreml-base-en (push) Has been cancelled
* ggml : remove MSVC warnings pragmas

This commit removes the MSVC-specific pragmas as these are now handled
in CMakeLists.txt.

* whisper : remove MSVC warning pragmas

This commit removes the MSVC-specific pragmas. These are now handled in
the CMakeLists.txt file.
2025-05-05 13:09:35 +02:00
R0CKSTAR
7017c1d37d musa: fix typo in cc control (llama/13144)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-01 13:29:02 +03:00
Alan Gray
d1d847f184 Simplify and improve CUDA graphs through use of indirect copy pointers (llama/9017)
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers

Previously there was complexity in the CUDA graphs implementation due
frequently changing parameters to copy kernels associated with K and V
cache pointers. This patch simplifies by using indirection to avoid
such parameters frequently changing, avoiding the need for frequent
graph updates.

Fixes #12152

* Addressed comments

* fix HIP builds

* properly sync to stream

* removed ggml_cuda_cpy_fn_ptrs

* move stream sync before free

* guard to only use indirection with graphs

* style fixes

* check for errors

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-04-24 20:39:16 +03:00
R0CKSTAR
f9015b585b musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (llama/12611)
* musa: fix all warnings

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update ci doc (install ccache)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* fix Windows build issue

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-31 14:56:53 +03:00
Slobodan Josic
e0c43b0bbf HIP: Add support for RDNA4 targets (llama/12372) 2025-03-27 11:06:03 +02:00
R0CKSTAR
a219941812 CUDA: Fix clang warnings (llama/12540)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
R0CKSTAR
d487a28ae1 musa: refine compute capability (llama/12493)
* musa: refine compute capability

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Gaurav Garg
cfc2560e41 cuda : enable CUDA Graph on CUDA Toolkit < 12.x (llama/12394)
* Enable CUDA Graph on CTK < 12.x

`cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA support in CTK version < 12.x by using older API if CTK < 12.x.

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA
2025-03-27 11:06:03 +02:00
uvos
394fae57c3 CUDA/HIP: refractor mmqv to unify the calculation of nwarps and rows per block between host and device code. (llama/12177)
refactor mmqv to unify the calculation of nwarps and rows per block between host and device code.

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-27 11:06:03 +02:00
David Huang
edd1d8686a HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (llama/12032)
Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check
Adds rocWMMA support to fattn-wmma-f16
2025-03-08 15:13:01 +02:00
Johannes Gäßler
38ac47cd4d CUDA: app option to compile without FlashAttention (llama/12025) 2025-02-27 08:55:36 +02:00
Bodhi
48f5e893f5 MUSA: support ARM64 and enable dp4a .etc (llama/11843)
* MUSA:  support ARM64 and enable __dp4a .etc

* fix cross entropy loss op for musa

* update

* add cc info log for musa

* add comment for the MUSA .cc calculation block

---------

Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>
2025-02-27 08:55:36 +02:00
Johannes Gäßler
51a3580c79 CUDA: use async data loading for FlashAttention (llama/11894)
* CUDA: use async data loading for FlashAttention

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-27 08:55:36 +02:00
Johannes Gäßler
556f773d53 CUDA: fix CUDART_VERSION checks (llama/11821) 2025-02-27 08:55:36 +02:00
Johannes Gäßler
1b67d72f87 CUDA: use arch list for compatibility check (llama/11775)
* CUDA: use arch list for feature availability check

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-27 08:55:36 +02:00
uvos
9906792ec3 CUDA/HIP: add support for selectable warp size to mmv (llama/11519)
CUDA/HIP: add support for selectable warp size to mmv
2025-02-03 22:00:57 +02:00
uvos
c49ee07ff4 HIP: add GGML_CUDA_CC_IS_* for amd familys as increasing cc archtectures for amd gpus are not supersets of eatch other (llama/11601)
This fixes a bug where RDNA1 gpus other than gfx1010 where not handled correctly
2025-02-03 22:00:57 +02:00
Johannes Gäßler
f8a831779e CUDA: use mma PTX instructions for FlashAttention (llama/11583)
* CUDA: use mma PTX instructions for FlashAttention

* __shfl_sync workaround for movmatrix

* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-03 22:00:57 +02:00
uvos
fc2e44490d HIP: Prepare reduction operators for wave 64 2025-02-03 22:00:57 +02:00
uvos
f41fdad200 CUDA/HIP: add warp_size to cuda_device_info 2025-02-03 22:00:57 +02:00
Haus1
028511d349 AMD: parse the architecture as supplied by gcnArchName (llama/11244)
The value provided by minor doesn't include stepping for AMD, parse the value returned by gcnArchName instead to retrieve an accurate ID.
2025-02-03 22:00:57 +02:00
uvos
a160fa0f3a Hip: disable VMM on hip as it seams that it dosent work in some configurations (llama/11420) 2025-02-03 22:00:57 +02:00
uvos
0282ad8fd1 hip : Add hipGraph and VMM support to ROCM (llama/11362)
* Add hipGraph support

* Enable VMM on rocm
2025-02-03 22:00:57 +02:00
Andreas Kieslinger
b82c8d76dc CUDA: rename macros to avoid conflicts with WinAPI (llama/10736)
* Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI. (e.g. CC_PASCAL, GPU architecture or WinAPI pascal compiler flag?)

* Reverts erroneous rename in SYCL-code.

* Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A.

* Renames the rest of the compute capability macros for consistency.
2024-12-18 12:52:16 +02:00
uvos
230e985633 Add some minimal optimizations for CDNA (llama/10498)
* Add some minimal optimizations for CDNA

* ggml_cuda: set launch bounds also for GCN as it helps there too
2024-12-08 20:14:35 +02:00
Diego Devesa
2a4b5c9d7e cuda : optimize argmax (llama/10441)
* cuda : optimize argmax

* remove unused parameter

ggml-ci

* fixup : use full warps

ggml-ci

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix ub

* ggml : check ne00 <= INT32_MAX in argmax and argsort

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-12-08 20:14:35 +02:00
Diego Devesa
746bf2596f ggml : build backends as libraries (llama/10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-20 21:00:08 +02:00
Johannes Gäßler
936cf3beb7 ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-05 15:23:51 +03:00
R0CKSTAR
13f41af43e musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (llama/9526)
* mtgpu: add mp_21 support

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: enable unified memory

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-24 19:45:08 +03:00
agray3
f0a7d65b3d Update CUDA graph on scale change plus clear nodes/params (llama/9550)
* Avoid using saved CUDA graph if scale changes and reset nodes/params on update

Fixes https://github.com/ggerganov/llama.cpp/issues/9451

* clear before resize
2024-09-24 19:45:08 +03:00
R0CKSTAR
49ac8872b4 cuda : organize vendor-specific headers into vendors directory (llama/8746)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-08-08 22:48:46 +03:00
R0CKSTAR
e471adcfa5 feat: Support Moore Threads GPU (llama/8383)
* Update doc for MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in Makefile

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in CMake

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* CUDA => MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* MUSA adds support for __vsubss4

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Fix CI build failure

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-08-08 22:48:46 +03:00
slaren
dd916a2852 ggml : reduce hash table reset cost (llama/8698)
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string
2024-08-08 22:48:46 +03:00
Jeroen Mostert
86506b0c5c Allow all RDNA2 archs to use sdot4 intrinsic (llama/8629)
The check gating the use of `__builtin_amdgc_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.
2024-08-08 22:48:46 +03:00
Daniel Bevenius
84493d7f3e cuda : suppress 'noreturn' warn in no_device_code (llama/8414)
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compilng with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-08-08 22:48:46 +03:00
Daniele
95e90823d9 Define and optimize RDNA1 (llama/8085) 2024-07-08 14:53:55 +03:00
Johannes Gäßler
e4bc83ab47 CUDA: refactor and optimize IQ MMVQ (llama/8215)
* CUDA: refactor and optimize IQ MMVQ

* uint -> uint32_t

* __dp4a -> ggml_cuda_dp4a

* remove MIN_CC_DP4A checks

* change default

* try CI fix
2024-07-08 14:53:55 +03:00
Georgi Gerganov
e30c679928
whisper : reorganize source code + improve CMake (#2256)
* scripts : update sync [no ci]

* files : reorganize [no ci]

* sync : llama.cpp

* cmake : link math library

* cmake : build normal ggml library

* files : move headers to include

* objc : fix path to ggml-metal.h

* ci : fix WHISPER_CUDA -> GGML_CUDA

* scripts : sync LICENSE [no ci]
2024-06-26 19:34:09 +03:00