Commit Graph

666 Commits

Author SHA1 Message Date
b4b063a5c9 vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (llama/11592) 2025-02-27 08:55:36 +02:00
930b739e7a vulkan: account for lookup tables when checking shared memory size (llama/11502) 2025-02-27 08:55:36 +02:00
5981352bb5 ggml: Fix data race in ggml threadpool (llama/11736)
After the barrier in last iteration is executed, still the loop termination
condition will be executed. However main thread can destroy the cgraph object
and its nodes already, then another thread will access it, but the thing is already gone.
Also trouble can happen when n_nodes == 0 or abort is called, but I'm not sure if the
prior situation is possible.

Last syncronization should be done after the loop to ensure the cgraph/cplan won't be
accessed after the main thread exits from the function.
2025-02-27 08:55:36 +02:00
7561da244e CUDA: fix min. version for movmatrix (llama/11751) 2025-02-27 08:55:36 +02:00
be83f342fb vulkan: print shared memory size (llama/11719) 2025-02-27 08:55:36 +02:00
fd369871f7 SYCL: remove XMX info from print devices (llama/11712) 2025-02-27 08:55:36 +02:00
bbd8364f5e ggml : optimize and build warning fix for LoongArch (llama/11709)
* ggml : optimize convert f32<->f16 for loongarch_asx

* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16

* ggml : Fix warnings when run cpu CI locally on LoongArch
2025-02-27 08:55:36 +02:00
e4102440ef SYCL: Adjust support condition for norm operators (llama/11674)
SYCL does not support non contiguous tensors for norm operations
2025-02-27 08:55:36 +02:00
f8242ec483 ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701) 2025-02-27 08:55:36 +02:00
ef51b4cba4 vulkan: optimize coopmat2 iq2/iq3 callbacks (llama/11521)
* vulkan: optimize coopmat2 iq2/iq3 callbacks

* build: trigger CI on GLSL compute shader changes
2025-02-27 08:55:36 +02:00
6f08b24146 vulkan: initial support for IQ4_XS quantization (llama/11501) 2025-02-27 08:55:36 +02:00
7c165d7fa8 vulkan: use smaller combined allocations to avoid fragmentation (llama/11551) 2025-02-27 08:55:36 +02:00
2f0cf44915 metal : avoid breaking build when metal API predates TARGET_OS_VISION (llama/11690)
Avoids breakage in nix flake build introduced by b0569130c5e9c671152c913d82803b7c2f014ff9
2025-02-27 08:55:36 +02:00
b9c972fd0d metal : adjust support conditions for norm operators (llama/11671)
cont #11659

ggml-ci
2025-02-27 08:55:36 +02:00
01c9aafbfd CUDA: support for mat. mul. with ne03 != ne13 (llama/11656) 2025-02-27 08:55:36 +02:00
bae6bbf487 CUDA: non-contiguous (RMS) norm support (llama/11659)
* CUDA: non-contiguous (RMS) norm support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-27 08:55:36 +02:00
c310272fa0 HIP: force max threads per block to be 1024 (llama/11621)
Some old/vendor forked version of llvm still use 256. Explicitly set it to 1024 to align with upstream llvm.

Signed-off-by: fxzjshm <fxzjshm@163.com>
2025-02-27 08:55:36 +02:00
bd0b55dbe0 metal : use residency set for other platforms (llama/11648) 2025-02-27 08:55:36 +02:00
ba4645db2c rpc: fix known RCE in rpc-server (ggml/1103)
Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes
+ Check if  `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer’s allocated region.
2025-02-27 08:55:36 +02:00
46d07b9c85 cmake : fix compile assumptions for power9/etc (#2777)
Some checks are pending
CI / ubuntu-22-clang (linux/amd64, Debug) (push) Waiting to run
CI / ubuntu-22-clang (linux/amd64, Release) (push) Waiting to run
CI / ubuntu-22-clang (linux/arm64, Debug) (push) Waiting to run
CI / ubuntu-22-clang (linux/arm64, Release) (push) Waiting to run
CI / ubuntu-22-clang (linux/ppc64le, Debug) (push) Waiting to run
CI / ubuntu-22-clang (linux/ppc64le, Release) (push) Waiting to run
CI / ubuntu-22-gcc-sanitized (linux/amd64, ADDRESS) (push) Waiting to run
CI / ubuntu-22-gcc-sanitized (linux/amd64, THREAD) (push) Waiting to run
CI / ubuntu-22-gcc-sanitized (linux/amd64, UNDEFINED) (push) Waiting to run
CI / ubuntu-22-cmake-sycl (linux/amd64, icx, icpx, ON) (push) Waiting to run
CI / ubuntu-22-cmake-sycl (linux/arm/v7, icx, icpx, ON) (push) Waiting to run
CI / ubuntu-22-cmake-sycl (linux/arm64, icx, icpx, ON) (push) Waiting to run
CI / ubuntu-22-cmake-sycl (linux/ppc64le, icx, icpx, ON) (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (linux/amd64, icx, icpx, ON) (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (linux/arm/v7, icx, icpx, ON) (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (linux/arm64, icx, icpx, ON) (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (linux/ppc64le, icx, icpx, ON) (push) Waiting to run
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run
CI / windows (Win32, Release, win32-x86, x86, 2.28.5, ON) (push) Waiting to run
CI / windows (x64, Release, win32-x86-64, x64, 2.28.5, ON) (push) Waiting to run
CI / windows-blas (Win32, ON, Release, x86, 2.28.5, ON) (push) Waiting to run
CI / windows-blas (x64, ON, Release, x64, 2.28.5, ON) (push) Waiting to run
CI / windows-cublas (x64, Release, ON, 11.8.0, ON, 2.28.5) (push) Waiting to run
CI / windows-cublas (x64, Release, ON, 12.2.0, ON, 2.28.5) (push) Waiting to run
CI / emscripten (Release) (push) Waiting to run
CI / ios-xcode-build (Release) (push) Waiting to run
CI / android (push) Waiting to run
CI / quantize (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/main.Dockerfile platform:linux/amd64 tag:main]) (push) Waiting to run
* Add small comment re: VSX to readme

Co-authored-by: midnight <midnight@example.com>
2025-02-05 14:41:10 +02:00
dbeb7916b8 CUDA: fix Volta FlashAttention logic (llama/11615) 2025-02-03 22:00:57 +02:00
fad2806352 HIP: fix flash_attn_stream_k_fixup warning (llama/11604) 2025-02-03 22:00:57 +02:00
9906792ec3 CUDA/HIP: add support for selectable warp size to mmv (llama/11519)
CUDA/HIP: add support for selectable warp size to mmv
2025-02-03 22:00:57 +02:00
c49ee07ff4 HIP: add GGML_CUDA_CC_IS_* for amd familys as increasing cc archtectures for amd gpus are not supersets of eatch other (llama/11601)
This fixes a bug where RDNA1 gpus other than gfx1010 where not handled correctly
2025-02-03 22:00:57 +02:00
f8a831779e CUDA: use mma PTX instructions for FlashAttention (llama/11583)
* CUDA: use mma PTX instructions for FlashAttention

* __shfl_sync workaround for movmatrix

* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-03 22:00:57 +02:00
85451e3612 ci: use sccache on windows instead of ccache (llama/11545)
* Use sccache on ci for windows

* Detect sccache in cmake
2025-02-03 22:00:57 +02:00
43c744ce8b HIP: require at least HIP 5.5 2025-02-03 22:00:57 +02:00
fc2e44490d HIP: Prepare reduction operators for wave 64 2025-02-03 22:00:57 +02:00
f41fdad200 CUDA/HIP: add warp_size to cuda_device_info 2025-02-03 22:00:57 +02:00
80fa576254 vulkan: implement initial support for IQ2 and IQ3 quantizations (llama/11360)
* vulkan: initial support for IQ3_S

* vulkan: initial support for IQ3_XXS

* vulkan: initial support for IQ2_XXS

* vulkan: initial support for IQ2_XS

* vulkan: optimize Q3_K by removing branches

* vulkan: implement dequantize variants for coopmat2

* vulkan: initial support for IQ2_S

* vulkan: vertically realign code

* port failing dequant callbacks from mul_mm

* Fix array length mismatches

* vulkan: avoid using workgroup size before it is referenced

* tests: increase timeout for Vulkan llvmpipe backend

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-03 22:00:57 +02:00
75e7d0585e vulkan: Catch pipeline creation failure and print an error message (llama/11436)
* vulkan: Catch pipeline creation failure and print an error message

Also, fix some warnings from my on-demand compile change.

* vulkan: fix pipeline creation logging
2025-02-03 22:00:57 +02:00
682a6f5f87 HIP: Supress transformation warning in softmax.cu
loops with bounds not known at compile time can not be unrolled.
when ncols_template == 0, the bounds of the loop are not constexpr, thus llvm cant unroll the loops here.
2025-02-03 22:00:57 +02:00
115716d109 HIP: Only call rocblas_initialize on rocblas versions with the multiple instantation bug (llama/11080)
This disables the workaround on rocblas fixed versions (>=4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.
2025-02-03 22:00:57 +02:00
b2cfef655b cmake : don't fail on GGML_CPU=OFF (llama/11457) 2025-02-03 22:00:57 +02:00
22e3df0afa SYCL : SOFTMAX F16 mask support and other fixes (llama/11261)
Implemented ggml_sycl_op_soft_max() F16 src1(mask) support for which a pragma deprecation warning was added during #5021.
To do this, had to decouple it from ggml_sycl_op_flatten which always considered src1 to be of fp32 type(many OP functions are dependent on it).

* SYCL: SOFTMAX F16 mask support and other fixes

* test-backend-ops: Add F16 mask test cases
2025-02-03 22:00:57 +02:00
028511d349 AMD: parse the architecture as supplied by gcnArchName (llama/11244)
The value provided by minor doesn't include stepping for AMD, parse the value returned by gcnArchName instead to retrieve an accurate ID.
2025-02-03 22:00:57 +02:00
70c4038842 metal: Handle null returned from MTLCreateSystemDefaultDevice() (llama/11441)
This fixes segmentation fault error when running tests when no metal
devices are available (for example, when not linked with Core Graphics
framework or otherwise).
2025-02-03 22:00:57 +02:00
8639c003a9 metal : use residency sets (llama/11427)
* metal : use residency sets

ggml-ci

* metal : restore commandBufferWithUnretainedReferences calls [no ci]

* metal : release descriptors

ggml-ci

* metal : check env GGML_METAL_NO_RESIDENCY

ggml-ci

* metal : fix build + clean-up

ggml-ci
2025-02-03 22:00:57 +02:00
d5d831da65 cmake: add ggml find package (llama/11369)
* Add initial ggml cmake package

* Add build numbers to ggml find-package

* Expand variables with GGML_ prefix

* Guard against adding to cache variable twice

* Add git to msys2 workflow

* Handle ggml-cpu-* variants

* Link ggml/ggml-base libraries to their targets

* Replace main-cmake-pkg with simple-cmake-pkg

* Interface features require c_std_90

* Fix typo

* Removed unnecessary bracket from status message

* Update examples/simple-cmake-pkg/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/simple-cmake-pkg/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-03 22:00:57 +02:00
7230a6e1c8 vulkan: compile shaders on-demand (llama/11406)
Reduce first-run startup time and memory consumption.

Should fix #11339.
2025-02-03 22:00:57 +02:00
a160fa0f3a Hip: disable VMM on hip as it seams that it dosent work in some configurations (llama/11420) 2025-02-03 22:00:57 +02:00
0282ad8fd1 hip : Add hipGraph and VMM support to ROCM (llama/11362)
* Add hipGraph support

* Enable VMM on rocm
2025-02-03 22:00:57 +02:00
9e467815d4 CUDA: fix FP16 cuBLAS GEMM (llama/11396) 2025-02-03 22:00:57 +02:00
727891d9bf rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (llama/11356) 2025-02-03 22:00:57 +02:00
c262dc80e2 CPU/CUDA: fix (GQA) mul mat back, add CUDA support (llama/11380) 2025-02-03 22:00:57 +02:00
16eeb31933 Vulkan-run-test: fix mmq_wg_denoms (llama/11343)
There should be a copy-and-paste error here.

*mmq_wg_denoms should be used together with *warptile_mmq, instead of
wg_denoms.
2025-02-03 22:00:57 +02:00
ba523d5e22 vulkan: sort shaders for more deterministic binary (llama/11315)
Fixes #11306.
2025-02-03 22:00:57 +02:00
3736706139 vulkan: fix diag_mask_inf (llama/11323)
With robustbufferaccess disabled, this shader was showing OOB stores. There
is a bounds check in the code, but the workgrouop dimensions were reversed vs
CUDA and it was running the wrong number of threads. So fix the workgroup
dimensions and disable robustness for this pipeline.
2025-02-03 22:00:57 +02:00
58640aa456 rpc : better caching of the base buffer pointer (llama/11331)
There is no need to use map, just store the base pointer in the buffer
context.
2025-02-03 22:00:57 +02:00
5183a05e56 metal : fix out-of-bounds write (llama/11314)
ggml-ci
2025-02-03 22:00:57 +02:00