b4b063a5c9
vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (llama/11592)
2025-02-27 08:55:36 +02:00
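A minimal C++ sketch of how a backend might honor this flag (only the variable name GGML_VK_PREFER_HOST_MEMORY comes from the commit; the surrounding logic is assumed):

    #include <cstdlib>

    // If GGML_VK_PREFER_HOST_MEMORY is set, flip the allocation preference
    // from device-local VRAM to host-visible memory.
    static bool vk_prefer_host_memory() {
        return std::getenv("GGML_VK_PREFER_HOST_MEMORY") != nullptr;
    }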
930b739e7a
vulkan: account for lookup tables when checking shared memory size (llama/11502)
2025-02-27 08:55:36 +02:00
5981352bb5
ggml: Fix data race in ggml threadpool (llama/11736)
...
After the barrier in the last iteration executes, the loop termination
condition is still evaluated. By then the main thread may already have
destroyed the cgraph object and its nodes, so another thread would access
memory that is already gone. Trouble can also occur when n_nodes == 0 or
abort is called, though it is unclear whether the former situation is
actually reachable.
The last synchronization should be done after the loop to ensure the
cgraph/cplan are not accessed after the main thread exits the function.
2025-02-27 08:55:36 +02:00
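A minimal C++20 sketch of the fix pattern (illustrative names, not the actual ggml threadpool code): the extra synchronization after the loop guarantees that no thread re-reads the graph once the main thread is free to destroy it.

    #include <barrier>
    #include <functional>
    #include <thread>
    #include <vector>

    struct cgraph_like { int n_nodes; };

    static void compute(cgraph_like * g, std::barrier<> & sync) {
        for (int i = 0; i < g->n_nodes; ++i) {  // condition re-reads g->n_nodes
            // ... compute node i ...
            sync.arrive_and_wait();             // per-iteration barrier
        }
        // Fix: synchronize once more AFTER the loop. Without this, the main
        // thread could destroy g while a sibling is still evaluating the
        // loop condition above.
        sync.arrive_and_wait();
    }

    int main() {
        const int n_threads = 4;
        std::barrier<> sync(n_threads);
        cgraph_like g = { 8 };
        std::vector<std::thread> workers;
        for (int t = 1; t < n_threads; ++t) {
            workers.emplace_back(compute, &g, std::ref(sync));
        }
        compute(&g, sync);  // main thread participates, then may free the graph
        for (auto & w : workers) {
            w.join();
        }
    }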
7561da244e
CUDA: fix min. version for movmatrix (llama/11751)
2025-02-27 08:55:36 +02:00
be83f342fb
vulkan: print shared memory size (llama/11719)
2025-02-27 08:55:36 +02:00
fd369871f7
SYCL: remove XMX info from print devices (llama/11712)
2025-02-27 08:55:36 +02:00
bbd8364f5e
ggml : optimizations and a build-warning fix for LoongArch (llama/11709)
...
* ggml : optimize convert f32<->f16 for loongarch_asx
* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16
* ggml : fix warnings when running the CPU CI locally on LoongArch
2025-02-27 08:55:36 +02:00
e4102440ef
SYCL: Adjust support condition for norm operators (llama/11674)
...
SYCL does not support non-contiguous tensors for norm operations
2025-02-27 08:55:36 +02:00
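For context, a C++ sketch of the contiguity test such a support condition relies on (modeled on ggml's tensor layout with sizes ne and byte strides nb; this is not the actual SYCL backend code):

    #include <cstddef>
    #include <cstdint>

    struct tensor_like {
        int64_t ne[4];      // elements per dimension
        size_t  nb[4];      // stride in bytes per dimension
        size_t  type_size;  // bytes per element
    };

    // A tensor is contiguous when each stride equals the element size times
    // the product of all faster-varying dimensions.
    static bool is_contiguous(const tensor_like & t) {
        size_t expected = t.type_size;
        for (int i = 0; i < 4; ++i) {
            if (t.nb[i] != expected) {
                return false;
            }
            expected *= (size_t) t.ne[i];
        }
        return true;
    }

    // The backend then claims norm ops only for contiguous inputs.
    static bool supports_norm(const tensor_like & src0) {
        return is_contiguous(src0);
    }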
f8242ec483
ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)
2025-02-27 08:55:36 +02:00
ef51b4cba4
vulkan: optimize coopmat2 iq2/iq3 callbacks (llama/11521)
...
* vulkan: optimize coopmat2 iq2/iq3 callbacks
* build: trigger CI on GLSL compute shader changes
2025-02-27 08:55:36 +02:00
6f08b24146
vulkan: initial support for IQ4_XS quantization (llama/11501)
2025-02-27 08:55:36 +02:00
7c165d7fa8
vulkan: use smaller combined allocations to avoid fragmentation (llama/11551)
2025-02-27 08:55:36 +02:00
2f0cf44915
metal : avoid breaking build when metal API predates TARGET_OS_VISION (llama/11690)
...
Avoids breakage in the Nix flake build introduced by b0569130c5e9c671152c913d82803b7c2f014ff9
2025-02-27 08:55:36 +02:00
b9c972fd0d
metal : adjust support conditions for norm operators (llama/11671)
...
cont #11659
ggml-ci
2025-02-27 08:55:36 +02:00
01c9aafbfd
CUDA: support for mat. mul. with ne03 != ne13 (llama/11656)
2025-02-27 08:55:36 +02:00
bae6bbf487
CUDA: non-contiguous (RMS) norm support (llama/11659)
...
* CUDA: non-contiguous (RMS) norm support
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-27 08:55:36 +02:00
c310272fa0
HIP: force max threads per block to be 1024 (llama/11621)
...
Some old/vendor-forked versions of LLVM still use 256. Explicitly set it to 1024 to align with upstream LLVM.
Signed-off-by: fxzjshm <fxzjshm@163.com>
2025-02-27 08:55:36 +02:00
bd0b55dbe0
metal : use residency set for other platforms (llama/11648)
2025-02-27 08:55:36 +02:00
ba4645db2c
rpc: fix known RCE in rpc-server (ggml/1103)
...
Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes: check that `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer's allocated region.
2025-02-27 08:55:36 +02:00
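A hedged C++ sketch of that check (struct and helper names are illustrative; only the (uint8_t *)dst->data + ggml_nbytes(src) condition comes from the commit message):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    struct tensor_like { void * data; size_t nbytes; };
    struct buffer_like { uint8_t * base; size_t size; };

    // Reject the copy when the write would run past the destination buffer's
    // allocated region; an out-of-bounds write here is what made the RCE
    // possible.
    static bool copy_tensor_checked(const tensor_like & src, tensor_like & dst,
                                    const buffer_like & dst_buf) {
        uint8_t * dst_data  = (uint8_t *) dst.data;
        uint8_t * write_end = dst_data + src.nbytes;
        if (dst_data < dst_buf.base || write_end > dst_buf.base + dst_buf.size) {
            return false;
        }
        std::memcpy(dst.data, src.data, src.nbytes);
        return true;
    }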
46d07b9c85
cmake : fix compile assumptions for power9/etc (#2777)
...
* Add small comment re: VSX to readme
Co-authored-by: midnight <midnight@example.com>
2025-02-05 14:41:10 +02:00
dbeb7916b8
CUDA: fix Volta FlashAttention logic (llama/11615)
2025-02-03 22:00:57 +02:00
fad2806352
HIP: fix flash_attn_stream_k_fixup warning (llama/11604)
2025-02-03 22:00:57 +02:00
9906792ec3
CUDA/HIP: add support for selectable warp size to mmv (llama/11519)
...
CUDA/HIP: add support for selectable warp size to mmv
2025-02-03 22:00:57 +02:00
c49ee07ff4
HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing cc architectures for AMD GPUs are not supersets of each other (llama/11601)
...
This fixes a bug where RDNA1 GPUs other than gfx1010 were not handled correctly.
2025-02-03 22:00:57 +02:00
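An illustrative sketch of the family-range idea (the macro naming follows the commit; the numeric values are placeholders, not ggml's real compute-capability encoding). Because a newer AMD architecture's cc value does not imply support for an older one's features, each family gets an explicit range test instead of a single >= comparison:

    // Placeholder cc values, for illustration only.
    #define CC_RDNA1 1010
    #define CC_RDNA2 1030
    #define CC_RDNA3 1100

    // Range tests, so e.g. every RDNA1 GPU matches, not just gfx1010.
    #define GGML_CUDA_CC_IS_RDNA1(cc) ((cc) >= CC_RDNA1 && (cc) < CC_RDNA2)
    #define GGML_CUDA_CC_IS_RDNA2(cc) ((cc) >= CC_RDNA2 && (cc) < CC_RDNA3)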
f8a831779e
CUDA: use mma PTX instructions for FlashAttention (llama/11583)
...
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-03 22:00:57 +02:00
85451e3612
ci : use sccache on windows instead of ccache (llama/11545)
...
* Use sccache on ci for windows
* Detect sccache in cmake
2025-02-03 22:00:57 +02:00
43c744ce8b
HIP: require at least HIP 5.5
2025-02-03 22:00:57 +02:00
fc2e44490d
HIP: Prepare reduction operators for wave 64
2025-02-03 22:00:57 +02:00
f41fdad200
CUDA/HIP: add warp_size to cuda_device_info
2025-02-03 22:00:57 +02:00
80fa576254
vulkan: implement initial support for IQ2 and IQ3 quantizations (llama/11360)
...
* vulkan: initial support for IQ3_S
* vulkan: initial support for IQ3_XXS
* vulkan: initial support for IQ2_XXS
* vulkan: initial support for IQ2_XS
* vulkan: optimize Q3_K by removing branches
* vulkan: implement dequantize variants for coopmat2
* vulkan: initial support for IQ2_S
* vulkan: vertically realign code
* port failing dequant callbacks from mul_mm
* Fix array length mismatches
* vulkan: avoid using workgroup size before it is referenced
* tests: increase timeout for Vulkan llvmpipe backend
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-03 22:00:57 +02:00
75e7d0585e
vulkan: Catch pipeline creation failure and print an error message (llama/11436)
...
* vulkan: Catch pipeline creation failure and print an error message
Also, fix some warnings from my on-demand compile change.
* vulkan: fix pipeline creation logging
2025-02-03 22:00:57 +02:00
682a6f5f87
HIP: Suppress transformation warning in softmax.cu
...
Loops with bounds not known at compile time cannot be unrolled. When ncols_template == 0, the bounds of the loop are not constexpr, so LLVM cannot unroll the loops here.
2025-02-03 22:00:57 +02:00
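A simplified host-side C++ analogue of the pattern (hypothetical function; the real code is a HIP kernel, and clang honors #pragma unroll while other compilers may ignore it): the loop bound is a compile-time constant only when the template parameter is non-zero, so the unroll hint can only be honored in that case.

    // When ncols_template == 0 the bound is a runtime value, so the compiler
    // cannot unroll the loop and warns about the unfulfillable unroll request
    // unless the warning is suppressed for that instantiation.
    template <int ncols_template>
    static void scale_row(const float * x, float * y, int ncols_runtime) {
        const int ncols = ncols_template == 0 ? ncols_runtime : ncols_template;
    #pragma unroll
        for (int i = 0; i < ncols; ++i) {
            y[i] = 2.0f * x[i];
        }
    }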
115716d109
HIP: Only call rocblas_initialize on rocblas versions with the multiple instantiation bug (llama/11080)
...
This disables the workaround on fixed rocBLAS versions (>= 4.0.0) to eliminate the runtime cost and the unnecessary VRAM allocation of loading all Tensile objects.
2025-02-03 22:00:57 +02:00
b2cfef655b
cmake : don't fail on GGML_CPU=OFF (llama/11457)
2025-02-03 22:00:57 +02:00
22e3df0afa
SYCL : SOFTMAX F16 mask support and other fixes (llama/11261)
...
Implemented ggml_sycl_op_soft_max() F16 src1 (mask) support, for which a pragma deprecation warning was added in #5021.
To do this, it had to be decoupled from ggml_sycl_op_flatten, which always assumed src1 to be of fp32 type (many OP functions depend on this).
* SYCL: SOFTMAX F16 mask support and other fixes
* test-backend-ops: Add F16 mask test cases
2025-02-03 22:00:57 +02:00
028511d349
AMD: parse the architecture as supplied by gcnArchName (llama/11244)
...
The value provided by `minor` doesn't include the stepping for AMD; parse the value returned by gcnArchName instead to retrieve an accurate ID.
2025-02-03 22:00:57 +02:00
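A hedged C++ sketch of the parsing idea (the commit's actual code may differ): hipDeviceProp_t::gcnArchName holds strings like "gfx1030:sramecc+:xnack-", and the part after "gfx" carries the accurate architecture ID, unlike the minor field.

    #include <cstdlib>
    #include <cstring>

    // Extract the architecture ID from a gcnArchName string such as
    // "gfx1030:sramecc+:xnack-"; returns -1 if the prefix is missing.
    // IDs like 90a contain hex digits, so parse base 16 (an assumption
    // about the encoding, matching names like gfx90a).
    static int parse_gcn_arch(const char * gcn_arch_name) {
        const char * p = std::strstr(gcn_arch_name, "gfx");
        if (p == nullptr) {
            return -1;
        }
        return (int) std::strtol(p + 3, nullptr, 16);
    }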
70c4038842
metal: Handle null returned from MTLCreateSystemDefaultDevice() (llama/11441)
...
This fixes a segmentation fault when running tests and no Metal devices
are available (for example, when not linked with the Core Graphics
framework).
2025-02-03 22:00:57 +02:00
8639c003a9
metal : use residency sets (llama/11427)
...
* metal : use residency sets
ggml-ci
* metal : restore commandBufferWithUnretainedReferences calls [no ci]
* metal : release descriptors
ggml-ci
* metal : check env GGML_METAL_NO_RESIDENCY
ggml-ci
* metal : fix build + clean-up
ggml-ci
2025-02-03 22:00:57 +02:00
d5d831da65
cmake: add ggml find package (llama/11369)
...
* Add initial ggml cmake package
* Add build numbers to ggml find-package
* Expand variables with GGML_ prefix
* Guard against adding to cache variable twice
* Add git to msys2 workflow
* Handle ggml-cpu-* variants
* Link ggml/ggml-base libraries to their targets
* Replace main-cmake-pkg with simple-cmake-pkg
* Interface features require c_std_90
* Fix typo
* Removed unnecessary bracket from status message
* Update examples/simple-cmake-pkg/README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/simple-cmake-pkg/README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-03 22:00:57 +02:00
7230a6e1c8
vulkan: compile shaders on-demand (llama/11406)
...
Reduce first-run startup time and memory consumption.
Should fix #11339.
2025-02-03 22:00:57 +02:00
a160fa0f3a
HIP: disable VMM on HIP as it seems that it doesn't work in some configurations (llama/11420)
2025-02-03 22:00:57 +02:00
0282ad8fd1
hip : Add hipGraph and VMM support to ROCM (llama/11362)
...
* Add hipGraph support
* Enable VMM on rocm
2025-02-03 22:00:57 +02:00
9e467815d4
CUDA: fix FP16 cuBLAS GEMM (llama/11396)
2025-02-03 22:00:57 +02:00
727891d9bf
rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (llama/11356)
2025-02-03 22:00:57 +02:00
c262dc80e2
CPU/CUDA: fix (GQA) mul mat back, add CUDA support (llama/11380)
2025-02-03 22:00:57 +02:00
16eeb31933
Vulkan-run-test: fix mmq_wg_denoms (llama/11343)
...
This appears to be a copy-and-paste error: *mmq_wg_denoms should be used
together with *warptile_mmq, instead of wg_denoms.
2025-02-03 22:00:57 +02:00
ba523d5e22
vulkan: sort shaders for more deterministic binary (llama/11315)
...
Fixes #11306.
2025-02-03 22:00:57 +02:00
3736706139
vulkan: fix diag_mask_inf (llama/11323)
...
With robustBufferAccess disabled, this shader was producing OOB stores. There
is a bounds check in the code, but the workgroup dimensions were reversed vs
CUDA and it was running the wrong number of threads. So fix the workgroup
dimensions and disable robustness for this pipeline.
2025-02-03 22:00:57 +02:00
58640aa456
rpc : better caching of the base buffer pointer (llama/11331)
...
There is no need to use a map; just store the base pointer in the buffer
context.
2025-02-03 22:00:57 +02:00
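A minimal sketch of the simplification (illustrative types, not the actual rpc-server code): the base pointer is cached once in the buffer context instead of being looked up through a map on every access.

    #include <cstddef>

    // Before: a map from buffer to base pointer, consulted on each access.
    // After: the buffer context simply carries the pointer.
    struct rpc_buffer_context_like {
        void * base_ptr;  // cached once, when the buffer is created
        size_t size;
    };

    static void * get_base(const rpc_buffer_context_like & ctx) {
        return ctx.base_ptr;
    }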
5183a05e56
metal : fix out-of-bounds write (llama/11314)
...
ggml-ci
2025-02-03 22:00:57 +02:00