Daniel Bevenius
1c20f46887
ci : enable bindings java job ( #3070 )
...
* ci : re-enable bindings-java (java) job
This commit re-enables the job previously named `java`, which was
disabled in the build.yml file.
The motivation for this is that we recently fixed a few issues in the
Java bindings, and it should now be possible to build them on Windows.
Refs: https://github.com/ggerganov/whisper.cpp/pull/2949
Resolves: https://github.com/ggerganov/whisper.cpp/issues/2781
2025-04-25 14:56:06 +02:00
Georgi Gerganov
adaea088bc
ruby : add cmake option ( #0 )
2025-04-24 20:39:16 +03:00
Georgi Gerganov
6c0d843f9d
cuda : fix unused variable compile warning ( #0 )
...
ggml-ci
2025-04-24 20:39:16 +03:00
Georgi Gerganov
efb800557f
sync : ggml
...
ggml-ci
2025-04-24 20:39:16 +03:00
Georgi Gerganov
337becefb9
opencl : remove obsolete files (skip) (ggml/1200)
2025-04-24 20:39:16 +03:00
Georgi Gerganov
11ae30c19e
sync : ggml
2025-04-24 20:39:16 +03:00
lhez
88c3cecd43
opencl: split ggml-opencl.cl into multiple files and cleanup (llama/12886)
...
---------
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
2025-04-24 20:39:16 +03:00
Georgi Gerganov
fe4acb33e3
ggml : fix trailing whitespaces (llama/0)
2025-04-24 20:39:16 +03:00
Johannes Gäßler
fd5a3e1bc6
CUDA: use switch statements in constexpr functions (llama/13095)
2025-04-24 20:39:16 +03:00
Georgi Gerganov
01e1600edd
metal : fix floating-point range of attention scores in FA kernels (llama/13090)
...
ggml-ci
2025-04-24 20:39:16 +03:00
Eve
cf3eb291ab
vulkan: matmul gcn tuning (llama/13016)
...
* tune matmul for gcn
* this one is more power efficient
* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp
Co-authored-by: 0cc4m <picard12@live.de>
* disable this tune for the proprietary driver
---------
Co-authored-by: 0cc4m <picard12@live.de>
2025-04-24 20:39:16 +03:00
Johannes Gäßler
3d54b68ea7
CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (llama/13014)
...
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID
* fix logic for RoPE support, CUDA graphs
2025-04-24 20:39:16 +03:00
Diego Devesa
11218294db
ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (llama/12871)
...
* ggml : add SSE 4.2 variant for CPUs without AVX
* ggml : add x64 base ABI variant
2025-04-24 20:39:16 +03:00
Akarshan Biswas
33c89ade7d
SYCL: Add non-contiguous support in ROPE (llama/12993)
...
ggml-ci
2025-04-24 20:39:16 +03:00
Jeff Bolz
27a56e7243
vulkan: support noncontiguous rms_norm (llama/13031)
2025-04-24 20:39:16 +03:00
Jeffrey Morgan
f4ca3e2f9c
metal: add neg operator (llama/13029)
2025-04-24 20:39:16 +03:00
Akarshan Biswas
0287a5c51b
SYCL: Refactor and enable FP16 in binary broadcast OPs (llama/12975)
...
* SYCL: refactor move to a separate file
* Fix binbcast
* Remove duplicates
* fix include formatting
* fix typo
2025-04-24 20:39:16 +03:00
Radoslav Gerganov
24d29c55df
rpc : add RPC_CMD_HELLO (llama/12955)
...
Add RPC_CMD_HELLO for getting the version of the protocol implemented by
the server, following the semantic versioning rules at https://semver.org.
Hopefully this brings a better user experience when we make breaking
changes at the protocol level and avoids issues like #12465.
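As a sketch of the idea (the real wire format lives in ggml's RPC code; the struct and names below are purely illustrative), a hello reply can carry the protocol version and the client applies the semver compatibility rule:

```cpp
#include <cstdint>

// Illustrative only -- not the actual ggml-rpc message layout.
struct rpc_hello_reply {
    uint8_t major;  // breaking protocol changes
    uint8_t minor;  // backwards-compatible additions
    uint8_t patch;  // bug fixes
};

// Per https://semver.org, only a major-version mismatch is a hard incompatibility.
static bool rpc_version_compatible(const rpc_hello_reply & server, uint8_t client_major) {
    return server.major == client_major;
}
```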
2025-04-24 20:39:16 +03:00
Georgi Gerganov
36019c35a3
graph : make FA compatible with MLA + add initial Metal kernels (llama/12953)
...
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
ggml-ci
* llama : minor naming updates
ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
ggml-ci
2025-04-24 20:39:16 +03:00
Alan Gray
4e936e2afa
ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (llama/12970)
2025-04-24 20:39:16 +03:00
hipudding
314ce5981e
CANN: Add support for async operator submission (llama/12864)
...
Submit operators using asynchronous threads to improve performance.
Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.
Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
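A minimal sketch of such an environment-variable gate (the exact values GGML_CANN_ASYNC_MODE accepts are an assumption here, not taken from the change itself):

```cpp
#include <cstdlib>
#include <cstring>

// Assumption: the variable is unset by default (async off), and any value
// other than "0" turns asynchronous submission on.
static bool cann_async_mode_enabled() {
    const char * v = std::getenv("GGML_CANN_ASYNC_MODE");
    return v != nullptr && std::strcmp(v, "0") != 0;
}
```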
2025-04-24 20:39:16 +03:00
kimminsu
cb7642b0f5
opencl: fix incorrect local_size index in profiling log (llama/12868)
2025-04-24 20:39:16 +03:00
Jeff Bolz
7db8f278f0
vulkan: enable coopmat2 FA gqa and split_k optimizations more often (llama/12931)
...
The grouped query attention optimization doesn't require a power-of-two ratio;
the only thing relying on it was the modulo operation written as a bitwise &.
split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicate a larger
FA operation that wouldn't need splitting.
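The power-of-two requirement came from the classic identity that `i % ratio` equals `i & (ratio - 1)` only when `ratio` is a power of two; a tiny standalone check makes the difference visible:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // For ratio = 4 the two columns agree; for ratio = 6 they diverge
    // (e.g. 6 % 6 == 0 but 6 & 5 == 4), which is why the bitwise form
    // forced gqa_ratio to be a power of two.
    for (uint32_t ratio : {4u, 6u}) {
        for (uint32_t i = 0; i < 8; ++i) {
            std::printf("ratio=%u i=%u  i%%ratio=%u  i&(ratio-1)=%u\n",
                        ratio, i, i % ratio, i & (ratio - 1));
        }
    }
    return 0;
}
```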
2025-04-24 20:39:16 +03:00
Chenguang Li
be42a19eab
CANN: Add 310P operator support check (llama/12962)
2025-04-24 20:39:16 +03:00
Georgi Gerganov
b8755670ca
metal : add FA-vec kernels for head size 96 (llama/12952)
...
ggml-ci
2025-04-24 20:39:16 +03:00
hipudding
483eecae62
CANN: Add x86 build ci (llama/12950)
...
* CANN: Add x86 build ci
* CANN: fix code format
2025-04-24 20:39:16 +03:00
David Huang
43e3d25d93
CUDA/HIP: Share the same unified memory allocation logic. (llama/12934)
...
Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
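A rough sketch of the run-time switch this enables (the helper name and exact value parsing are illustrative, not the actual ggml-cuda code):

```cpp
#include <cstdlib>

// With a run-time environment check, a single binary can serve both
// integrated GPUs (managed/unified memory) and dedicated GPUs (plain device
// memory), instead of fixing the choice at compile time as GGML_HIP_UMA did.
static bool cuda_unified_memory_enabled() {
    const char * v = std::getenv("GGML_CUDA_ENABLE_UNIFIED_MEMORY");
    return v != nullptr && v[0] != '\0' && v[0] != '0';
}

// Inside the allocator this would pick e.g. cudaMallocManaged() over
// cudaMalloc() (hipMallocManaged() over hipMalloc() on AMD).
```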
2025-04-24 20:39:16 +03:00
Akarshan Biswas
e1dbf9a42e
SYCL: Add ROPE vision kernel (llama/12887)
...
* SYCL: Add ROPE vision kernel
* Add comment about rope mode
2025-04-24 20:39:16 +03:00
Srihari-mcw
ee0013865d
ggml : Add AVX512 implementation of GEMM - Q4_Kx8 (llama/12829)
...
* Add AVX512 implementation of GEMM - q4kx8
* Update changes to remove unnecessary whitespaces
2025-04-24 20:39:16 +03:00
Chenguang Li
32a407166b
CANN: Opt ROPE optimization (llama/12865)
...
* [CANN]Opt ROPE optimization
* [CANN]Codestyle adjustment
* [CANN]Fix the ROPE precision issue
* [CANN]codestyle fix
* [CANN]add rope unsupport case
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
2025-04-24 20:39:16 +03:00
Xinpeng Dou
622f981853
CANN: Optimize CANN buffer pool memory management (llama/12875)
...
Multiple optional memory pools are provided for CANN, including VMM,
priority queue-based, and traditional memory pools.
1. When the memory pool is available and GGML_CANN_DISABLE_VMM_POOL
   is not defined, the VMM pool is selected by default.
2. Otherwise, if GGML_CANN_ENABLE_BUF_PRIO_POOL is defined,
   the priority queue-based memory pool is used.
3. If neither condition is met, the default memory pool is used.
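Read as code, the precedence is a simple chain; a compile-time sketch (the enum and function names are illustrative):

```cpp
enum class cann_pool_kind { vmm, prio_queue, legacy };

// vmm_available reflects whether a VMM pool can actually be created on the
// current device; the two macros select among the pools described above.
static cann_pool_kind select_cann_pool(bool vmm_available) {
#if !defined(GGML_CANN_DISABLE_VMM_POOL)
    if (vmm_available) {
        return cann_pool_kind::vmm;        // rule 1: VMM pool by default
    }
#endif
#if defined(GGML_CANN_ENABLE_BUF_PRIO_POOL)
    return cann_pool_kind::prio_queue;     // rule 2: priority queue-based pool
#else
    return cann_pool_kind::legacy;         // rule 3: traditional pool
#endif
}
```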
2025-04-24 20:39:16 +03:00
Akarshan Biswas
d049d67065
SYCL: Fix im2col (llama/12910)
...
* SYCL: Fix im2col
* restore local workgroup size adjustments for large inputs
* restore format
2025-04-24 20:39:16 +03:00
Radoslav Gerganov
877308838e
rpc : use ggml_context_ptr (llama/12938)
2025-04-24 20:39:16 +03:00
Acly
d87dfcf7c0
ggml : Depthwise 2D convolution (ggml/1152)
...
* ggml-cpu : kernels for faster depthwise 2D convolution
* fix compile: remove static after moving to ops.cpp
* add dilation for depthwise_conv_2d
* review: rename to ggml_conv_2d_dw_direct, remove redundant struct keywords, pass by ref, whitespace
* review: rename depthwise_conv_2d -> conv_2d_dw everywhere
2025-04-24 20:39:16 +03:00
SXX
915c14ef10
ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register (llama/12773)
...
* ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register
* simplifies the codebase by removing redundant functions
2025-04-24 20:39:16 +03:00
Alan Gray
5d33d3c929
ggml: disable CUDA graphs for unsupported DUP and CONT node types (llama/12891)
...
Fixes #12798
2025-04-24 20:39:16 +03:00
Jeff Bolz
751e42b21e
vulkan: use aligned loads for flash attention mask (llama/12853)
...
Rewrite the stride logic for the mask tensor in the FA shader to force the
stride to be aligned, to allow using more efficient loads.
2025-04-24 20:39:16 +03:00
Ewan Crawford
e8ee32d12d
sycl: Support sycl_ext_oneapi_limited_graph (llama/12873)
...
The current usage of the SYCL-Graph extension checks for
the `sycl_ext_oneapi_graph` device aspect. However, it is also
possible to support `sycl_ext_oneapi_limited_graph` devices that
don't support graph update.
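A sketch of the widened capability check, assuming the aspect names proposed by the sycl_ext_oneapi_graph extension (exact spellings may vary between DPC++ releases):

```cpp
#include <sycl/sycl.hpp>

// Accept devices with full graph support (record/replay plus executable-graph
// update) as well as "limited" devices that can record/replay but not update.
static bool device_supports_sycl_graph(const sycl::device & dev) {
    return dev.has(sycl::aspect::ext_oneapi_graph) ||
           dev.has(sycl::aspect::ext_oneapi_limited_graph);
}
```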
2025-04-24 20:39:16 +03:00
Akarshan Biswas
e9ce285135
SYCL: Add fp16 type support to unary op kernels (llama/12788)
...
* SYCL: Add fp16 support to some elementwise OP kernels
* remove comment
ggml-ci
* Use static_cast directly
* remove not needed cast from tanh
* Use static cast and remove unneeded castings
* Adjust device_support_op for unary OPs
* Use cast_data and typed_data struct to deduplicate casting code
2025-04-24 20:39:16 +03:00
Aaron Teo
b942f451b6
ggml: fix compilation error s390x (llama/12848)
...
* ggml: fixes #12846 compilation error
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>
* ggml: add documentation for code change
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>
* ggml: refactor to type-cast and update documentation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>
* ggml: update documentation to provide full issue link
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>
---------
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>
2025-04-24 20:39:16 +03:00
cmdr2
e6410faf99
cpu: fix cpu backend's supports-op for GET_ROWS_BACK. fixes a fatal error when running test-backend-ops with only the CPU backend (ggml/1190)
2025-04-24 20:39:16 +03:00
Chenguang Li
182df69384
CANN: Support more ops (llama/12841)
...
* [CANN]Support Opt LOG && MEAN && PAD_REFLECT_1D
* [CANN]Support COUNT_EQUAL && STEP && SGN
* [CANN]codestyle adjustment
* [CANN]codestyle adjustment
---------
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
2025-04-24 20:39:16 +03:00
Prajwal B Mehendarkar
3bf9691dfd
Fixes #12823 (llama/12830)
...
* Including limits file on AIX
* Fixes #12823
2025-04-24 20:39:16 +03:00
Piotr Kubaj
ba444e9c23
ggml-cpu-impl.h: do not redefine bool on POWER9 (llama/12856)
...
error: unknown type name '_Bool'
2025-04-24 20:39:16 +03:00
Piotr Kubaj
c6caf8eef2
ggml-impl.h: fix build on POWER9 (llama/12855)
...
error: ISO C++17 does not allow 'register' storage class specifier
2025-04-24 20:39:16 +03:00
Chenguang Li
6cae79a1d7
CANN: Support Opt CONV_TRANSPOSE_1D and ELU (llama/12786)
...
* [CANN] Support ELU and CONV_TRANSPOSE_1D
* [CANN]Modification review comments
* [CANN]Modification review comments
* [CANN]name adjustment
* [CANN]remove lambda used in template
* [CANN]Use std::function instead of template
* [CANN]Modify the code according to the review comments
---------
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
2025-04-24 20:39:16 +03:00
Jeff Bolz
b9bfe0c693
vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (llama/12833)
...
q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.
This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.
The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
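The structural change, in spirit: a plain C++ analogue of the shader restructuring (names, sizes, and the decode step are made up for illustration):

```cpp
#include <array>
#include <cstdint>

constexpr int BLOCK_ITERS = 8;

struct decoded_scales { std::array<float, 16> s; };

// Stand-in for the expensive global-memory load + bit-unpacking of 16 B of
// scale data.
static decoded_scales decode_scales(const uint8_t * raw) {
    decoded_scales d{};
    for (int i = 0; i < 16; ++i) d.s[i] = raw[i] * 0.0625f;
    return d;
}

static float dot_step(const decoded_scales & sc, int i) { return sc.s[i % 16]; }

// Before: decode_scales() effectively ran once per inner iteration.
// After: decode once per outer-loop block into a cache (shared memory in the
// shader) and reuse it across the unrolled inner loop.
static float process_block(const uint8_t * raw_scales) {
    const decoded_scales cached = decode_scales(raw_scales);
    float acc = 0.0f;
    for (int i = 0; i < BLOCK_ITERS; ++i) {
        acc += dot_step(cached, i);
    }
    return acc;
}
```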
2025-04-24 20:39:16 +03:00
Jeff Bolz
1d50c6ac22
vulkan: Use fp16 for the flash attention P*V multiplication (llama/12783)
...
This is consistent with the ggml-cuda behavior and the mul_mat fallback.
2025-04-24 20:39:16 +03:00
Sigbjørn Skjæret
79f23d9132
cuda : add f32 to bf16 copy op (llama/12806)
...
This allows BF16 KV-cache on CUDA.
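For reference, bf16 is the top half of an IEEE-754 binary32 value (1 sign bit, the same 8-bit exponent, a 7-bit mantissa), so the conversion behind such a copy op can be a 16-bit truncation with rounding; a standard scheme (ggml's actual helper may differ):

```cpp
#include <cstdint>
#include <cstring>

// Round-to-nearest-even f32 -> bf16 by keeping the high 16 bits.
// (A production version would special-case NaN inputs.)
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFF + ((bits >> 16) & 1);  // ties to even
    return (uint16_t)(bits >> 16);
}
```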
2025-04-24 20:39:16 +03:00
Georgi Gerganov
ee2cbeeb74
llama : fix FA when KV cache is not used (i.e. embeddings) (llama/12825)
...
* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used
ggml-ci
* server : add test that exercises embeddings with FA enabled
ggml-ci
2025-04-24 20:39:16 +03:00