2357 Commits

Author SHA1 Message Date
Daniel Bevenius
5f75cae0b5 ci : fix whisper.dll path in build.yml 2025-03-28 08:48:16 +01:00
Daniel Bevenius
4c0c912176 ci : use arch for .dll names and enable jna debug 2025-03-28 08:38:19 +01:00
Daniel Bevenius
fa8c577b14 ci : fix List build release files step 2025-03-28 08:08:01 +01:00
Daniel Bevenius
956ceefd58 ci : fix copy of whisper.dll to build\Release dir 2025-03-28 07:53:42 +01:00
Daniel Bevenius
36fa375b81 ci : add BUILD_SHARED_LIBS=ON windows build option 2025-03-27 19:59:10 +01:00
Daniel Bevenius
14ffc5e282 ci : copy SDL2.dll to build\Release\SDL2.dll 2025-03-27 19:27:53 +01:00
Daniel Bevenius
fdeea64b86 ci : fix path to SDL2.dll 2025-03-27 19:01:56 +01:00
Daniel Bevenius
95288a8f99 ci : fix sdl2.dll upload and download 2025-03-27 18:50:20 +01:00
Daniel Bevenius
2982bf72bb ci : move SDL2.dll upload to correct job 2025-03-27 18:09:58 +01:00
Daniel Bevenius
1b76698c9c ci : download SDL2.dll and copy it to the resources directory 2025-03-27 17:20:34 +01:00
Daniel Bevenius
f3c9030875 ci : add logging to debug JNA library loading 2025-03-27 16:37:11 +01:00
Daniel Bevenius
70f35b186d bindings.java : update destination path for native libraries 2025-03-27 15:57:53 +01:00
Daniel Bevenius
8b1661a667 ci : try copying the DLL to build/Release
The motivation for this is that there is a gradle task that copies the
dll from this location and hopefully this will work in github actions
too as I'm struggling to get this to work.
2025-03-27 15:39:30 +01:00
Daniel Bevenius
4f9a7dbb9b ci: move .dll to correct location bindings-java 2025-03-27 15:00:34 +01:00
Daniel Bevenius
7129bbfed9 squash! ci : re-enable bindings-java (java) job
Rename the downloaded (from github workflow storage) .dll to whisper.dll
as this is what WhisperCppJnaLibrary expects.
2025-03-27 14:37:29 +01:00
Daniel Bevenius
bfc213d2d0 squash! ci : re-enable bindings-java (java) job
Update directory for windows dll.
2025-03-27 14:02:03 +01:00
Daniel Bevenius
5b141a977e squash! ci : re-enable bindings-java (java) job
Add a condition to the bindings-java job to only run when the event is a
push, pull_request, or the run_type is full-ci.
2025-03-27 13:38:43 +01:00
Daniel Bevenius
0208803b66 ci : re-enable bindings-java (java) job
This commit re-enables the job previously named `java`, which was
disabled in the build.yml file.

The motivation for this is that we recently fixed a few issues in the
java bindings and it should be possible to build them on windows.

Refs: https://github.com/ggerganov/whisper.cpp/pull/2949
Refs: https://github.com/ggerganov/whisper.cpp/issues/2781
2025-03-27 13:35:33 +01:00
Georgi Gerganov
f28bf5d186 xcf : fix visionOS build
ref: https://github.com/ggml-org/llama.cpp/pull/12415

ggml-ci
2025-03-27 11:06:03 +02:00
Georgi Gerganov
1fbdfb1d36 files : remove old wkv6 (#0)
ggml-ci
2025-03-27 11:06:03 +02:00
Georgi Gerganov
ee5581633b sync : ggml
ggml-ci
2025-03-27 11:06:03 +02:00
Georgi Gerganov
8ca67df291 ggml : sync/merge cmake,riscv,powerpc, add common.cmake (ggml/0) 2025-03-27 11:06:03 +02:00
amritahs-ibm
fc6d343e76 llamafile : ppc64le MMA implementation for Q4_0. (llama/12489)
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le ISA using MMA
builtins. This patch handles matrix multiplication
between quantised datatypes, block_q4_0 and
block_q8_0.

This change results in a 5% - 50% improvement
in total speed (i.e. all tokens/total time), across
various batch sizes.

The patch is tested with Meta-Llama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2025-03-27 11:06:03 +02:00
Akarshan Biswas
3199356d3a SYCL: implement memset ggml backend buffer interface (llama/12580)
* SYCL: implement memset ggml backend buffer interface

* use GGML_ABORT macro

* Do not wait for all queues to finish for memset operation
2025-03-27 11:06:03 +02:00
Slobodan Josic
e0c43b0bbf HIP: Add support for RDNA4 targets (llama/12372) 2025-03-27 11:06:03 +02:00
Georgi Gerganov
f4f619ea8e metal : refactor mat-vec code (llama/12569)
* metal : refactor mat-vec code

ggml-ci

* metal : rename all_sum -> sum_all

ggml-ci

* metal : fix comments [no ci]

* metal : fix nr constant [no ci]

* metal : mv q6_K support nr0 > 1

ggml-ci

* metal : reduce register pressure

ggml-ci

* metal : fix typo [no ci]

* metal : reduce register pressure

ggml-ci
2025-03-27 11:06:03 +02:00
Georgi Gerganov
3c4d363872 ggml : fix MUL_MAT_ID repack with Q8_K (llama/12544)
* ggml : fix MUL_MAT_ID repack with Q8_K

ggml-ci

* ggml : improve repack templates

ggml-ci
2025-03-27 11:06:03 +02:00
Dan Johansson
15aa189329 ggml-cpu : update KleidiAI to v1.5.0 (llama/12568)
ggml-cpu : bug fix related to KleidiAI LHS packing

Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-03-27 11:06:03 +02:00
Akarshan Biswas
c53d5c9e85 SYCL: disable Q4_0 reorder optimization (llama/12560)
ggml-ci
2025-03-27 11:06:03 +02:00
lhez
ba6f584f30 opencl: simplify kernel embedding logic in cmakefile (llama/12503)
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
2025-03-27 11:06:03 +02:00
R0CKSTAR
a219941812 CUDA: Fix clang warnings (llama/12540)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
a2cc8c2666 vulkan: fix mul_mat_vec failure in backend tests (llama/12529)
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
new backend tests that hit this failure on NVIDIA GPUs.
2025-03-27 11:06:03 +02:00
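
The out-of-bounds hazard described in a2cc8c2666 can be illustrated with a small sketch. This is not the Vulkan shader code: the function name `dot_unrolled`, the unroll factor, and the inputs are all hypothetical. It shows a dot product whose main loop is unrolled by a fixed factor, with the unrolled part restricted to complete groups so that the last iteration cannot read past the end of the inputs.

```python
def dot_unrolled(a, b, unroll=4):
    """Dot product with a manually unrolled main loop (illustrative only)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)

    # Only complete groups of `unroll` elements go through the unrolled body.
    # Running the unrolled body on a partial final group is the kind of
    # mistake that produces an out-of-bounds access.
    full = n - (n % unroll)

    total = 0.0
    i = 0
    while i < full:
        # Unrolled body: every index i + j is guaranteed to be < n here.
        for j in range(unroll):
            total += a[i + j] * b[i + j]
        i += unroll

    # Remainder loop handles the 0..unroll-1 leftover elements one at a time.
    for k in range(full, n):
        total += a[k] * b[k]
    return total
```

Lengths that are not a multiple of the unroll factor (for example 5 elements with `unroll=4`) are the cases worth covering in a test, since they are the ones that exercise the remainder path.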
Georgi Gerganov
388ed98220 ggml : fix quantized cpy op (llama/12310)
* ggml : fix quantized cpy op

ggml-ci

* tests : add cpy tests for all types

ggml-ci

* tests : add BF16 copy tests

ggml-ci

* tests : fix loop for same-type copy

ggml-ci

* tests : add option to permute the dst tensor

ggml-ci
2025-03-27 11:06:03 +02:00
R0CKSTAR
d487a28ae1 musa: refine compute capability (llama/12493)
* musa: refine compute capability

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
cbb88c4050 vulkan: Optimize mul_mat_vec p021 and nc shaders (llama/12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.
2025-03-27 11:06:03 +02:00
stduhpf
13455c0b5f Vulkan: RTE rounding for cpy to quant (llama/12480)
* Vulkan: RTE rounding for cpy to quant

Co-Authored-By: Jeff Bolz <jbolz@nvidia.com>

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-03-27 11:06:03 +02:00
Eve
2f77a9e9bd vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (llama/12472) 2025-03-27 11:06:03 +02:00
蕭澧邦
fa2b5249ff Fix build on Windows when ccache enabled (ggml/9954) (llama/9976)
* [SYCL] Fix build on Windows when ccache enabled (llama/9954)

* take effect only on windows and force it to icl

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-27 11:06:03 +02:00
Svetlozar Georgiev
5b854ebba5 sycl: cleanup oneDNN related code (llama/12097) 2025-03-27 11:06:03 +02:00
Srihari-mcw
8058f19d0b ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (llama/12332)
* Add block interleaving support for Q4_K quantization

* Remove whitespaces and fix CI/CD issues

* Update pointer of bsums from int16_t to const int16_t

* Add vector version of quantize_q8_K_4x8 function

* Update code formatting based on review comments
2025-03-27 11:06:03 +02:00
Gaurav Garg
ae6a9bb9a5 CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-27 11:06:03 +02:00
Jeff Bolz
24faba9e9b vulkan: optimize iq1 coopmat2 dequant functions (llama/12427) 2025-03-27 11:06:03 +02:00
Guus Waals
c722ff84d3 Fix visionOS build and add CI (llama/12415)
* ci: add visionOS build workflow

Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.

* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs

* ci: remove define hacks for u_xxx system types

---------

Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
102af79f63 vulkan: Submit once enough matmul work has been recorded (llama/12406)
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
2025-03-27 11:06:03 +02:00
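
A rough sketch of the byte-based submit heuristic described in 102af79f63, written in Python for brevity rather than taken from the Vulkan backend. The class name `SubmitHeuristic`, its methods, and the exact ramp schedule (starting at one eighth of the target and doubling on each of the first three submits) are assumptions made for illustration. The commit message only states that the heuristic counts weight-matrix bytes, flushes every 100MB, and ramps up after the first few submits.

```python
MB = 1000 * 1000


class SubmitHeuristic:
    """Decide when to submit recorded matmul work, based on weight bytes."""

    def __init__(self, target_bytes=100 * MB, ramp_submits=3):
        self.target_bytes = target_bytes  # steady-state flush threshold
        self.ramp_submits = ramp_submits  # assumed length of the ramp-up phase
        self.submit_count = 0
        self.pending_bytes = 0

    def threshold(self):
        # Assumed ramp: halve the target once per remaining ramp submit, so
        # early submits happen sooner and the threshold doubles each time
        # until it reaches the steady-state target.
        remaining = max(self.ramp_submits - self.submit_count, 0)
        return self.target_bytes >> remaining

    def record_matmul(self, weight_bytes):
        """Account for one matmul; return True if work should be submitted."""
        self.pending_bytes += weight_bytes
        if self.pending_bytes >= self.threshold():
            self.pending_bytes = 0
            self.submit_count += 1
            return True
        return False


if __name__ == "__main__":
    h = SubmitHeuristic()
    for step in range(20):
        if h.record_matmul(10 * MB):
            print(f"submit after matmul {step + 1}, next threshold {h.threshold()}")
```

The point of keying the decision on bytes rather than on node count is that a few large matmuls and many small ones produce comparable amounts of GPU work per submit.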
lhez
03c364557d opencl: improve profiling (llama/12442)
* opencl: more profiling timing

* opencl: generate trace for profiling

* opencl: reduce profiling overhead

* Populate profiling timing info at the end rather than after each
  kernel run

* opencl: fix for chrome tracing
2025-03-27 11:06:03 +02:00
R0CKSTAR
31b62276cf musa: override warp_size of musa device to 32 (llama/12445)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Łukasz Ślusarczyk
97b5a3055d SYCL: using graphs is configurable by environment variable and compile option (llama/12371)
* alberto changes

* enable sycl graphs by env variable

* fixed compilation warnings in ggml-sycl.cpp

* renamed graph variables

* fix markdown in docs/backend/SYCL.md

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>

* fix markdown in docs/backend/SYCL.md again

* compiling graphs by default, renamed graph_enable to graph_disable

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-27 11:06:03 +02:00
fj-y-saito
9993c3f703 ggml : add SVE support for q6_K_q8_K (llama/12361) 2025-03-27 11:06:03 +02:00
0cc4m
fa72479cfb Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (llama/12434) 2025-03-27 11:06:03 +02:00
Łukasz Ślusarczyk
6c15539c54 fixed compilation warnings in ggml-sycl (llama/12424) 2025-03-27 11:06:03 +02:00