Daniel Bevenius
5f75cae0b5
ci : fix whisper.dll path in build.yml
2025-03-28 08:48:16 +01:00
Daniel Bevenius
4c0c912176
ci : use arch for .dll names and enable jna debug
2025-03-28 08:38:19 +01:00
Daniel Bevenius
fa8c577b14
ci : fix List build release files step
2025-03-28 08:08:01 +01:00
Daniel Bevenius
956ceefd58
ci : fix copy of whisper.dll to build\Release dir
2025-03-28 07:53:42 +01:00
Daniel Bevenius
36fa375b81
ci : add BUILD_SHARED_LIBS=ON windows build option
2025-03-27 19:59:10 +01:00
Daniel Bevenius
14ffc5e282
ci : copy SDL2.dll to build\Release\SDL2.dll
2025-03-27 19:27:53 +01:00
Daniel Bevenius
fdeea64b86
ci : fix path to SDL2.dll
2025-03-27 19:01:56 +01:00
Daniel Bevenius
95288a8f99
ci : fix sdl2.dll upload and download
2025-03-27 18:50:20 +01:00
Daniel Bevenius
2982bf72bb
ci : move SDL2.dll upload to correct job
2025-03-27 18:09:58 +01:00
Daniel Bevenius
1b76698c9c
ci : download SDL2.dll and copy it to the resources directory
2025-03-27 17:20:34 +01:00
Daniel Bevenius
f3c9030875
ci : add logging to debug JNA library loading
2025-03-27 16:37:11 +01:00
Daniel Bevenius
70f35b186d
bindings.java : update destination path for native libraries
2025-03-27 15:57:53 +01:00
Daniel Bevenius
8b1661a667
ci : try copying the DLL to build/Release
...
The motivation for this is that there is a Gradle task that copies the
DLL from this location, and hopefully this will work in GitHub Actions
too, as I'm struggling to get this to work.
2025-03-27 15:39:30 +01:00
Daniel Bevenius
4f9a7dbb9b
ci : move .dll to correct location for bindings-java
2025-03-27 15:00:34 +01:00
Daniel Bevenius
7129bbfed9
squash! ci : re-enable bindings-java (java) job
...
Rename the downloaded .dll (from GitHub workflow storage) to whisper.dll,
as this is what WhisperCppJnaLibrary expects.
2025-03-27 14:37:29 +01:00
Daniel Bevenius
bfc213d2d0
squash! ci : re-enable bindings-java (java) job
...
Update directory for windows dll.
2025-03-27 14:02:03 +01:00
Daniel Bevenius
5b141a977e
squash! ci : re-enable bindings-java (java) job
...
Add a condition to the bindings-java job to only run when the event is a
push, pull_request, or the run_type is full-ci.
2025-03-27 13:38:43 +01:00
Daniel Bevenius
0208803b66
ci : re-enable bindings-java (java) job
...
This commit re-enables the job previously named `java` which was
disabled in the build.yml file.
The motivation for this is that we recently fixed a few issues in the
Java bindings and it should be possible to build them on Windows.
Refs: https://github.com/ggerganov/whisper.cpp/pull/2949
Refs: https://github.com/ggerganov/whisper.cpp/issues/2781
2025-03-27 13:35:33 +01:00
Georgi Gerganov
f28bf5d186
xcf : fix visionOS build
...
ref: https://github.com/ggml-org/llama.cpp/pull/12415
ggml-ci
2025-03-27 11:06:03 +02:00
Georgi Gerganov
1fbdfb1d36
files : remove old wkv6 ( #0 )
...
ggml-ci
2025-03-27 11:06:03 +02:00
Georgi Gerganov
ee5581633b
sync : ggml
...
ggml-ci
2025-03-27 11:06:03 +02:00
Georgi Gerganov
8ca67df291
ggml : sync/merge cmake,riscv,powerpc, add common.cmake (ggml/0)
2025-03-27 11:06:03 +02:00
amritahs-ibm
fc6d343e76
llamafile : ppc64le MMA implementation for Q4_0. (llama/12489)
...
This change upstreams llamafile's CPU matrix
multiplication kernels for the ppc64le ISA using MMA
builtins. This patch handles matrix multiplication
between quantised datatypes, block_q4_0 and
block_q8_0.
This change results in a 5% - 50% improvement
in total speed (i.e. all tokens/total time), across
various batch sizes.
The patch is tested with Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2025-03-27 11:06:03 +02:00
Akarshan Biswas
3199356d3a
SYCL: implement memset ggml backend buffer interface (llama/12580)
...
* SYCL: implement memset ggml backend buffer interface
* use GGML_ABORT macro
* Do not wait for all queues to finish for memset operation
2025-03-27 11:06:03 +02:00
Slobodan Josic
e0c43b0bbf
HIP: Add support for RDNA4 targets (llama/12372)
2025-03-27 11:06:03 +02:00
Georgi Gerganov
f4f619ea8e
metal : refactor mat-vec code (llama/12569)
...
* metal : refactor mat-vec code
ggml-ci
* metal : rename all_sum -> sum_all
ggml-ci
* metal : fix comments [no ci]
* metal : fix nr constant [no ci]
* metal : mv q6_K support nr0 > 1
ggml-ci
* metal : reduce register pressure
ggml-ci
* metal : fix typo [no ci]
* metal : reduce register pressure
ggml-ci
2025-03-27 11:06:03 +02:00
Georgi Gerganov
3c4d363872
ggml : fix MUL_MAT_ID repack with Q8_K (llama/12544)
...
* ggml : fix MUL_MAT_ID repack with Q8_K
ggml-ci
* ggml : improve repack templates
ggml-ci
2025-03-27 11:06:03 +02:00
Dan Johansson
15aa189329
ggml-cpu : update KleidiAI to v1.5.0 (llama/12568)
...
ggml-cpu : bug fix related to KleidiAI LHS packing
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-03-27 11:06:03 +02:00
Akarshan Biswas
c53d5c9e85
SYCL: disable Q4_0 reorder optimization (llama/12560)
...
ggml-ci
2025-03-27 11:06:03 +02:00
lhez
ba6f584f30
opencl: simplify kernel embedding logic in cmakefile (llama/12503)
...
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
2025-03-27 11:06:03 +02:00
R0CKSTAR
a219941812
CUDA: Fix clang warnings (llama/12540)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
a2cc8c2666
vulkan: fix mul_mat_vec failure in backend tests (llama/12529)
...
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
new backend tests that hit this failure on NVIDIA GPUs.
2025-03-27 11:06:03 +02:00
Georgi Gerganov
388ed98220
ggml : fix quantized cpy op (llama/12310)
...
* ggml : fix quantized cpy op
ggml-ci
* tests : add cpy tests for all types
ggml-ci
* tests : add BF16 copy tests
ggml-ci
* tests : fix loop for same-type copy
ggml-ci
* tests : add option to permute the dst tensor
ggml-ci
2025-03-27 11:06:03 +02:00
R0CKSTAR
d487a28ae1
musa: refine compute capability (llama/12493)
...
* musa: refine compute capability
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
cbb88c4050
vulkan: Optimize mul_mat_vec p021 and nc shaders (llama/12505)
...
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
* vulkan: Optimize mul_mat_vec p021 and nc shaders.
These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).
Using subgroupAdd in the p021 shader also helps, use that conditionally.
2025-03-27 11:06:03 +02:00
stduhpf
13455c0b5f
Vulkan: RTE rounding for cpy to quant (llama/12480)
...
* Vulkan: RTE rounding for cpy to quant
Co-Authored-By: Jeff Bolz <jbolz@nvidia.com>
* remove trailing whitespace
* avoid duplicating pipeline_cpy_f32_quant
* fix copypasting issue
* remove duplicated code
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-03-27 11:06:03 +02:00
Eve
2f77a9e9bd
vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (llama/12472)
2025-03-27 11:06:03 +02:00
蕭澧邦
fa2b5249ff
Fix build on Windows when ccache enabled (ggml/9954) (llama/9976)
...
* [SYCL] Fix build on Windows when ccache enabled (llama/9954)
* take effect only on windows and force it to icl
---------
Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-27 11:06:03 +02:00
Svetlozar Georgiev
5b854ebba5
sycl: cleanup oneDNN related code (llama/12097)
2025-03-27 11:06:03 +02:00
Srihari-mcw
8058f19d0b
ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (llama/12332)
...
* Add block interleaving support for Q4_K quantization
* Remove whitespaces and fix CI/CD issues
* Update pointer of bsums from int16_t to const int16_t
* Add vector version of quantize_q8_K_4x8 function
* Update code formatting based on review comments
2025-03-27 11:06:03 +02:00
Gaurav Garg
ae6a9bb9a5
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
...
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1
Fixes Issue: #12182
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-27 11:06:03 +02:00
Jeff Bolz
24faba9e9b
vulkan: optimize iq1 coopmat2 dequant functions (llama/12427)
2025-03-27 11:06:03 +02:00
Guus Waals
c722ff84d3
Fix visionOS build and add CI (llama/12415)
...
* ci: add visionOS build workflow
Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.
* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs
* ci: remove define hacks for u_xxx system types
---------
Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>
2025-03-27 11:06:03 +02:00
Jeff Bolz
102af79f63
vulkan: Submit once enough matmul work has been recorded (llama/12406)
...
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
2025-03-27 11:06:03 +02:00
lhez
03c364557d
opencl: improve profiling (llama/12442)
...
* opencl: more profiling timing
* opencl: generate trace for profiling
* opencl: reduce profiling overhead
* Populate profiling timing info at the end rather than after each
kernel run
* opencl: fix for chrome tracing
2025-03-27 11:06:03 +02:00
R0CKSTAR
31b62276cf
musa: override warp_size of musa device to 32 (llama/12445)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-27 11:06:03 +02:00
Łukasz Ślusarczyk
97b5a3055d
SYCL: using graphs is configurable by environment variable and compile option (llama/12371)
...
* alberto changes
* enable sycl graphs by env variable
* fixed compilation warnings in ggml-sycl.cpp
* renamed graph variables
* fix markdown in docs/backend/SYCL.md
Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
* fix markdown in docs/backend/SYCL.md again
* compiling graphs by default, renamed graph_enable to graph_disable
---------
Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-27 11:06:03 +02:00
fj-y-saito
9993c3f703
ggml : add SVE support for q6_K_q8_K (llama/12361)
2025-03-27 11:06:03 +02:00
0cc4m
fa72479cfb
Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (llama/12434)
2025-03-27 11:06:03 +02:00
Łukasz Ślusarczyk
6c15539c54
fixed compilation warnings in ggml-sycl (llama/12424)
2025-03-27 11:06:03 +02:00