Commit Graph

1991 Commits

Author SHA1 Message Date
Ouadie EL FAROUKI
df2c364de7 Fixed dequant precision issues in Q4_1 and Q5_1 (llama/9711) 2024-10-05 15:23:51 +03:00
Diego Devesa
1acfadb721 ggml-backend : add device and backend reg interfaces (llama/9707)
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-10-05 15:23:51 +03:00
Alberto Cabrera Pérez
ea642144d2 Initial cmake support of SYCL for AMD GPUs (llama/9658)
sycl: initial cmake support of SYCL for AMD GPUs
2024-10-05 15:23:51 +03:00
Radoslav Gerganov
282a8654c4 vulkan : do not use tensor->extra (llama/9407)
* vulkan : do not use tensor->extra

This patch allows using the Vulkan backend with the RPC backend as
tensor->extra is no longer used.

Ref: #8536

* Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (llama/2)

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-10-05 15:23:51 +03:00
Johannes Gäßler
936cf3beb7 ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-05 15:23:51 +03:00
Johannes Gäßler
bc92c2f8f0 ggml: refactor cross entropy loss CPU impl. (ggml/976) 2024-10-05 15:23:51 +03:00
Georgi Gerganov
f7d55e0614 scripts : sync ggml-backend.cpp 2024-10-05 15:23:51 +03:00
Georgi Gerganov
f62a546e03
whisper : fix excessive memory usage (#2443)
* whisper : fix KV cache allocation

* whisper : reduce memory overhead from unused input tensors
2024-10-05 12:36:40 +03:00
Rahul Vadhyar
2944cb72d9
examples : update dr_wav.h to newer version (#2449) 2024-10-04 11:04:51 +03:00
Georgi Gerganov
ccc2547210 talk-llama : sync llama.cpp 2024-10-03 12:22:17 +03:00
Georgi Gerganov
162a455402 metal : reduce command encoding overhead (llama/9698) 2024-10-03 12:22:17 +03:00
Georgi Gerganov
ff2cb0811f sync : ggml 2024-10-03 12:22:17 +03:00
Johannes Gäßler
5e9d6baa48 test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974) 2024-10-03 12:22:17 +03:00
Salvatore Mesoraca
845f8d663e vulkan : mul_mat: fix UB with small warps (ggml/952)
When the device's warp size is less than 16,
it is possible for loadstride_a (mul_mm.comp:114)
and loadstride_b (mul_mm.comp:115) to be set to 0.
Because they are calculated as: the workgroup size,
multiplied by LOAD_VEC_* (which can be 1) and divided by 16.
And the workgroup size is set to be the same as the
warp/subgroup size.

The loadstride_* variables are used as increments in the
loops that populate the buffers used for the multiplication.

When they are 0 they cause an infinite loop.
But infinite loops without side-effects are UB and the
values of loadstride_* are known at compile time.
So, the compiler quietly optimizes all the loops away.
As a consequence, the buffers are not populated and
the multiplication result is just a matrix with all elements
set to 0.

We prevent the UB by making sure that the workgroup size
will never be less than 16, even if our device has a
smaller warp size (e.g. 8).

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-10-03 12:22:17 +03:00
Borislav Stanimirov
31fdf05fda ggml : fix ggml_cast (ggml/973) 2024-10-03 12:22:17 +03:00
Johannes Gäßler
0ac6666cd2 ggml: fix gradient allocation logic (ggml/966)
* ggml: fix gradient allocation logic

* gradient allocation in ggml_build_backward_expand

* fixup

* fix test-backend-ops grad

* suggestions by slaren

* fix test1.c

* fix legacy opt API

* fix test-grad0

* remove keep arg
2024-10-03 12:22:17 +03:00
Georgi Gerganov
6c91da80b8 ggml : define missing HWCAP flags (llama/9684)
ggml-ci

Co-authored-by: Willy Tarreau <w@1wt.eu>
2024-10-03 12:22:17 +03:00
Dan Johansson
c245168ba3 ggml : add run-time detection of neon, i8mm and sve (llama/9331)
* ggml: Added run-time detection of neon, i8mm and sve

Adds run-time detection of the Arm instructions set features
neon, i8mm and sve for Linux and Apple build targets.

* ggml: Extend feature detection to include non aarch64 Arm arch

* ggml: Move definition of ggml_arm_arch_features to the global data section
2024-10-03 12:22:17 +03:00
Markus Tavenrath
280fee8fa0 Enable use to the rebar feature to upload buffers to the device. (llama/9251) 2024-10-03 12:22:17 +03:00
R0CKSTAR
78b4c1c25f mtgpu: enable VMM (llama/9597)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-10-03 12:22:17 +03:00
Charles Xu
1edea2eb4b ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (llama/9217)
* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels

* added fallback mechanism when the offline re-quantized model is not
optimized for the underlying target.

* fix for build errors

* remove prints from the low-level code

* Rebase to the latest upstream
2024-10-03 12:22:17 +03:00
Dou Xinpeng
96808786b7 cann: fix crash when llama-bench is running on multiple cann devices (llama/9627) 2024-10-03 12:22:17 +03:00
Johannes Gäßler
bb57ecb85e CUDA: remove bad assert (ggml/972) 2024-10-03 12:22:17 +03:00
Jeff Bolz
abdb73c7cc vulkan : multithread pipeline creation (ggml/963) 2024-10-03 12:22:17 +03:00
Jeff Bolz
391e548a43 vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961) 2024-10-03 12:22:17 +03:00
Salvatore Mesoraca
2a29afd4c6 vulkan : argsort barriers must be under uniform control flow (ggml/951)
a return before a barrier (that happens only in some threads in
a workgroup) leads to UB.
While the old code actually works on some devices,
it fails on some others (i.e. "smaller" GPUs).

BTW, I think it would be better to set specialization constants
when the graph is built, in that way the local workgroup
could be sized appropriately.
But it would take a lot of work.

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-10-03 12:22:17 +03:00
Georgi Gerganov
5963004ff9 ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969) 2024-10-03 12:22:17 +03:00
gilbertgong
ede1718f6d
server : ffmpeg overwrite leftover temp file (#2431)
* Remove possible leftover ffmpeg temp file from a previous failed conversion

* Revert "Remove possible leftover ffmpeg temp file from a previous failed conversion"

This reverts commit 00797403bd.

* Flag to force ffmpeg to overwrite output file if it exists
2024-10-02 15:06:40 +03:00
Georgi Gerganov
2ef717b293
whisper : add large-v3-turbo (#2440) 2024-10-01 15:57:06 +03:00
Georgi Gerganov
8feb375fbd
tests : remove test-backend-ops (#2434) 2024-09-27 11:49:01 +03:00
Georgi Gerganov
69339af2d1
ci : disable failing CUDA and Java builds 2024-09-25 10:05:04 +03:00
Hugo
0d2e2aed80
readme : fix references to download-ggml-model.sh (#2427)
The script itself has a hashbang indicating that it is a shell script,
but the README indicates that it must be executed with `bash`.

I checked the script itself, and it seems to be valid POSIX shell. I can
confirm that it works with busybox sh.

Clarify the reference on the README, so it is clear that bash is not
actually a dependency for this script.
2024-09-24 21:07:51 +03:00
Georgi Gerganov
451e9ee92c make : remove "talk" target until updated 2024-09-24 19:45:08 +03:00
Georgi Gerganov
1133ac98a8 ggml : add ggml-cpu-impl.h (skip) (#0) 2024-09-24 19:45:08 +03:00
Georgi Gerganov
76d27eec9a sync : ggml 2024-09-24 19:45:08 +03:00
Georgi Gerganov
fe18c29ab8 talk-llama : sync llama.cpp 2024-09-24 19:45:08 +03:00
Eric Zhang
234f9bd320 ggml : add AVX512DQ requirement for AVX512 builds (llama/9622) 2024-09-24 19:45:08 +03:00
Georgi Gerganov
3b183cfae7 log : add CONT level for continuing previous log entry (llama/9610) 2024-09-24 19:45:08 +03:00
Max Krasnyansky
02285dff81 threads: fix msvc build without openmp (llama/9615)
We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.
2024-09-24 19:45:08 +03:00
Ivan
2fc1d20f9e cuda: add q8_0->f32 cpy operation (llama/9571)
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.
2024-09-24 19:45:08 +03:00
Max Krasnyansky
08e8414f27 threads: improve ggml_barrier scaling with large number of threads (llama/9598)
Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing.
This optimization shows performance improvements even for n_threads <= 8 cases.

Resurect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write
in the normal case and just use thread-fence as originally intended.
2024-09-24 19:45:08 +03:00
Srihari-mcw
05c6139625 ggml : AVX512 gemm for Q4_0_8_8 (llama/9532)
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit commments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-24 19:45:08 +03:00
Georgi Gerganov
896c41ef30 metal : use F32 prec for K*Q in vec FA (llama/9595)
ggml-ci
2024-09-24 19:45:08 +03:00
Akarshan Biswas
c36ddc43c6 Revert "[SYCL] fallback mmvq (ggml/9088)" (llama/9579)
This reverts commit 50addec9a532a6518146ab837a85504850627316.
2024-09-24 19:45:08 +03:00
R0CKSTAR
13f41af43e musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (llama/9526)
* mtgpu: add mp_21 support

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: enable unified memory

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-24 19:45:08 +03:00
Molly Sophia
3fc5306b82 Fix merge error in #9454 (llama/9589)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-24 19:45:08 +03:00
Johannes Gäßler
adf2474b10 CUDA: enable Gemma FA for HIP/Pascal (llama/9581) 2024-09-24 19:45:08 +03:00
Molly Sophia
008816a257 RWKV v6: RWKV_WKV op CUDA implementation (llama/9454)
* ggml: CUDA unary op EXP

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: rwkv_wkv op CUDA impl

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-24 19:45:08 +03:00
slaren
33e5a6612e ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (llama/9573) 2024-09-24 19:45:08 +03:00
agray3
f0a7d65b3d Update CUDA graph on scale change plus clear nodes/params (llama/9550)
* Avoid using saved CUDA graph if scale changes and reset nodes/params on update

Fixes https://github.com/ggerganov/llama.cpp/issues/9451

* clear before resize
2024-09-24 19:45:08 +03:00