Georgi Gerganov
fe18c29ab8
talk-llama : sync llama.cpp
2024-09-24 19:45:08 +03:00
Eric Zhang
234f9bd320
ggml : add AVX512DQ requirement for AVX512 builds (llama/9622)
2024-09-24 19:45:08 +03:00
Georgi Gerganov
3b183cfae7
log : add CONT level for continuing previous log entry (llama/9610)
2024-09-24 19:45:08 +03:00
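An illustrative sketch of what a CONT level means for a logger, with assumed names (the actual llama.cpp enum may differ): entries at this level append to the previous line instead of starting a new, prefixed one.
```c
// Hypothetical log-level enum; LOG_LEVEL_CONT marks a continuation entry
// that is printed without a timestamp/level prefix, gluing onto the
// previous log line.
enum log_level {
    LOG_LEVEL_DEBUG,
    LOG_LEVEL_INFO,
    LOG_LEVEL_WARN,
    LOG_LEVEL_ERROR,
    LOG_LEVEL_CONT, // continue the previous entry: suppress the prefix
};
```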
Max Krasnyansky
02285dff81
threads: fix msvc build without openmp (llama/9615)
...
We're missing atomic_thread_fence() in MSVC builds when OpenMP is disabled.
2024-09-24 19:45:08 +03:00
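A minimal sketch of the kind of shim this implies, assuming the usual approach of mapping the fence onto a Win32 intrinsic (the wrapper name is illustrative): MSVC's C mode lacks <stdatomic.h>, so without OpenMP the fence has to come from somewhere else.
```c
// Sketch: provide a seq-cst fence on MSVC where <stdatomic.h> is unavailable.
#if defined(_MSC_VER)
#include <windows.h>
static void fence_seq_cst(void) {
    MemoryBarrier(); // full compiler + hardware barrier on MSVC
}
#else
#include <stdatomic.h>
static void fence_seq_cst(void) {
    atomic_thread_fence(memory_order_seq_cst);
}
#endif
```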
Ivan
2fc1d20f9e
cuda: add q8_0->f32 cpy operation (llama/9571)
...
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.
2024-09-24 19:45:08 +03:00
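For reference, a CPU-side sketch of what the new q8_0->f32 copy computes (hedged: the real block stores its scale as fp16, simplified to float here): each 32-element block is the int8 quants scaled by one per-block delta.
```c
#include <stdint.h>

#define QK8_0 32

// Simplified q8_0 block: one scale plus 32 quantized values.
struct block_q8_0 {
    float  d;          // per-block scale (fp16 in the real layout)
    int8_t qs[QK8_0];  // quantized values
};

// Dequantize nb blocks into a contiguous float buffer.
void dequantize_row_q8_0(const struct block_q8_0 * x, float * y, int nb) {
    for (int ib = 0; ib < nb; ++ib) {
        for (int j = 0; j < QK8_0; ++j) {
            y[ib*QK8_0 + j] = x[ib].qs[j] * x[ib].d;
        }
    }
}
```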
Max Krasnyansky
08e8414f27
threads: improve ggml_barrier scaling with large number of threads (llama/9598)
...
Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing.
This optimization shows performance improvements even for cases with n_threads <= 8.
Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write
in the normal case and just use a thread fence as originally intended.
2024-09-24 19:45:08 +03:00
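A sketch of the layout change, assuming a 64-byte cache line and illustrative field names: padding the two counters onto separate lines stops every atomic update of one from evicting the other from the readers' caches (false sharing).
```c
#include <stdatomic.h>

#define CACHE_LINE_SIZE 64 // assumed; typical for x86-64 and many ARM cores

struct barrier_counters {
    _Alignas(CACHE_LINE_SIZE) atomic_int n_barrier;        // threads arrived so far
    _Alignas(CACHE_LINE_SIZE) atomic_int n_barrier_passed; // barrier generation count
};
```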
Srihari-mcw
05c6139625
ggml : AVX512 gemm for Q4_0_8_8 (llama/9532)
...
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0
* Remove zero vector parameter passing
* Rename functions and rearrange order of macros
* Edit comments
* style : minor adjustments
* Update x to start from 0
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-24 19:45:08 +03:00
Georgi Gerganov
896c41ef30
metal : use F32 prec for K*Q in vec FA (llama/9595)
...
ggml-ci
2024-09-24 19:45:08 +03:00
Akarshan Biswas
c36ddc43c6
Revert "[SYCL] fallback mmvq (ggml/9088)" (llama/9579)
...
This reverts commit 50addec9a532a6518146ab837a85504850627316.
2024-09-24 19:45:08 +03:00
R0CKSTAR
13f41af43e
musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (llama/9526)
...
* mtgpu: add mp_21 support
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* mtgpu: enable unified memory
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-24 19:45:08 +03:00
Molly Sophia
3fc5306b82
Fix merge error in #9454 (llama/9589)
...
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-24 19:45:08 +03:00
Johannes Gäßler
adf2474b10
CUDA: enable Gemma FA for HIP/Pascal (llama/9581)
2024-09-24 19:45:08 +03:00
Molly Sophia
008816a257
RWKV v6: RWKV_WKV op CUDA implementation (llama/9454)
...
* ggml: CUDA unary op EXP
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml: rwkv_wkv op CUDA impl
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-24 19:45:08 +03:00
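The commit adds the CUDA version; as a reference for the semantics, a scalar sketch of the element-wise EXP unary op (illustrative names):
```c
#include <math.h>

// y[i] = e^x[i], element-wise; the CUDA op computes the same thing per thread.
void op_exp_f32(int n, float * y, const float * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = expf(x[i]);
    }
}
```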
slaren
33e5a6612e
ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (llama/9573)
2024-09-24 19:45:08 +03:00
agray3
f0a7d65b3d
Update CUDA graph on scale change plus clear nodes/params (llama/9550)
...
* Avoid using saved CUDA graph if scale changes and reset nodes/params on update
Fixes https://github.com/ggerganov/llama.cpp/issues/9451
* clear before resize
2024-09-24 19:45:08 +03:00
Georgi Gerganov
54e5095765
examples : adapt to ggml.h changes (ggml/0)
...
ggml-ci
2024-09-24 19:45:08 +03:00
Georgi Gerganov
34291099fb
ggml : refactoring (llama/#0)
...
- d6a04f87
- 23e0d70b
2024-09-24 19:45:08 +03:00
Georgi Gerganov
d245d7aec7
ggml : fix builds (llama/0)
...
ggml-ci
2024-09-24 19:45:08 +03:00
Georgi Gerganov
d661283e68
ggml : fix trailing whitespace (llama/0)
...
ggml-ci
2024-09-24 19:45:08 +03:00
Johannes Gäßler
c0761c95f5
CUDA: fix sum.cu compilation for CUDA < 11.7 (llama/9562)
2024-09-24 19:45:08 +03:00
slaren
138e20b697
ggml : fix n_threads_cur initialization with one thread (llama/9538)
...
* ggml : fix n_threads_cur initialization with one thread
* Update ggml/src/ggml.c
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
2024-09-24 19:45:08 +03:00
Max Krasnyansky
a8d9abfa22
threadpool : skip polling for unused threads (llama/9461)
...
* threadpool: skip polling for unused threads
Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1).
This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur).
n_threads_cur is now an atomic_int to explicitly tell the thread sanitizer that it is written
from one thread and read from other threads (not a race condition).
* threadpool: further simplify and improve ggml_barrier
Avoid using strict memory order while polling, yet make sure that all threads go through a
full memory barrier (memory fence) on ggml_barrier entrance and exit.
* threads: add simple barrier test
This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.
* threadpool: improve thread sync for new-graphs
Using the same tricks as ggml_barrier: all the polling is done with relaxed memory order
to keep it efficient; once the new graph is detected, we do a full fence using a read-modify-write
with strict memory order.
* threadpool: improve abort handling
Do not use threadpool->ec (exit code) to decide whether to exit the compute loop;
threadpool->ec is not atomic, which makes the thread sanitizer rightfully unhappy about it.
Instead, introduce an atomic threadpool->abort flag for this purpose. This is consistent with
how we handle threadpool->stop and pause.
While at it add an explicit atomic_load for n_threads_cur for consistency.
* test-barrier: release threadpool before releasing the context
Fixes a use-after-free detected by the GCC thread sanitizer on x86-64;
for some reason the LLVM sanitizer does not detect this issue.
2024-09-24 19:45:08 +03:00
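A condensed, illustrative sketch of the polling scheme described in the entry above (not the actual ggml code; names like n_graph are assumptions): unused workers skip polling entirely, and active ones spin with relaxed loads, fencing only once new work is detected.
```c
#include <stdatomic.h>

static atomic_int n_threads_cur; // threads used by the current graph
static atomic_int n_graph;       // bumped when a new graph is posted

static void wait_for_new_graph(int ith, int * last_graph) {
    if (ith >= atomic_load_explicit(&n_threads_cur, memory_order_relaxed)) {
        return; // unused thread: no polling rounds at all
    }
    while (atomic_load_explicit(&n_graph, memory_order_relaxed) == *last_graph) {
        // relaxed spin: cheap, avoids cache-line ping-pong between pollers
    }
    // new graph detected: full fence before touching shared graph data
    atomic_thread_fence(memory_order_seq_cst);
    *last_graph = atomic_load_explicit(&n_graph, memory_order_relaxed);
}
```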
Michael Podvitskiy
195afd6dc1
ggml : link MATH_LIBRARY not by its full path (llama/9339)
2024-09-24 19:45:08 +03:00
Georgi Gerganov
1fd78999e8
cmake : do not hide GGML options + rename option (llama/9465)
...
* cmake : do not hide GGML options
ggml-ci
* build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS
for consistency
ggml-ci
2024-09-24 19:45:08 +03:00
Eve
374e9e0c5e
ggml : IQ4_NL sgemm + Q4_0 AVX optimization (llama/9422)
...
* squashed
Re-add my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049
Have ggml_vec_dot_q4_0 do two blocks per loop for AVX.
Tried an F16C ggml_vec_dot_iq4_nl, but it's not really faster; as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue.
* shuffle
* remove f16c iq4_nl as I can't make it faster than before
2024-09-24 19:45:08 +03:00
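A scalar caricature of the "two blocks per loop" change (the real code uses AVX intrinsics; dot_one_block is a hypothetical helper): unrolling the block loop by two gives the CPU more independent work per iteration.
```c
// Hypothetical per-block dot product; stands in for the vectorized kernel.
float dot_one_block(const void * x, const void * y, int ib);

float vec_dot_blocks(int nb, const void * x, const void * y) {
    float sumf = 0.0f;
    int ib = 0;
    for (; ib + 1 < nb; ib += 2) {
        sumf += dot_one_block(x, y, ib + 0); // two independent block dots
        sumf += dot_one_block(x, y, ib + 1); // in flight per iteration
    }
    for (; ib < nb; ++ib) {
        sumf += dot_one_block(x, y, ib);     // odd trailing block
    }
    return sumf;
}
```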
Georgi Gerganov
a2cb5b4183
metal : handle zero-sized allocs (llama/9466)
2024-09-24 19:45:08 +03:00
Georgi Gerganov
288ae5176e
common : reimplement logging (llama/9418)
...
https://github.com/ggerganov/llama.cpp/pull/9418
2024-09-24 19:45:08 +03:00
Michael Podvitskiy
d868122a5a
cmake : correct order of sycl flags (llama/9497)
2024-09-24 19:45:08 +03:00
Michael Podvitskiy
2ba25fb122
cmake : try to fix sycl+intel build (llama/9487)
2024-09-24 19:45:08 +03:00
Yuri Khrustalev
4f4687cb74
ggml : ggml_type_name return "NONE" for invalid values (llama/9458)
...
When running on Windows, the quantization utility attempts to print types that are not set, which leads to a crash.
2024-09-24 19:45:08 +03:00
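The shape of the fix, sketched with an illustrative name table (the real function indexes ggml's internal type traits): never index past the end; out-of-range values print as "NONE".
```c
enum demo_type { DEMO_TYPE_F32, DEMO_TYPE_F16, DEMO_TYPE_COUNT };

static const char * demo_type_names[DEMO_TYPE_COUNT] = { "f32", "f16" };

const char * demo_type_name(enum demo_type type) {
    // guard against unset/invalid values instead of indexing unchecked
    return ((int) type >= 0 && type < DEMO_TYPE_COUNT)
        ? demo_type_names[type] : "NONE";
}
```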
Georgi Gerganov
66b00fad0d
cmake : use list(APPEND ...) instead of set() + dedup linker (llama/9463)
...
* cmake : use list(APPEND ...) instead of set() + dedup linker
ggml-ci
* cmake : try fix sycl
* cmake : try to fix sycl 2
* cmake : fix sycl build (llama/9469)
* try fix sycl build
* use CMAKE_CXX_FLAGS as a string variable
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* one more CMAKE_CXX_FLAGS fix (llama/9471)
---------
Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>
2024-09-24 19:45:08 +03:00
Dou Xinpeng
c6cc8d16c3
cann: Add host buffer type for Ascend NPU (llama/9406)
...
* feat: Add host buffer type for Ascend NPU (CANN backend)
* fix some checking errors
* Add a few comments
2024-09-24 19:45:08 +03:00
Ahmad Tameem
3f8f8a78a2
riscv : modify Makefile and add a RISCV_VECT to print log info (llama/9442)
...
- Added ggml_cpu_has_riscv_v() in GGML to print system info in the log
- Modified Makefile to only use the flag when cross-compiling for RISC-V
2024-09-24 19:45:08 +03:00
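A plausible minimal form of the helper named above (hedged: the predicate macro is what GCC/Clang define when the RISC-V vector intrinsics are available; the real function may differ):
```c
// Reports whether this build was compiled with RISC-V Vector support.
int ggml_cpu_has_riscv_v(void) {
#if defined(__riscv_v_intrinsic)
    return 1;
#else
    return 0;
#endif
}
```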
Xinpeng Dou
3e47686919
cann: Fix error when running a non-exist op (llama/9424)
2024-09-24 19:45:08 +03:00
Johannes Gäßler
a53b69a003
CUDA: fix --split-mode row race condition (llama/9413)
2024-09-24 19:45:08 +03:00
R0CKSTAR
d1c9b47360
musa: remove Clang builtins mapping (llama/9421)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-24 19:45:08 +03:00
Alberto Cabrera Pérez
32f659861a
sycl : update support conditions (llama/9394)
...
* sycl : update support condition to im2col
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* Added a TODO as a reminder to support FP32 im2col
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
2024-09-24 19:45:08 +03:00
Georgi Gerganov
a785232bf9
metal : fix compile warning with GGML_METAL_NDEBUG (llama/0)
2024-09-24 19:45:08 +03:00
Radoslav Gerganov
0677293503
rpc : fix segfault with nkvo (llama/9389)
...
* rpc : fix nkvo
* rpc : buf_size must not be static
ref: #9337
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-09-24 19:45:08 +03:00
Prashant Vithule
1fbdb813c0
ggml : vector length agnostic SVE support (llama/9290)
...
* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths
* Removed whitespace
* ggml : style changes + fix 512-bit nb loop check
- fix local scope in switch cases
- consistent predicate names
- empty lines when necessary
- opening braces, spaces
- const-correctness
- add asserts
* Update ggml/src/ggml-quants.c
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-24 19:45:08 +03:00
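A skeleton of the vector-length-agnostic dispatch described above (kernel bodies elided; svcntb() reports the hardware's SVE register width in bytes):
```c
#include <arm_sve.h>

void vec_dot_sve_dispatch(void) {
    switch (svcntb() * 8) {     // SVE register width in bits
        case 512: /* 512-bit kernel */ break;
        case 256: /* 256-bit kernel */ break;
        case 128: /* 128-bit kernel */ break;
        default:  /* scalar fallback for other widths */ break;
    }
}
```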
Johannes Gäßler
67725ac8f3
CUDA: fix variable name conflict for Windows build (llama/9382)
2024-09-24 19:45:08 +03:00
Markus Tavenrath
dac89af357
Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (llama/9118)
...
* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.
* fix compile issues
* Fix issues where the last submit wasn't executed or handled properly.
* remove trailing whitespace
* Repair GGML_VULKAN_CHECK_RESULTS
* Increase submit counter only if actual work has been submitted and increase submit count to 100.
* Fix: some nodes were not checked when GGML_VULKAN_CHECK_RESULTS is enabled.
2024-09-24 19:45:08 +03:00
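A hedged sketch of the submission strategy (illustrative, not the actual backend code): end and submit the command buffer every N recorded nodes, so the GPU starts executing while the CPU keeps recording. The 100-node threshold comes from the commit message above.
```c
#include <vulkan/vulkan.h>

// Assumes enough pre-allocated command buffers in cmdbufs for the graph.
void record_and_submit(VkQueue queue, VkCommandBuffer * cmdbufs, int n_nodes,
                       void (*record_node)(VkCommandBuffer, int)) {
    const int nodes_per_submit = 100; // submit count from the commit above
    int cb = 0;
    const VkCommandBufferBeginInfo begin = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
    };
    vkBeginCommandBuffer(cmdbufs[cb], &begin);
    for (int i = 0; i < n_nodes; ++i) {
        record_node(cmdbufs[cb], i);
        if ((i + 1) % nodes_per_submit == 0 || i + 1 == n_nodes) {
            vkEndCommandBuffer(cmdbufs[cb]);
            const VkSubmitInfo si = {
                .sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO,
                .commandBufferCount = 1,
                .pCommandBuffers    = &cmdbufs[cb],
            };
            vkQueueSubmit(queue, 1, &si, VK_NULL_HANDLE); // GPU starts early
            if (i + 1 < n_nodes) {
                vkBeginCommandBuffer(cmdbufs[++cb], &begin);
            }
        }
    }
}
```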
Georgi Gerganov
26225f1fb0
cuda : fix FA Q src index (1 -> 0) (llama/9374)
2024-09-24 19:45:08 +03:00
Neo Zhang Jianyu
3468983315
add check malloc result on device (llama/9346)
...
* add check malloc result on device
* update for review comments, check all malloc_device() results
---------
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-09-24 19:45:08 +03:00
Johannes Gäßler
c7515b0995
ggml/examples: add backend support for numerical optimization (ggml/949)
...
* CUDA eval works
* stochastic gradient descent op
* Adam except decay
* CUDA CROSS_ENTROPY_LOSS_BACK
* CUDA mnist-fc training works
* backend CLI arg
* refactor gguf load
* remove sched from opt_step_adam
* implement l1 regularization (weight decay)
* extra call to add optimizer
* initialize gradients with ggml_graph_reset
* gradient accumulation
* increment iter per eval instead of epoch
* adjust backend interfaces
* fix ggml_graph_reset without backend
* fix ggml graph export/import
* fixup
* rename
* revert ggml_opt changes
* more general CUDA repeat_back
* update documentation, fix CNN
* validation split
* add clarifying comment
* optimize PyTorch training
* adjust buffer size, thread count
* fix 0.0f validation split
* Update examples/mnist/mnist-common.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix gradient accumulation
* tensor flag for accumulators -> tensor hash set
* Update include/ggml.h
Co-authored-by: slaren <slarengh@gmail.com>
* Update tests/test-backend-ops.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* Update tests/test-backend-ops.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* fix test prints
* Update src/ggml-backend.c
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* better CUDA support for noncontiguous out_prod
* add comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-09-24 19:45:08 +03:00
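As a scalar reference for the optimizer-step ops this series adds (the real ops operate on ggml tensors across backends; this is only the arithmetic): SGD with a decay term folded into the gradient.
```c
// One SGD step with weight decay: w <- w - lr * (g + decay * w).
void sgd_step(float * w, const float * g, int n, float lr, float decay) {
    for (int i = 0; i < n; ++i) {
        w[i] -= lr * (g[i] + decay * w[i]);
    }
}
```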
Georgi Gerganov
253ce30004
examples : add null threadpool args where needed (ggml/0)
...
ggml-ci
2024-09-24 19:45:08 +03:00
Georgi Gerganov
03a6fae484
metal : update support condition for im2col + fix warning (llama/0)
2024-09-24 19:45:08 +03:00
slaren
d37fd275fd
ggml : always check bounds on get_rows operations (llama/9354)
2024-09-24 19:45:08 +03:00
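The shape of the guard, sketched for f32 rows (illustrative; the real op handles many types): validate every row id against the source tensor before gathering.
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

void get_rows_f32(float * dst, const float * src, const int64_t * ids,
                  int64_t n_ids, int64_t n_rows_src, int64_t row_len) {
    for (int64_t i = 0; i < n_ids; ++i) {
        const int64_t r = ids[i];
        assert(r >= 0 && r < n_rows_src); // always bounds-check the row index
        memcpy(dst + i*row_len, src + r*row_len, row_len*sizeof(float));
    }
}
```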
Xuan Son Nguyen
195877fd72
ggml : fix missing cpu_set_t on emscripten (llama/9336)
...
* ggml : fix missing cpu_set_t on emscripten
* better version
* bring back android part
2024-09-24 19:45:08 +03:00
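A sketch of the guard pattern implied above (hedged: the actual fix may use a different predicate): Emscripten has no cpu_set_t / sched_setaffinity, so affinity code must be compiled out there.
```c
#if defined(__linux__) && !defined(__EMSCRIPTEN__)
#define _GNU_SOURCE
#include <sched.h>
static void pin_to_cpu(int cpu) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}
#else
static void pin_to_cpu(int cpu) { (void) cpu; } // no-op where unsupported
#endif
```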
Markus Tavenrath
9e715e1b96
Improve Vulkan shader build system (llama/9239)
...
* Improve Vulkan shader build system
- Add a dependency on vulkan-shaders-gen so that shaders are rebuilt when the shader compilation utility changes.
- Add an option to generate debug info for Vulkan shaders, to provide shader source to Vulkan shader profiling tools.
* remove the unnecessary self dependency
2024-09-24 19:45:08 +03:00