whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-12-30 09:18:51 +00:00

Author	SHA1	Message	Date
Ma Mingfei	e1936eb2a5	add amx kernel for gemm (llama/8998) add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend	2024-11-01 10:19:05 +02:00
Diego Devesa	28b044dad9	vulkan : add backend registry / device interfaces (llama/9721) * vulkan : add backend registry / device interfaces * llama : print devices used on model load	2024-11-01 10:19:05 +02:00
Gilad S	b8f11a0a17	fix: allocating CPU buffer with size `0` (llama/9917)	2024-11-01 10:19:05 +02:00
Gilad S	ff5a838099	fix: use `vm_allocate` to allocate CPU backend buffer on macOS (llama/9875) * fix: use `vm_allocate` to allocate CPU backend buffer on macOS * fix: switch to `posix_memalign` to keep existing `free()` usages work * feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS * style: formatting * fix: move const outside of `#ifndef` * style: formatting * fix: unused var * fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h` * fix: unused var * fix: page align to `GGUF_DEFAULT_ALIGNMENT` * fix: page align to `TENSOR_ALIGNMENT` * fix: convert `TENSOR_ALIGNMENT` to a macro * fix: increase page size to `32` on iOS * fix: iOS page size * fix: `hbw_posix_memalign` alignment	2024-11-01 10:19:05 +02:00
Johannes Gäßler	84713613be	CUDA: fix 1D im2col, add tests (ggml/993)	2024-11-01 10:19:05 +02:00
leo-pony	ded89c9d08	Fix cann compilation error (llama/9891) Fix cann compilation error after merging llama.cpp supports dynamically loadable backends.	2024-11-01 10:19:05 +02:00
agray3	042e95d92f	Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816) * Vectorize load instructions in dmmv f16 CUDA kernel Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup. * addressed comment * Update ggml/src/ggml-cuda/dmmv.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-11-01 10:19:05 +02:00
Diego Devesa	81110c0174	ggml : move more prints to the ggml log system (llama/9839) * ggml : move more prints to the ggml log system * show BLAS OpenMP warnings in all builds using debug print	2024-11-01 10:19:05 +02:00
Diego Devesa	c313723860	rpc : add backend registry / device interfaces (llama/9812) * rpc : add backend registry / device interfaces * llama : add llama_supports_rpc API * ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server	2024-11-01 10:19:05 +02:00
R0CKSTAR	e69b2371e2	musa: add docker image support (llama/9685) * mtgpu: add docker image support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-11-01 10:19:05 +02:00
Diego Devesa	1531259b2c	ggml : fix BLAS with unsupported types (llama/9775) * ggml : do not use BLAS with types without to_float * ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies * ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits it's not really internal if everybody uses it	2024-11-01 10:19:05 +02:00
Diego Devesa	44bc2767fd	ggml : add backend registry / device interfaces to BLAS backend (llama/9752) * ggml : add backend registry / device interfaces to BLAS backend * fix mmap usage when using host buffers	2024-11-01 10:19:05 +02:00
Andrew Minh Nguyen	bd7ace7adc	Update building for Android (llama/9672) * docs : clarify building Android on Termux * docs : update building Android on Termux * docs : add cross-compiling for Android * cmake : link dl explicitly for Android	2024-11-01 10:19:05 +02:00
Georgi Gerganov	315364d7de	ggml : add metal backend registry / device (llama/9713) * ggml : add metal backend registry / device ggml-ci * metal : fix names [no ci] * metal : global registry and device instances ggml-ci * cont : alternative initialization of global objects ggml-ci * llama : adapt to backend changes ggml-ci * fixes * metal : fix indent * metal : fix build when MTLGPUFamilyApple3 is not available ggml-ci * fix merge * metal : avoid unnecessary singleton accesses ggml-ci * metal : minor fix [no ci] * metal : g_state -> g_ggml_ctx_dev_main [no ci] * metal : avoid reference of device context in the backend context ggml-ci * metal : minor [no ci] * metal : fix maxTransferRate check * metal : remove transfer rate stuff --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-11-01 10:19:05 +02:00
Paul Tsochantaris	80753d4da8	metal : single allocation of encode_async block (llama/9747) * Single allocation of encode_async block with non-ARC capture in ggml-metal.m * Moving Block_release to the deallocation code * Release encode block when re-setting encoding buffer count if needed * Update ggml/src/ggml-metal.m --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-01 10:19:05 +02:00
Daniel Bevenius	8f9bdca4c4	ggml-alloc : remove buffer_id from leaf_alloc (ggml/987) This commit removes the buffer_id field from the leaf_alloc struct. The motivation is that this field is only written to and never read/used as far as I can tell. Each tensor_alloc has a buffer_id field and this is what caused me to look into this more closely, to understand what the buffer_id in leaf_alloc was used for.	2024-11-01 10:19:05 +02:00
Georgi Gerganov	aa037a60f3	ggml : alloc ggml_contexts on the heap (#2525) * whisper : reduce ggml_context usage * ggml : allocate contexts on the heap (v2) * ggml : aligned malloc -> malloc	2024-10-31 22:00:09 +02:00
SRHMorris	9f346d0084	vulkan : retry allocation with fallback flags (#2451) Co-authored-by: Samuel Morris <samuel.morris@artlist.io>	2024-10-06 10:34:20 +03:00
Georgi Gerganov	1ba185f4af	metal : zero-init buffer contexts (#0)	2024-10-05 15:23:51 +03:00
Georgi Gerganov	941912467d	whisper : adapt to latest ggml (skip) (#0)	2024-10-05 15:23:51 +03:00
Daniel Bevenius	0b1b094a67	ggml : fix typo in example usage ggml_gallocr_new (ggml/984)	2024-10-05 15:23:51 +03:00
Diego Devesa	40e52a76b9	ggml : fixes after sync (ggml/983) ggml : remove test-backend-buffer ggml : fix CUDA build warnings	2024-10-05 15:23:51 +03:00
Diego Devesa	cf977670e6	ggml-backend : add device and backend reg interfaces (llama/9707) Also: - metal : fix compute pass descriptor autorelease crash - ggml-backend : add device description to CPU backend - ggml: unify backend logging mechanism	2024-10-05 15:23:51 +03:00
Ouadie EL FAROUKI	df2c364de7	Fixed dequant precision issues in Q4_1 and Q5_1 (llama/9711)	2024-10-05 15:23:51 +03:00
Diego Devesa	1acfadb721	ggml-backend : add device and backend reg interfaces (llama/9707) Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-10-05 15:23:51 +03:00
Alberto Cabrera Pérez	ea642144d2	Initial cmake support of SYCL for AMD GPUs (llama/9658) sycl: initial cmake support of SYCL for AMD GPUs	2024-10-05 15:23:51 +03:00
Radoslav Gerganov	282a8654c4	vulkan : do not use tensor->extra (llama/9407) * vulkan : do not use tensor->extra This patch allows using the Vulkan backend with the RPC backend as tensor->extra is no longer used. Ref: #8536 * Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (llama/2) --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-10-05 15:23:51 +03:00
Johannes Gäßler	936cf3beb7	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-05 15:23:51 +03:00
Johannes Gäßler	bc92c2f8f0	ggml: refactor cross entropy loss CPU impl. (ggml/976)	2024-10-05 15:23:51 +03:00
Georgi Gerganov	162a455402	metal : reduce command encoding overhead (llama/9698)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	5e9d6baa48	test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)	2024-10-03 12:22:17 +03:00
Salvatore Mesoraca	845f8d663e	vulkan : mul_mat: fix UB with small warps (ggml/952) When the device's warp size is less than 16, it is possible for loadstride_a (mul_mm.comp:114) and loadstride_b (mul_mm.comp:115) to be set to 0. Because they are calculated as: the workgroup size, multiplied by LOAD_VEC_* (which can be 1) and divided by 16. And the workgroup size is set to be the same as the warp/subgroup size. The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication. When they are 0 they cause an infinite loop. But infinite loops without side-effects are UB and the values of loadstride_* are known at compile time. So, the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0. We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8). Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-10-03 12:22:17 +03:00
Borislav Stanimirov	31fdf05fda	ggml : fix ggml_cast (ggml/973)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	0ac6666cd2	ggml: fix gradient allocation logic (ggml/966) * ggml: fix gradient allocation logic * gradient allocation in ggml_build_backward_expand * fixup * fix test-backend-ops grad * suggestions by slaren * fix test1.c * fix legacy opt API * fix test-grad0 * remove keep arg	2024-10-03 12:22:17 +03:00
Georgi Gerganov	6c91da80b8	ggml : define missing HWCAP flags (llama/9684) ggml-ci Co-authored-by: Willy Tarreau <w@1wt.eu>	2024-10-03 12:22:17 +03:00
Dan Johansson	c245168ba3	ggml : add run-time detection of neon, i8mm and sve (llama/9331) * ggml: Added run-time detection of neon, i8mm and sve Adds run-time detection of the Arm instructions set features neon, i8mm and sve for Linux and Apple build targets. * ggml: Extend feature detection to include non aarch64 Arm arch * ggml: Move definition of ggml_arm_arch_features to the global data section	2024-10-03 12:22:17 +03:00
Markus Tavenrath	280fee8fa0	Enable use of the rebar feature to upload buffers to the device. (llama/9251)	2024-10-03 12:22:17 +03:00
R0CKSTAR	78b4c1c25f	mtgpu: enable VMM (llama/9597) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-10-03 12:22:17 +03:00
Charles Xu	1edea2eb4b	ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (llama/9217) * ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels * added fallback mechanism when the offline re-quantized model is not optimized for the underlying target. * fix for build errors * remove prints from the low-level code * Rebase to the latest upstream	2024-10-03 12:22:17 +03:00
Dou Xinpeng	96808786b7	cann: fix crash when llama-bench is running on multiple cann devices (llama/9627)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	bb57ecb85e	CUDA: remove bad assert (ggml/972)	2024-10-03 12:22:17 +03:00
Jeff Bolz	abdb73c7cc	vulkan : multithread pipeline creation (ggml/963)	2024-10-03 12:22:17 +03:00
Jeff Bolz	391e548a43	vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)	2024-10-03 12:22:17 +03:00
Salvatore Mesoraca	2a29afd4c6	vulkan : argsort barriers must be under uniform control flow (ggml/951) a return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on some others (i.e. "smaller" GPUs). BTW, I think it would be better to set specialization constants when the graph is built, in that way the local workgroup could be sized appropriately. But it would take a lot of work. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-10-03 12:22:17 +03:00
Georgi Gerganov	5963004ff9	ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)	2024-10-03 12:22:17 +03:00
Georgi Gerganov	1133ac98a8	ggml : add ggml-cpu-impl.h (skip) (#0)	2024-09-24 19:45:08 +03:00
Eric Zhang	234f9bd320	ggml : add AVX512DQ requirement for AVX512 builds (llama/9622)	2024-09-24 19:45:08 +03:00
Georgi Gerganov	3b183cfae7	log : add CONT level for continuing previous log entry (llama/9610)	2024-09-24 19:45:08 +03:00
Max Krasnyansky	02285dff81	threads: fix msvc build without openmp (llama/9615) We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.	2024-09-24 19:45:08 +03:00
Ivan	2fc1d20f9e	cuda: add q8_0->f32 cpy operation (llama/9571) llama: enable K-shift for quantized KV cache It will fail on unsupported backends or quant types.	2024-09-24 19:45:08 +03:00
Max Krasnyansky	08e8414f27	threads: improve ggml_barrier scaling with large number of threads (llama/9598) Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing. This optimization shows performance improvements even for n_threads <= 8 cases. Resurrect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write in the normal case and just use thread-fence as originally intended.	2024-09-24 19:45:08 +03:00
Srihari-mcw	05c6139625	ggml : AVX512 gemm for Q4_0_8_8 (llama/9532) * AVX512 version of ggml_gemm_q4_0_8x8_q8_0 * Remove zero vector parameter passing * Rename functions and rearrange order of macros * Edit comments * style : minor adjustments * Update x to start from 0 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-24 19:45:08 +03:00
Georgi Gerganov	896c41ef30	metal : use F32 prec for K*Q in vec FA (llama/9595) ggml-ci	2024-09-24 19:45:08 +03:00
Akarshan Biswas	c36ddc43c6	Revert "[SYCL] fallback mmvq (ggml/9088)" (llama/9579) This reverts commit 50addec9a532a6518146ab837a85504850627316.	2024-09-24 19:45:08 +03:00
R0CKSTAR	13f41af43e	musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (llama/9526) * mtgpu: add mp_21 support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable unified memory Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-24 19:45:08 +03:00
Molly Sophia	3fc5306b82	Fix merge error in #9454 (llama/9589) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-24 19:45:08 +03:00
Johannes Gäßler	adf2474b10	CUDA: enable Gemma FA for HIP/Pascal (llama/9581)	2024-09-24 19:45:08 +03:00
Molly Sophia	008816a257	RWKV v6: RWKV_WKV op CUDA implementation (llama/9454) * ggml: CUDA unary op EXP Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: rwkv_wkv op CUDA impl Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-24 19:45:08 +03:00
slaren	33e5a6612e	ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (llama/9573)	2024-09-24 19:45:08 +03:00
agray3	f0a7d65b3d	Update CUDA graph on scale change plus clear nodes/params (llama/9550) * Avoid using saved CUDA graph if scale changes and reset nodes/params on update Fixes https://github.com/ggerganov/llama.cpp/issues/9451 * clear before resize	2024-09-24 19:45:08 +03:00
Georgi Gerganov	54e5095765	examples : adapt to ggml.h changes (ggml/0) ggml-ci	2024-09-24 19:45:08 +03:00
Georgi Gerganov	34291099fb	ggml : refactoring (llama/#0) - d6a04f87 - 23e0d70b	2024-09-24 19:45:08 +03:00
Georgi Gerganov	d245d7aec7	ggml : fix builds (llama/0) ggml-ci	2024-09-24 19:45:08 +03:00
Georgi Gerganov	d661283e68	ggml : fix trailing whitespace (llama/0) ggml-ci	2024-09-24 19:45:08 +03:00
Johannes Gäßler	c0761c95f5	CUDA: fix sum.cu compilation for CUDA < 11.7 (llama/9562)	2024-09-24 19:45:08 +03:00
slaren	138e20b697	ggml : fix n_threads_cur initialization with one thread (llama/9538) * ggml : fix n_threads_cur initialization with one thread * Update ggml/src/ggml.c --------- Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2024-09-24 19:45:08 +03:00
Max Krasnyansky	a8d9abfa22	threadpool : skip polling for unused threads (llama/9461) * threadpool: skip polling for unused threads Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1). This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur). n_threads_cur is now an atomic_int to explicitly tell thread sanitizer that it is written from one thread and read from other threads (not a race condition). * threadpool: further simplify and improve ggml_barrier Avoid using strict memory order while polling, yet make sure that all threads go through full memory barrier (memory fence) on ggml_barrier entrance and exit. * threads: add simple barrier test This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead. * threadpool: improve thread sync for new-graphs Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order to keep it efficient, once the new graph is detected we do full fence using read-modify-write with strict memory order. * threadpool: improve abort handling Do not use threadpool->ec (exit code) to decide whether to exit the compute loop. threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it. Instead introduce atomic threadpool->abort flag used for this. This is consistent with how we handle threadpool->stop or pause. While at it add an explicit atomic_load for n_threads_cur for consistency. * test-barrier: release threadpool before releasing the context fixes use-after-free detected by gcc thread-sanitizer on x86-64; for some reason llvm sanitizer is not detecting this issue.	2024-09-24 19:45:08 +03:00
Michael Podvitskiy	195afd6dc1	ggml : link MATH_LIBRARY not by its full path (llama/9339)	2024-09-24 19:45:08 +03:00
Georgi Gerganov	1fd78999e8	cmake : do not hide GGML options + rename option (llama/9465) * cmake : do not hide GGML options ggml-ci * build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS for consistency ggml-ci	2024-09-24 19:45:08 +03:00
Eve	374e9e0c5e	ggml : IQ4_NL sgemm + Q4_0 AVX optimization (llama/9422) * squashed readd my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049 have ggml_vec_dot_q4_0 do two blocks per loop for avx try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue * shuffle * remove f16c iq4_nl as i cant make it faster than before	2024-09-24 19:45:08 +03:00
Georgi Gerganov	a2cb5b4183	metal : handle zero-sized allocs (llama/9466)	2024-09-24 19:45:08 +03:00
Georgi Gerganov	288ae5176e	common : reimplement logging (llama/9418) https://github.com/ggerganov/llama.cpp/pull/9418	2024-09-24 19:45:08 +03:00
Michael Podvitskiy	d868122a5a	cmake : correct order of sycl flags (llama/9497)	2024-09-24 19:45:08 +03:00
Michael Podvitskiy	2ba25fb122	cmake : try to fix sycl+intel build (llama/9487)	2024-09-24 19:45:08 +03:00
Yuri Khrustalev	4f4687cb74	ggml : ggml_type_name return "NONE" for invalid values (llama/9458) When running on Windows, the quantization utility attempts to print the types that are not set which leads to a crash.	2024-09-24 19:45:08 +03:00
Georgi Gerganov	66b00fad0d	cmake : use list(APPEND ...) instead of set() + dedup linker (llama/9463) * cmake : use list(APPEND ...) instead of set() + dedup linker ggml-ci * cmake : try fix sycl * cmake : try to fix sycl 2 * cmake : fix sycl build (llama/9469) * try fix sycl build * use CMAKE_CXX_FLAGS as a string variable --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * one more CMAKE_CXX_FLAGS fix (llama/9471) --------- Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>	2024-09-24 19:45:08 +03:00
Dou Xinpeng	c6cc8d16c3	cann: Add host buffer type for Ascend NPU (llama/9406) * feat: Add host buffer type for Ascend NPU(CANN backend) * fix some checking errors * Add a few comments	2024-09-24 19:45:08 +03:00
Ahmad Tameem	3f8f8a78a2	riscv : modify Makefile and add a RISCV_VECT to print log info (llama/9442) - Added ggml_cpu_has_riscv_v() in GGML to print system info in log - Modified Makefile to only use flag when cross compiling for RISC-V	2024-09-24 19:45:08 +03:00
Xinpeng Dou	3e47686919	cann: Fix error when running a non-existent op (llama/9424)	2024-09-24 19:45:08 +03:00
Johannes Gäßler	a53b69a003	CUDA: fix --split-mode row race condition (llama/9413)	2024-09-24 19:45:08 +03:00
R0CKSTAR	d1c9b47360	musa: remove Clang builtins mapping (llama/9421) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-24 19:45:08 +03:00
Alberto Cabrera Pérez	32f659861a	sycl : update support conditions (llama/9394) * sycl : update support condition to im2col Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> * Added TODO to remind supporting FP32 im2col --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>	2024-09-24 19:45:08 +03:00
Georgi Gerganov	a785232bf9	metal : fix compile warning with GGML_METAL_NDEBUG (llama/0)	2024-09-24 19:45:08 +03:00
Radoslav Gerganov	0677293503	rpc : fix segfault with nkvo (llama/9389) * rpc : fix nkvo * rpc : buf_size must not be static ref: #9337 --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-09-24 19:45:08 +03:00
Prashant Vithule	1fbdb813c0	ggml : vector length agnostic SVE support (llama/9290) * Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths * Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths * Removed WhiteSpaces * ggml : style changes + fix 512-bit nb loop check - fix local scope in switch cases - consistent predicate names - empty lines when necessary - opening braces, spaces - const-correctness - add asserts * Update ggml/src/ggml-quants.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-24 19:45:08 +03:00
Johannes Gäßler	67725ac8f3	CUDA: fix variable name conflict for Windows build (llama/9382)	2024-09-24 19:45:08 +03:00
Markus Tavenrath	dac89af357	Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (llama/9118) * Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. * fix compile issues * Fix issues where the last submit wasn't executed or handled properly. * remove trailing whitespace * Repair GGML_VULKAN_CHECK_RESULTS * Increase submit counter only if actual work has been submitted and increase submit count to 100. * Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.	2024-09-24 19:45:08 +03:00
Georgi Gerganov	26225f1fb0	cuda : fix FA Q src index (1 -> 0) (llama/9374)	2024-09-24 19:45:08 +03:00
Neo Zhang Jianyu	3468983315	add check malloc result on device (llama/9346) * add check malloc result on device * update for review comments, check all malloc_device() result --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2024-09-24 19:45:08 +03:00
Johannes Gäßler	c7515b0995	ggml/examples: add backend support for numerical optimization (ggml/949) * CUDA eval works * stochastic gradient descent op * Adam except decay * CUDA CROSS_ENTROPY_LOSS_BACK * CUDA mnist-fc training works * backend CLI arg * refactor gguf load * remove sched from opt_step_adam * implement l1 regularization (weight decay) * extra call to add optimizer * initialize gradients with ggml_graph_reset * gradient accumulation * increment iter per eval instead of epoch * adjust backend interfaces * fix ggml_graph_reset without backend * fix ggml graph export/import * fixup * rename * revert ggml_opt changes * more general CUDA repeat_back * update documentation, fix CNN * validation split * add clarifying comment * optimize PyTorch training * adjust buffer size, thread count * fix 0.0f validation split * Update examples/mnist/mnist-common.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix gradient accumulation * tensor flag for accumulators -> tensor hash set * Update include/ggml.h Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * fix test prints * Update src/ggml-backend.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * better CUDA support for noncontiguous out_prod * add comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-09-24 19:45:08 +03:00
Georgi Gerganov	253ce30004	examples : add null threadpool args where needed (ggml/0) ggml-ci	2024-09-24 19:45:08 +03:00
Georgi Gerganov	03a6fae484	metal : update support condition for im2col + fix warning (llama/0)	2024-09-24 19:45:08 +03:00
slaren	d37fd275fd	ggml : always check bounds on get_rows operations (llama/9354)	2024-09-24 19:45:08 +03:00
Xuan Son Nguyen	195877fd72	ggml : fix missing `cpu_set_t` on emscripten (llama/9336) * ggml : fix missing cpu_set_t on emscripten * better version * bring back android part	2024-09-24 19:45:08 +03:00
Markus Tavenrath	9e715e1b96	Improve Vulkan shader build system (llama/9239) * Improve Vulkan shader builds system - Add dependency to vulkan-shaders-gen to rebuild shaders when changing the shader compilation utility. - Add option to generate debug info for Vulkan shaders to provide shader source to Vulkan shader profiling tools * remove not required self dependency	2024-09-24 19:45:08 +03:00
compilade	6f5514b6e2	ggml-quants : ternary packing for TriLMs and BitNet b1.58 (llama/8151) * ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b * ggml-quants : faster 1.625 bpw AVX2 vec_dot Not using a lookup table anymore makes it match q4_0 speed. * gguf-py : fix formatting * llama : remove spaces on empty line * ggml-quants : subtract 1 when back in epi8 This makes the 1.625 bpw type go faster than q4_0. Still not the fastest. * ggml-quants : Q2_2 now faster than Q4_K on with AVX2 * ggml-quants : cleanup Q1_3 code formatting * ggml-quants : ARM NEON vec_dot for q2_2 and q1_3 * ggml-quants : use ceiling division when quantizing q1_3 * convert-hf : simplify BitNet pre-quantization This still results in the exact same tensor weights and scales, but it reveals some weirdness in the current algorithm. * convert-hf : allow converting the weird BitNet 1.3B Its FFN size is 5460 which is not convenient. The offending tensors are kept in F16, which makes the final model 5.01 bpw. * bitnet : replace 1.58b with b1.58, as in the paper * ggml-quants : fix build failure on Windows * ggml-quants : attempt to fix Arm 32-bit support * ggml : add some informative comments in q1_3 vec_dot * ggml : add TQ1_0 and TQ2_0 ternary quantization types * ggml : even faster TQ2_0 * ggml : also faster TQ1_0 Same optimization as for TQ2_0 by offsetting the sum instead of the weights. This makes TQ1_0 almost as fast as Q8_0 on AVX2. * ggml : fix build issues in certain environments * ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0 * ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat The compiler seems smart enough to use the same instruction even when using vget_high_s8 instead. * ggml : remove q1_3 and q2_2 No more 1.625 bpw and 2.000 bpw, now instead using 1.6875 bpw and 2.0625 bpw with TQ1_0 and TQ2_0, respectively. * llama : remove the separate scale tensors of BitNet b1.58 They won't be needed, since the remaining ternary quant types have built-in scales. * ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency * ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot Not yet tested on hardware which supports it, might not work or might not even compile. But also it might. It should make the performance better on recent ARM CPUs. * ggml-quants : remove comment about possible format change of TQ2_0 Making it slightly more convenient for AVX512 but less convenient for everything else is not worth the trouble. * gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0 * ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0 This does not change anything for ternary models, since their values should never end up being in halfway cases anyway. * convert : allow direct conversion to TQ1_0 and TQ2_0 The token embeddings and output tensors are kept in F16 to allow quantizing them to Q4_K and Q6_K with llama-quantize. * llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0 Q4_0 is not completely symmetric (so not lossless for ternary models), but it should be good enough. * ggml-quants : allow using ARM dot product instructions for TQ1_0 * ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support * ggml : remove unused ggml_mul special case It would otherwise conflict with the more general optimization coming with Mamba-2. * ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators * test-backend-ops : add TQ1_0 and TQ2_0 comments for later Not yet adding uncommented, because some backends like SYCL and Metal do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT. (and Metal also doesn't handle it with GGML_OP_GET_ROWS) Support for TQ1_0 and TQ2_0 for other backends than CPU will be added in follow-up pull requests.	2024-09-24 19:45:08 +03:00
slaren	709a22b92d	cuda : fix defrag with quantized KV (llama/9319)	2024-09-24 19:45:08 +03:00
Srihari-mcw	01e214a1d7	ggml : AVX2 support for Q4_0_8_8 (llama/8713) * Add AVX2 based implementations for quantize_q8_0_4x8, ggml_gemv_q4_0_8x8_q8_0 and ggml_gemm_q4_0_8x8_q8_0 functions * Update code to fix issues occurring due to non alignment of elements to be processed as multiple of 16 in MSVC * Update comments and indentation * Make updates to reduce number of load instructions	2024-09-24 19:45:08 +03:00
Ouadie EL FAROUKI	1cecfe6a02	Fix DMMV dequantization (llama/9279) Fixed dmmv dequant for ncols== GGML_SYCL_DMMV_X	2024-09-24 19:45:08 +03:00
yuri@FreeBSD	3764bc974c	ggml : add pthread includes on FreeBSD (llama/9258)	2024-09-24 19:45:08 +03:00
Molly Sophia	fcffc912a9	llama : support RWKV v6 models (llama/8980) * convert_hf_to_gguf: Add support for RWKV v6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add RWKV tokenization * Fix build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Do not use special tokens when matching in RWKV tokenizer * Fix model loading * Add (broken) placeholder graph builder for RWKV * Add workaround for kv cache * Add logits conversion to rwkv5 * Add rwkv5 layer norms * Add time mix KVRG & correct merge mistake * Add remaining time mix parameters * Add time mix output loading * Add placeholder llm_build_time_mix * Fix build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Load more tensors for rwkv v6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix rwkv tokenizer Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add unary operator Exp Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV v6 graph building Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add ``rescale_every_n_layers`` parameter Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add ``wkv.head_size`` key for RWKV so it doesn't reuse Mamba ssm parameters Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix offloading layers to CUDA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix parallel inferencing for RWKV Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Remove trailing whitespaces Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * build_rwkv: Avoid using inplace operations Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * convert_hf_to_gguf: rwkv: Avoid using ``eval`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * convert_hf_to_gguf: rwkv tokenizer: Don't escape sequences manually Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * ggml: Add backward computation for unary op ``exp`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * Use MODEL_ARCH.RWKV6 instead of MODEL_ARCH.RWKV Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * build_rwkv6: Simplify graph Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Detect model.type Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Fix tensor loading for 7B/14B models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Fix group_norm assertion failure with Metal Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Clean up Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Add quantization tensor exclusion Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Use the new advanced batch splits Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * llama: rwkv6: Use ``ggml_norm`` instead of ``ggml_group_norm`` Co-authored-by: compilade <git@compilade.net> * llama: rwkv6: Apply code style and misc changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * converter: Use class name ``Rwkv6Model`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Make use of key ``feed_forward_length`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Add kv ``time_mix_extra_dim`` and ``time_decay_extra_dim`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * converter: Match ``new_name`` instead of ``name`` for float32 explicit tensors Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Keep ``time_mix_w1/w2`` as F32 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Remove unused nodes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Apply code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Add lora for some supported tensors Currently att.key/receptance/value/gate/output, ffn.receptance/key/value, as well as head.weight Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * rwkv : speed-up tokenization using trie * minor : style + indentation * llama: rwkv6: Avoid division by zero Co-authored-by: compilade <git@compilade.net> * ggml: rwkv_wkv: Avoid copying the state Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Layl Bongers <3094382+LaylBongers@users.noreply.github.com> Co-authored-by: compilade <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-24 19:45:08 +03:00
Faisal Zaghloul	38d40b9972	Threadpool: take 2 (llama/8672) * Introduce ggml_compute_threadpool - OpenMP functional: check - Vanilla ggml functional: Check - ggml w/threadpool functional: Check - OpenMP no regression: No glaring problems - Vanilla ggml no regression: No glaring problems - ggml w/threadpool no regression: No glaring problems * Minor fixes * fixed use after release bug * fixed a harmless race condition * Fix Android bulid issue * fix more race conditions * fix deadlock for cases where cgraph.n_nodes == 1 and fix --poll case * threadpool: use cpu_get_num_math to set the default number of threadpool threads This way we avoid using E-Cores and Hyperthreaded siblings. * bench: create fresh threadpool for each test For benchmarking it's better to start a fresh pool for each test with the exact number of threads needed for that test. Having larger pools is suboptimal (causes more load, etc). * atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior. * threadpool: make polling the default to match openmp behavior All command line args now allow for setting poll to 0 (false). * threadpool: do not wakeup threads in already paused threadpool * fix potential race condition in check_for_work * threadpool: do not create two threadpools if their params are identical * threadpool: reduce pause/resume/wakeup overhead in common cases We now start threadpool in paused state only if we have two. The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead. * threadpool: add support for hybrid polling poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var. poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ... The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms. 
We can tune this further as things evolve. * threadpool: reduce the number of barrier required New work is now indicated with an atomic counter that is incremented for each new graph that needs to be computed. This removes the need for extra barrier for clearing the "new_work" and removes the special case for trivial graphs. * threadpool: remove special-casing for disposable threadpools With the efficient hybrid polling there is no need to make disposable pools any different. This simplifies the overall logic and reduces branching. Include n_threads in debug print for disposable threadpool. Declare pause and stop flags as atomic_bool This doesn't actually generate any memory barriers and simply informs the thread sanitizer that these flags can be written & read by different threads without locking. * threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs) This fixes the race condition with very small graphs where the main thread happens to start a new graph while the workers are just about to exit from barriers. * threadpool: use relaxed order for chunk sync Full memory barrier is an overkill for this since each thread works on different chunk * threadpool: remove abort_callback from threadpool state * threadpool: better naming for thread/cpumask releated functions * threadpool: consistent use of int type for n_threads params * threadpool: add support for ggml_threadpool_params_default/init Also removes the need for explicit mask_specified param. all-zero cpumask means use default (usually inherited) cpu affinity mask. 
* threadpool: move typedef into ggml.h * threadpool: fix apply_priority() function name * threadpool: fix swift wrapper errors due to n_threads int type cleanup * threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled * threadpool: replace checks for compute_thread ret code with proper status check * threadpool: simplify threadpool init logic and fix main thread affinity application Most of the init code is now exactly the same between threadpool and openmp. * threadpool: update threadpool resume/pause function names * threadpool: enable openmp by default for now * threadpool: don't forget to free workers state when omp is enabled * threadpool: avoid updating process priority on the platforms that do not require it On Windows we need to change overall process priority class in order to set thread priorities, but on Linux, Mac, etc we do not need to touch the overall process settings. * threadpool: update calling thread prio and affinity only at start/resume This avoids extra syscalls for each graph_compute() * llama-bench: turn threadpool params into vectors, add output headers, etc * llama-bench: add support for cool off between tests --delay This helps for long running tests on platforms that are thermally limited (phones, laptops, etc). --delay (disabled by default) introduces the sleep for N seconds before starting each test. * threadpool: move process priority setting into the apps (bench and cli) This avoids changing the overall process priority on Windows for the apps that use ggml/llama.cpp directy. * threadpool: move all pause/resume logic into ggml * threadpool: futher api cleanup and prep for future refactoring All threadpool related functions and structs use ggml_threadpool prefix. 
* threadpool: minor indent fixes * threadpool: improve setpriority error message * Update examples/llama-bench/llama-bench.cpp Co-authored-by: slaren <slarengh@gmail.com> * threadpool: fix indent in set_threadpool call * use int32_t for n_thread type in public llama.cpp API * threadpool: use _new and _free instead of _create and _release * fix two more public APIs to use int32_t for n_threads * build: set _GNU_SOURCE for Android --------- Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com> Co-authored-by: fmz <quic_fzaghlou@quic.com> Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-09-24 19:45:08 +03:00
Salvatore Mesoraca	09149ee0ae	vulkan: fix compilation with GGML_VULKAN_DEBUG=ON (ggml/948) the old code was trying to print a non-existent field (size) and the struct as a whole (which doesn't have a operator<< override defined). Probably a typo happened during refactoring. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-09-24 19:45:08 +03:00
Salvatore Mesoraca	6b7f37dd5c	vulkan: add dryrun support to sin and cos ops (ggml/947) sin and cos failed test-backend-ops because they tried to dereference a context pointer that is null on dry runs. This commit prevents that segfault. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-09-24 19:45:08 +03:00
Salvatore Mesoraca	791812fb54	vulkan: correctly report support for OP_CONT (ggml/946) test-backend-ops fails because ggml_cont aborts when invoked passing an unsupported type. This commit makes ggml_cont tests pass Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-09-24 19:45:08 +03:00
Johannes Gäßler	5d6dc19f04	tests: add gradient tests for all backends (ggml/932) * tests: add gradient checking to test-backend-ops * remove old comment * reorder includes * adjust SIN/COS parameters * add documentation, use supports_op if possible	2024-09-24 19:45:08 +03:00
Johannes Gäßler	6eb7a0ffbd	ggml: fix ggml_graph_cpy undefined behavior (ggml/943)	2024-09-02 15:24:50 +03:00
Georgi Gerganov	e8f0f9b5f0	cann : fix doxy (ggml/0)	2024-09-02 15:24:50 +03:00
Georgi Gerganov	d8e24b877d	vulkan : fix build (llama/0) ggml-ci	2024-09-02 15:24:50 +03:00
Georgi Gerganov	cc68f31577	cuda : mark BF16 CONT as unsupported	2024-09-02 15:24:50 +03:00
Salvatore Mesoraca	4a4a52bf98	ggml : fix cont with transposed tensors when one dimension is 1 (ggml/934) * ggml_cont: fix issue with transposed tensors when one dimension is 1 when using multiple threads, it is not enough to check for the tensors to be contiguous for ggml_compute_forward_dup_same_cont to work correctly. The tensors strides also need to match. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> * Add ggml_cont tests Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> * Remove dead code it isn't possible to reach this code because all these functions are invoked by ggml_compute_forward_dup if and only if src0->type != dst->type Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> * Make ggml_compute_forward_dup_same_cont work with contiguous tensors Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> --------- Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-02 15:24:50 +03:00
Georgi Gerganov	82b5c56f63	sync : vulkan (skip) (llama/0)	2024-08-28 13:22:20 +03:00
slaren	b2ad484c89	ggml : do not crash when quantizing q4_x_x with an imatrix (llama/9192)	2024-08-28 13:22:20 +03:00
Georgi Gerganov	d96a17848f	metal : separate scale and mask from QKT in FA kernel (llama/9189) * metal : separate scale and mask from QKT in FA kernel * metal : ne01 check no longer necessary * metal : keep data in local memory	2024-08-28 13:22:20 +03:00
Georgi Gerganov	0e7798677a	ggml : add SSM Metal kernels (llama/8546) * ggml : add ggml_ssm_conv metal impl * ggml : add ssm_scan metal impl ggml-ci	2024-08-28 13:22:20 +03:00
slaren	58a36d2e3b	metal : gemma2 flash attention support (llama/9159)	2024-08-28 13:22:20 +03:00
Johannes Gäßler	24d8534bd8	CPU/CUDA: Gemma 2 FlashAttention support (llama/8542) * CPU/CUDA: Gemma 2 FlashAttention support * apply logit_softcap to scale in kernel * disable logit softcapping tests on Metal * remove metal check	2024-08-28 13:22:20 +03:00
Akarshan Biswas	9b16ddd3a5	Add a space to suppress a cmake warning (llama/9133)	2024-08-28 13:22:20 +03:00
luoyu-intel	32f88af17b	Add oneDNN primitive support (llama/9091) * add onednn * add sycl_f16 * add dnnl stream * add engine map * use dnnl for intel only * use fp16fp16fp16 * update doc	2024-08-28 13:22:20 +03:00
compilade	9bf7250bf9	llama : simplify Mamba with advanced batch splits (llama/8526) * llama : advanced batch splits This includes equal-sequence-length batch splits which are useful to simplify recurrent model operators. * llama : always make recurrent state slots contiguous * ggml : simplify mamba operators * llama : fix integer signedness mixing * llama : logits_all has priority over batch->logits Otherwise, the server embeddings tests failed. This was likely an existing problem but was only detected here because of an additional assertion. * llama : apply suggestions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * llama : fix t5 segfault * llama : fix Mamba session save and restore * llama : minor cosmetic changes * llama : rename llama_reorder_outputs to llama_output_reorder Also move it closer to llama_output_reserve. * llama : fix pooled embeddings when using batches with equal_seqs * minor : add struct members for clarity ggml-ci * llama : fix T5 segfault again * llama : fix Mamba pooled embeddings with multiple sequences Until the pooled embeddings are refactored to allow splitting across ubatches for causal embeddings, recurrent models can only process a single sequence per ubatch when calculating pooled embeddings. * llama : add llama_model_is_recurrent to simplify figuring that out This will make it easier to more cleanly support RWKV-v6 and Mamba-2. * llama : fix simple splits when the batch contains embeddings --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-28 13:22:20 +03:00
Meng, Hengyu	17e49d3ab2	fallback mmvq (llama/9088) * fallback mmvq to mul_mat * mmvq in cuda path * Update ggml/src/ggml-sycl.cpp Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com> --------- Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>	2024-08-28 13:22:20 +03:00
zhentaoyu	58b725282a	Fix SYCL `im2col` and `convert` Overflow with Large Dims (llama/9052) * sycl: fix im2col overflow and sync with cuda Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix convert overflow Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix convert and dequantize Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix ib in dmmv Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl:refine convert Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: move downsample global_range into common Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: add im2col and convert test cases Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: make new cases only in sycl Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: comment new test_cases for only local testing Signed-off-by: zhentaoyu <zhentao.yu@intel.com> --------- Signed-off-by: zhentaoyu <zhentao.yu@intel.com>	2024-08-28 13:22:20 +03:00
Radoslav Gerganov	7e59afa1e0	rpc : print error message when failed to connect endpoint (llama/9042)	2024-08-28 13:22:20 +03:00
Radoslav Gerganov	5ac022140e	rpc : prevent crashes on invalid input (llama/9040) Add more checks which prevent RPC server from crashing if invalid input is received from client	2024-08-28 13:22:20 +03:00
Nico Bosshard	0eaa67280c	ggml : dynamic ggml_sched_max_splits based on graph_size (llama/9047) * ggml : Dynamic ggml_sched_max_splits based on graph_size * Fixed and readded debug code for causes	2024-08-28 13:22:20 +03:00
Georgi Gerganov	5a62fdb735	cmake : remove unused option GGML_CURL (llama/9011)	2024-08-28 13:22:20 +03:00
Daniel Bevenius	60098d6204	ggml : move rope type enum to ggml.h (llama/8949) * ggml : move rope type enum to ggml.h This commit moves the `llama_rope_type` enum from `llama.h` to `ggml.h` and changes its name to `ggml_rope_type`. The motivation for this change is to address the TODO in `llama.h` and use the enum in ggml. Note: This commit does not change the `mode` parameter to be of type `enum ggml_rope_type`. The name `mode` and its usage suggest that it might be more generic and possibly used as a bit field for multiple flags. Further investigation/discussion may be needed to determine if `mode` should be restricted to RoPE types. * squash! ggml : move rope type enum to ggml.h This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from ggml.h, and back the llama_rope_type enum. I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is safe to remove it yet. * squash! ggml : move rope type enum to ggml.h This commit removes the enum ggml_rope_type from ggml.h and replaces it with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has been updated to reflect this change. * squash! ggml : move rope type enum to ggml.h This commit contains a suggestion enable the GGML_ROPE_TYPE_NEOX macro/define to be passed to the shader compiler. * squash! ggml : move rope type enum to ggml.h This commit fixes the editorconfig-checker warnings. * squash! ggml : move rope type enum to ggml.h Update comment for ggml_rope function. * Revert "squash! ggml : move rope type enum to ggml.h" This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6. * squash! ggml : move rope type enum to ggml.h Add GGML_ROPE_TYPE_NEOX to rope_common.comp. * remove extra line --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-28 13:22:20 +03:00
DavidKorczynski	317293e6a7	ggml: fix div-by-zero (llama/9003) Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724 In order to access the above bug you need to login using one of the emails in https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5 Signed-off-by: David Korczynski <david@adalogics.com>	2024-08-28 13:22:20 +03:00
Markus Tavenrath	488a966c07	Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (llama/8943) * Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. - Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove. - ggml_vk_sync_buffer introduce a full pipeline sync which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader read/writes and transfers seems to be sufficient looking at the code which either launches compute kernels or copies tensors. * Fix small typo --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-08-28 13:22:20 +03:00
Johannes Gäßler	8954769aa2	feat: ref. cross entropy, add CUDA, fix grad test (ggml/929)	2024-08-28 13:22:20 +03:00
Johannes Gäßler	df06468d9e	ggml: remove bad assert (ggml/928)	2024-08-28 13:22:20 +03:00
Johannes Gäßler	1fbd828a5d	examples: add MNIST training + missing ops	2024-08-28 13:22:20 +03:00
Georgi Gerganov	9e3c5345cd	sync : ggml vulkan (ggml/0) ggml-ci	2024-08-21 11:07:13 +03:00
Radoslav Gerganov	b6c05ce82f	yolo : add backend support (ggml/924) * yolo : add backend support * metal : add sub and sqrt kernels --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-21 11:07:13 +03:00
Daniel Bevenius	52c80cac00	ggml : fix typo in ggml-quants.c comment (ggml/922)	2024-08-21 11:07:13 +03:00
Ronsor	3643120690	feat: add new `sin` and `cos` operators (ggml/919) * ggml : add sin/cos operators * ggml-cuda : add sin/cos operators * ggml : add corresponding tests for sin/cos * ggml : add backward computation for sin/cos operators * ggml-vulkan : add sin/cos operators * ggml-vulkan : add sin/cos shader source * metal : add sin, cos --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-21 11:07:13 +03:00
Salvatore Mesoraca	993f0df419	ggml : support forward pass broadcasting in ggml_sub (ggml/914) * ggml: support forward pass broadcasting in ggml_sub Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> * Use assert instead of GGML_ASSERT in ggml_compute_forward_sub_f32 The check is already performed in ggml_sub_impl Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> --------- Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-08-12 11:58:49 +03:00
slaren	9b1788483c	metal : fix uninitialized abort_callback (llama/8968)	2024-08-12 11:58:49 +03:00
Georgi Gerganov	ad37d26983	rpc : sanitize tensor data + warnings (llama/0) Co-authored-by: slaren <slarengh@gmail.com>	2024-08-12 11:58:46 +03:00
Mengqing Cao	81c999fe0a	cann : add Ascend NPU support (#2336) * enable Ascend NPU in src/whisper.cpp * sync test-backend-ops with llama.cpp	2024-08-09 15:21:56 +03:00
hipudding	be88ee1d75	ggml : add CANN backend (llama/0) ggml-ci	2024-08-09 09:58:16 +03:00
slaren	ee14c02365	ggml-backend : fix async copy from CPU (llama/8897) * ggml-backend : fix async copy from CPU * cuda : more reliable async copy, fix stream used when the devices are the same	2024-08-08 22:48:46 +03:00
Ouadie EL FAROUKI	ab39dd34e1	Updated SYCL device filtering (llama/8901) * Updated device filter to depend on default_selector (fixes non-intel device issues) * Small related update to example/sycl Readme	2024-08-08 22:48:46 +03:00
Johannes Gäßler	b1348d3530	CUDA/HIP: fix tests/test-backend-ops (llama/8896)	2024-08-08 22:48:46 +03:00
Johannes Gäßler	90641b5cf4	CUDA: fix padding logic for FP16/FP32 (llama/8884)	2024-08-08 22:48:46 +03:00
Molly Sophia	4160b930f1	ggml : add epsilon as a parameter for group_norm (llama/8818) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-08-08 22:48:46 +03:00
Justine Tunney	7a96e661e4	ggml : fix overflows in elu function (llama/8866) It's helpful to use expm1f(x), because expf(x)-1 will result in overflow for 25% of single-precision floating point numbers.	2024-08-08 22:48:46 +03:00
jdomke	a902fb4ab2	ggml : reading the runtime sve config of the cpu (llama/8709) * ggml : reading the runtime sve config of the cpu * change to one time init to prevent performance drop * prefix variable to avoid possible conflicts * revert xxhash fix and add brackets --------- Co-authored-by: domke <673751-domke@users.noreply.gitlab.com>	2024-08-08 22:48:46 +03:00
Sigbjørn Skjæret	6cb38c3673	Fix conversion of unnormalized BF16->BF16 weights (llama/7843) * add truncate_bf16 * truncate intermediate fp32 if converting bf16 to bf16 * fix masking in __compute_fp32_to_bf16 * np.int16 no longer used * missing cast and additional numpy 2.x fix * ggml-impl : do not flush bf16 subnormals to zero * ggml : add reference fp32 to bf16 conversion The fast version is no longer equivalent for all platforms because of the handling of subnormal values. * gguf-py : remove flush to zero for bf16 subnormals * gguf-py : remove float32 truncation to bf16 Rounding achieves the same thing in the cases where this was used. * missed prototype update in merge * merge cleanup --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2024-08-08 22:48:46 +03:00
Ouadie EL FAROUKI	9cf14ebcbc	Fixing wrong VDR iq4nl value (llama/8812)	2024-08-08 22:48:46 +03:00

326 Commits