whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-05-31 06:20:58 +00:00

Author	SHA1	Message	Date
Charles Xu	1edea2eb4b	ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (llama/9217) * ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels * added fallback mechanism when the offline re-quantized model is not optimized for the underlying target. * fix for build errors * remove prints from the low-level code * Rebase to the latest upstream	2024-10-03 12:22:17 +03:00
Dou Xinpeng	96808786b7	cann: fix crash when llama-bench is running on multiple cann devices (llama/9627)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	bb57ecb85e	CUDA: remove bad assert (ggml/972)	2024-10-03 12:22:17 +03:00
Jeff Bolz	abdb73c7cc	vulkan : multithread pipeline creation (ggml/963)	2024-10-03 12:22:17 +03:00
Jeff Bolz	391e548a43	vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)	2024-10-03 12:22:17 +03:00
Salvatore Mesoraca	2a29afd4c6	vulkan : argsort barriers must be under uniform control flow (ggml/951) a return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on some others (i.e. "smaller" GPUs). BTW, I think it would be better to set specialization constants when the graph is built, in that way the local workgroup could be sized appropriately. But it would take a lot of work. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-10-03 12:22:17 +03:00
Georgi Gerganov	5963004ff9	ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)	2024-10-03 12:22:17 +03:00
gilbertgong	ede1718f6d	server : ffmpeg overwrite leftover temp file (#2431) * Remove possible leftover ffmpeg temp file from a previous failed conversion * Revert "Remove possible leftover ffmpeg temp file from a previous failed conversion" This reverts commit 00797403bd43ebcb1e0678989a4fc676d417b4af. * Flag to force ffmpeg to overwrite output file if it exists	2024-10-02 15:06:40 +03:00
Georgi Gerganov	2ef717b293	whisper : add large-v3-turbo (#2440)	2024-10-01 15:57:06 +03:00
Georgi Gerganov	8feb375fbd	tests : remove test-backend-ops (#2434)	2024-09-27 11:49:01 +03:00
Georgi Gerganov	69339af2d1	ci : disable failing CUDA and Java builds	2024-09-25 10:05:04 +03:00
Hugo	0d2e2aed80	readme : fix references to download-ggml-model.sh (#2427) The script itself has a hashbang indicating that it is a shell script, but the README indicates that it must be executed with `bash`. I checked the script itself, and it seems to be valid POSIX shell. I can confirm that it works with busybox sh. Clarify the reference on the README, so it is clear that bash is not actually a dependency for this script.	2024-09-24 21:07:51 +03:00
Georgi Gerganov	451e9ee92c	make : remove "talk" target until updated	2024-09-24 19:45:08 +03:00
Georgi Gerganov	1133ac98a8	ggml : add ggml-cpu-impl.h (skip) (#0)	2024-09-24 19:45:08 +03:00
Georgi Gerganov	76d27eec9a	sync : ggml	2024-09-24 19:45:08 +03:00
Georgi Gerganov	fe18c29ab8	talk-llama : sync llama.cpp	2024-09-24 19:45:08 +03:00
Eric Zhang	234f9bd320	ggml : add AVX512DQ requirement for AVX512 builds (llama/9622)	2024-09-24 19:45:08 +03:00
Georgi Gerganov	3b183cfae7	log : add CONT level for continuing previous log entry (llama/9610)	2024-09-24 19:45:08 +03:00
Max Krasnyansky	02285dff81	threads: fix msvc build without openmp (llama/9615) We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.	2024-09-24 19:45:08 +03:00
Ivan	2fc1d20f9e	cuda: add q8_0->f32 cpy operation (llama/9571) llama: enable K-shift for quantized KV cache It will fail on unsupported backends or quant types.	2024-09-24 19:45:08 +03:00
Max Krasnyansky	08e8414f27	threads: improve ggml_barrier scaling with large number of threads (llama/9598) Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing. This optimization shows performance improvements even for n_threads <= 8 cases. Resurrect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write in the normal case and just use thread-fence as originally intended.	2024-09-24 19:45:08 +03:00
Srihari-mcw	05c6139625	ggml : AVX512 gemm for Q4_0_8_8 (llama/9532) * AVX512 version of ggml_gemm_q4_0_8x8_q8_0 * Remove zero vector parameter passing * Rename functions and rearrange order of macros * Edit comments * style : minor adjustments * Update x to start from 0 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-24 19:45:08 +03:00
Georgi Gerganov	896c41ef30	metal : use F32 prec for K*Q in vec FA (llama/9595) ggml-ci	2024-09-24 19:45:08 +03:00
Akarshan Biswas	c36ddc43c6	Revert "[SYCL] fallback mmvq (ggml/9088)" (llama/9579) This reverts commit 50addec9a532a6518146ab837a85504850627316.	2024-09-24 19:45:08 +03:00
R0CKSTAR	13f41af43e	musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (llama/9526) * mtgpu: add mp_21 support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable unified memory Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-24 19:45:08 +03:00
Molly Sophia	3fc5306b82	Fix merge error in #9454 (llama/9589) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-24 19:45:08 +03:00
Johannes Gäßler	adf2474b10	CUDA: enable Gemma FA for HIP/Pascal (llama/9581)	2024-09-24 19:45:08 +03:00
Molly Sophia	008816a257	RWKV v6: RWKV_WKV op CUDA implementation (llama/9454) * ggml: CUDA unary op EXP Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: rwkv_wkv op CUDA impl Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-24 19:45:08 +03:00
slaren	33e5a6612e	ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (llama/9573)	2024-09-24 19:45:08 +03:00
agray3	f0a7d65b3d	Update CUDA graph on scale change plus clear nodes/params (llama/9550) * Avoid using saved CUDA graph if scale changes and reset nodes/params on update Fixes https://github.com/ggerganov/llama.cpp/issues/9451 * clear before resize	2024-09-24 19:45:08 +03:00
Georgi Gerganov	54e5095765	examples : adapt to ggml.h changes (ggml/0) ggml-ci	2024-09-24 19:45:08 +03:00
Georgi Gerganov	34291099fb	ggml : refactoring (llama/#0) - d6a04f87 - 23e0d70b	2024-09-24 19:45:08 +03:00
Georgi Gerganov	d245d7aec7	ggml : fix builds (llama/0) ggml-ci	2024-09-24 19:45:08 +03:00
Georgi Gerganov	d661283e68	ggml : fix trailing whitespace (llama/0) ggml-ci	2024-09-24 19:45:08 +03:00
Johannes Gäßler	c0761c95f5	CUDA: fix sum.cu compilation for CUDA < 11.7 (llama/9562)	2024-09-24 19:45:08 +03:00
slaren	138e20b697	ggml : fix n_threads_cur initialization with one thread (llama/9538) * ggml : fix n_threads_cur initialization with one thread * Update ggml/src/ggml.c --------- Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2024-09-24 19:45:08 +03:00
Max Krasnyansky	a8d9abfa22	threadpool : skip polling for unused threads (llama/9461) * threadpool: skip polling for unused threads Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1). This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur). n_threads_cur is now an atomic_int to explicitly tell thread sanitizer that it is written from one thread and read from other threads (not a race condition). * threadpool: further simplify and improve ggml_barrier Avoid using strict memory order while polling, yet make sure that all threads go through full memory barrier (memory fence) on ggml_barrier entrance and exit. * threads: add simple barrier test This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead. * threadpool: improve thread sync for new-graphs Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order to keep it efficient, once the new graph is detected we do full fence using read-modify-write with strict memory order. * threadpool: improve abort handling Do not use threadpool->ec (exit code) to decide whether to exit the compute loop. threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it. Instead introduce atomic threadpool->abort flag used for this. This is consistent with how we handle threadpool->stop or pause. While at it add an explicit atomic_load for n_threads_cur for consistency. * test-barrier: release threadpool before releasing the context fixes use-after-free detected by gcc thread-sanitizer on x86-64 for some reason llvm sanitizer is not detecting this issue.	2024-09-24 19:45:08 +03:00
Michael Podvitskiy	195afd6dc1	ggml : link MATH_LIBRARY not by its full path (llama/9339)	2024-09-24 19:45:08 +03:00
Georgi Gerganov	1fd78999e8	cmake : do not hide GGML options + rename option (llama/9465) * cmake : do not hide GGML options ggml-ci * build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS for consistency ggml-ci	2024-09-24 19:45:08 +03:00
Eve	374e9e0c5e	ggml : IQ4_NL sgemm + Q4_0 AVX optimization (llama/9422) * squashed readd my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049 have ggml_vec_dot_q4_0 do two blocks per loop for avx try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue * shuffle * remove f16c iq4_nl as i cant make it faster than before	2024-09-24 19:45:08 +03:00
Georgi Gerganov	a2cb5b4183	metal : handle zero-sized allocs (llama/9466)	2024-09-24 19:45:08 +03:00
Georgi Gerganov	288ae5176e	common : reimplement logging (llama/9418) https://github.com/ggerganov/llama.cpp/pull/9418	2024-09-24 19:45:08 +03:00
Michael Podvitskiy	d868122a5a	cmake : correct order of sycl flags (llama/9497)	2024-09-24 19:45:08 +03:00
Michael Podvitskiy	2ba25fb122	cmake : try to fix sycl+intel build (llama/9487)	2024-09-24 19:45:08 +03:00
Yuri Khrustalev	4f4687cb74	ggml : ggml_type_name return "NONE" for invalid values (llama/9458) When running on Windows, the quantization utility attempts to print the types that are not set which leads to a crash.	2024-09-24 19:45:08 +03:00
Georgi Gerganov	66b00fad0d	cmake : use list(APPEND ...) instead of set() + dedup linker (llama/9463) * cmake : use list(APPEND ...) instead of set() + dedup linker ggml-ci * cmake : try fix sycl * cmake : try to fix sycl 2 * cmake : fix sycl build (llama/9469) * try fix sycl build * use CMAKE_CXX_FLAGS as a string variable --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * one more CMAKE_CXX_FLAGS fix (llama/9471) --------- Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>	2024-09-24 19:45:08 +03:00
Dou Xinpeng	c6cc8d16c3	cann: Add host buffer type for Ascend NPU (llama/9406) * feat: Add host buffer type for Ascend NPU(CANN backend) * fix some checking errors * Add a few comments	2024-09-24 19:45:08 +03:00
Ahmad Tameem	3f8f8a78a2	riscv : modify Makefile and add a RISCV_VECT to print log info (llama/9442) - Added ggml_cpu_has_riscv_v() in GGML to print system info in log - Modified Makefile to only use flag when cross compiling for RISC-V	2024-09-24 19:45:08 +03:00
Xinpeng Dou	3e47686919	cann: Fix error when running a non-exist op (llama/9424)	2024-09-24 19:45:08 +03:00
Johannes Gäßler	a53b69a003	CUDA: fix --split-mode row race condition (llama/9413)	2024-09-24 19:45:08 +03:00

1871 Commits
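
Note on commit 08e8414f27 (ggml_barrier scaling): the message says n_barrier and n_barrier_passed should not share a cache line. The following is a minimal C11 sketch of that general idea, not the actual ggml code. The struct name `barrier_state`, the helper `on_separate_cache_lines`, and the 64-byte `CACHE_LINE_SIZE` are assumptions made for illustration; only the two counter names come from the commit message.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, common on x86-64 and many AArch64 cores. */
#define CACHE_LINE_SIZE 64

/* Hypothetical layout, not the real ggml threadpool struct.
 * Each counter is aligned to its own cache line, so threads that
 * increment one counter do not invalidate the line that other
 * threads are polling for the second counter. */
struct barrier_state {
    alignas(CACHE_LINE_SIZE) atomic_int n_barrier;
    alignas(CACHE_LINE_SIZE) atomic_int n_barrier_passed;
};

/* Returns 1 if the two counters start in different cache lines. */
static int on_separate_cache_lines(void) {
    size_t line_a = offsetof(struct barrier_state, n_barrier) / CACHE_LINE_SIZE;
    size_t line_b = offsetof(struct barrier_state, n_barrier_passed) / CACHE_LINE_SIZE;
    return line_a != line_b;
}
```

The cost of this layout is extra padding (the struct grows to at least two cache lines), which is usually acceptable for a single shared synchronization object.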