whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-05-22 02:07:56 +00:00

Author	SHA1	Message	Date
slashlib	956ef860bc	cmake : support for CPU BLAS build via Intel MKL (#2024 )	2024-04-09 18:32:46 +03:00
ulatekh	671b4bde6c	main : allow a response-file as the sole parameter (#2019 ) * The "main" example now allows a response-file as the sole parameter. A response-file is a text file with command-line parameters, one per line. Prefix the name of the response-file with "@" to identify it as such. It's used under MS Windows to work around command-line length limits. It may be useful under other platforms to simplify character-escaping. * minor : style --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-09 18:31:16 +03:00
ulatekh	c8eeb93a6a	whisper : suppress tokens with a regex (#1997 ) * Allow a regular expression to describe tokens to suppress. Example: --suppress-tokens-re "[,\.]\|[ ]?[0-9]+" will suppress commas, periods, and numeric tokens. Technique inspired by https://github.com/openai/whisper/discussions/1041 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Blind change to fix Java test. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-09 18:27:28 +03:00
ulatekh	319fe5146e	cmake : create solution folders (#2004 ) * Create solution folders in the CMake build. * Fixed non-SDL2 build. * Fixed emscripten build.	2024-04-09 18:23:33 +03:00
Georgi Gerganov	13c22321d1	sync : ggml	2024-04-07 17:04:56 +03:00
Georgi Gerganov	ccbe9d5676	extra : sync grammar-parser	2024-04-07 17:04:22 +03:00
Georgi Gerganov	81a3c41aa0	talk-llama : sync llama.cpp	2024-04-07 16:21:08 +03:00
Georgi Gerganov	a50207c65d	sync : ggml	2024-04-07 16:18:11 +03:00
Georgi Gerganov	97878e53fd	sync : llama.cpp (skip) ggml-ci	2024-04-07 16:15:57 +03:00
Ouadie EL FAROUKI	61b05815e0	Fixed minor bug when enabling FP16 for non intel targets (llama/6464) * moved INTEL_MKL guard from gemm_impl to gemm (wrapper) * Update ggml-sycl.cpp Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com> --------- Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>	2024-04-07 16:15:57 +03:00
slaren	1dce94cf26	ggml : mul_mat_id use the same tensor for all the experts (llama/6387) * ggml : update mul_mat_id to use the same tensor for all the experts * update cuda * minor * update metal * update test-backend-ops * fix cuda * Update ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update convert.py * update convert-hf-to-gguf.py * update convert.py for mixtral hf models * Update convert-hf-to-gguf.py Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * cuda : support non-pow-2 number of experts * allow quantize to work for split and merged experts models in the same way * cleanup + disable mmap automatically with split tensors models * update imatrix * test-backend-ops : test qwen argsort * update grok model loading * llama : add merged experts tensors to the grok tensor map * minor * gguf : bump version * fix quantizing of merged experts * convert-hf-to-gguf.py : update grok (untested) * make linter happy * cuda/argsort : use shared memory instead of pool memory * convert : fix grok tensor names * metal : add support for non-pow-2 argsort * llama : more loader cleanup, better error checking * cuda : fix warning * llama : still use mmap for loading old models, but copy the data to a host buffer * add review note * llama : remove ffn tensor counting + add sanity check ggml-ci * convert : fix handling of n_experts == None ggml-ci * imatrix : fix ncall counters * llama : produce error if imatrix size does not match * quantize : terminate on errors + trace logs ggml-ci * metal : pad shared memory to 16 bytes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-07 16:15:57 +03:00
Meng, Hengyu	f12e982c0b	Disable iqx on windows as WA (llama/6435) * disable iqx on windows as WA * array instead of global_memory	2024-04-07 16:15:57 +03:00
0cc4m	fa966b9b40	Vulkan k-quant mmq and ggml-backend offload functionality (llama/6155) * Fix Vulkan no kv offload incoherence * Add k-quant mul mat mat shaders * Rework working buffer allocation, reduces vram use noticeably Clean up cpu assist code, replaced with ggml-backend offload function * Default to all dedicated GPUs * Add fallback for integrated GPUs if no dedicated GPUs are found * Add debug info which device is allocating memory * Fix Intel dequant issue Fix validation issue * Fix Vulkan GGML_OP_GET_ROWS implementation * Clean up merge artifacts * Remove Vulkan warning	2024-04-07 16:15:57 +03:00
Neo Zhang Jianyu	b83a9fc9d3	fix set main gpu crash (llama/6339)	2024-04-07 16:15:56 +03:00
slaren	3adbf2fb03	ggml : fix bounds checking of zero size views (llama/6347)	2024-04-07 16:15:56 +03:00
Daniel Bevenius	700d146127	backend : fix typo in scheduler documentation (ggml/781) Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-04-07 16:15:56 +03:00
Georgi Gerganov	a74fde9b4c	extra : sync ggml-cuda folder	2024-04-07 16:10:44 +03:00
Slava Primenko	1d7657f409	ggml: bypass code incompatible with CUDA < 11.1 (#2020 ) `cudaHostRegisterReadOnly` parameter was only introduced in CUDA 11.1 See this issue for more details: https://github.com/ggerganov/whisper.cpp/issues/2007	2024-04-04 14:49:24 +02:00
Przemysław Pawełczyk	ac283dbce7	ci : add building in MSYS2 environments (Windows) (#1994 )	2024-03-30 09:20:20 +02:00
Przemysław Pawełczyk	1e8f28c42a	build : use pkg-config for OpenBLAS (#1778 ) * make : use pkg-config for finding CFLAGS & LDFLAGS needed by OpenBLAS That way building on nix like environments (including MSYS2 on Windows) with WHISPER_OPENBLAS=1 works out of the box. Fix handling of WHISPER_OPENBLAS, so that empty value or 0 won't be misinterpreted by make as enabled. Mind that it's not intended to detect CMake false constants (OFF NO FALSE N). make is not CMake. By default OpenBLAS with 64-bit interface is used, but that can be changed with `WHISPER_OPENBLAS_INTERFACE64=0` if 32-bit one is desired. If OpenBLAS headers and library are respectively in include/ and lib/ subdirectories of given path, then you can specify it, e.g. `OPENBLAS_PATH=/usr/local/openblas`, and this will take precedence over any pkg-config file. If there is no pkg-config file (.pc) for OpenBLAS and OPENBLAS_PATH is empty, then headers are assumed to be in /usr/include/openblas and library as assumed to be called 'openblas64' (or 'openblas' if `WHISPER_OPENBLAS_INTERFACE64=0`). If different headers location should be used, then it can be done, e.g. `WHISPER_BLAS_CFLAGS=-I/usr/local/include/openblas`. If different library should be used, it can be specified, e.g. `WHISPER_BLAS_LIB=openblasp64` (pthreads version as seen on Fedora), or you can provide LDFLAGS needed to link with OpenBLAS directly: `WHISPER_BLAS_LDFLAGS="-L/usr/local/lib/openblas -lopenblas64"`. Current solution is flexible enough to handle most cases out there without needlessly hardcoding possible OpenBLAS installation details. cmake : fix how pkg-config is used for finding include dirs and libraries needed by OpenBLAS That way building on nix like environments (including MSYS2 on Windows) with -DWHISPER_OPENBLAS=ON should work out of the box as long as you have CMake 3.25 or newer. Make OPENBLAS_PATH environment variable supported not only on Windows. It sets OpenBLAS include dir to ${OPENBLAS_PATH}/include and library to ${WHISPER_BLAS_LIB} (name without prefixes and suffixes) in ${OPENBLAS_PATH}/lib and avoids further package finding. By default OpenBLAS with 64-bit interface is used (equivalent to setting `-DWHISPER_BLAS_LIB=openblas64`), but that can be changed with `-DWHISPER_OPENBLAS_INTERFACE64=OFF` (equivalent to setting `-DWHISPER_BLAS_LIB=openblas`) if 32-bit one is desired. Turn on BLA_STATIC for FindBLAS only when WHISPER_STATIC is enabled. BLA_STATIC may not work as expected for pkg-config based operation. Get rid of supporting BLAS_HOME environment variable. If OPENBLAS_PATH is insufficient in your case, there is no pkg-config file to rely on, then you can manually specify include dir, e.g. `-DBLAS_INCLUDE_DIRS=/usr/local/include/openblas`, and library, e.g. `-DBLAS_LIBRARIES=/usr/local/lib/libopenblas.so`. make / cmake : use OpenBLAS with 32-bit interface by default. OpenBLAS w/o INTERFACE64=1 vel USE_64BITINT=1 seems to be more common. * cmake : hardcode "lib" prefix for OpenBLAS lib filename (even on Windows) * cmake : hardcode OpenBLAS library name when building in MSVC (Windows) Most *nix like environments (including MSYS2 on Windows) have OpenBLAS packages that allow coexistence of OpenBLAS builds with 32-bit and 64-bit interface (w/o and w/ OPENBLAS_USE64BITINT defined) and they differ by not having or having "64" suffix in their library filenames. That's not the case for OpenBLAS prebuilt libraries for Windows.	2024-03-29 15:53:26 +02:00
ulatekh	fc366b807a	main : add command-style grammar (#1998 ) * Implemented command-style grammar in the main example. Mostly just copied the relevant parts from the command example. * main : code style --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-28 12:02:10 +02:00
Georgi Gerganov	9fb308d90f	make : add grammar parser to common objects	2024-03-28 11:59:48 +02:00
Georgi Gerganov	2948c740a2	sync : ggml (#2001 ) * sync : update scripts * sync : ggml * talk-llama : sync llama.cpp * make : WHISPER_CUBLAS -> WHISPER_CUDA * ci : try to fix sycl build * talk-llama : fix make build	2024-03-27 18:55:10 +02:00
Georgi Gerganov	1558ec5a16	whisper : improve handling of prompts (#1981 ) * whisper : improve handling of prompts * whisper : add whisper_token_count helper	2024-03-25 14:48:19 +02:00
Sanchit Gandhi	fff24a0148	whisper : improve support for distil-large-v3 (#1982 )	2024-03-21 18:53:30 +02:00
Georgi Gerganov	48a145207e	ruby : fix build (#1980 )	2024-03-21 07:40:09 +02:00
Tiago Fassoni	79d5765e7e	docker : libcuda.so.1 in PATH (#1966 )	2024-03-20 18:45:15 +02:00
Mohammadreza Hendiani	04e48094e4	readme : add Fedora dependencies (#1970 ) * README.md fix documentaion and added fedora liunx dependencies for stream build * fix documentaion and added fedora liunx dependencies for command build * fix documentaion and added fedora liunx dependencies for talk build * fix documentaion and added fedora liunx dependencies for talk-llama build * reverted back mistakenly removed MacOS documentaion	2024-03-20 18:42:11 +02:00
denersc	741abb162c	whisper : token-level timestamps with DTW (#1485 ) * whisper.cpp: impl dtw algo * WIP: producing and placing DTW timestamps on tokens * Fix compile and assertion errors. Attempt to DTW timestamp with single_segment=false. * Fix mistake causing incorrect alignment of dtw timestamps * implement N_TOP_MOST and CUSTOM alignment heads setting * whisper: fix typo on alignment heads enum * Fix issues related to changes in whisper.cpp * Fixed excessive memory use when using DTW timestamps. Other minor fixes to DTW timestamping function * decoder: save cross QKs only if requested * Calling median filter with ggml_map_custom1 * Reimpl aheads n_top_most and custom. Sanity checks on chosen aheads * Copying cross QKs from decoder backend correctly * dtw: cleanup * Fix incorrect n_frames passed to dtw when near end of audio * Fix aheads_masks_init for backend != CPU * whisper : minor style * main : add dtw (wip) * whisper: fix invalid memory access in aheads_masks_init * main : add dtw (cont) * whisper : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-20 18:25:26 +02:00
Jo Liss	e7794a868f	examples : rename --audio-context to --audio-ctx per help text (#1953 )	2024-03-18 17:53:33 +02:00
Georgi Gerganov	725350d4ea	whisper : set outputs from conv graph (#1959 )	2024-03-16 17:30:55 +02:00
slaren	906c73b219	alloc : fix allocation data of pre-allocated leafs	2024-03-16 17:15:45 +02:00
Georgi Gerganov	00d80ff965	cmake : copy ggml-common.h to bin	2024-03-16 17:15:44 +02:00
Georgi Gerganov	1b553b9817	gitignore : .vimspector.json	2024-03-16 16:26:35 +02:00
Georgi Gerganov	de4d067f1e	talk-llama : sync llama.cpp	2024-03-15 14:21:59 +02:00
Georgi Gerganov	e715f6a601	sync : ggml	2024-03-15 14:12:19 +02:00
slaren	f60ccfd83b	update examples and tests	2024-03-15 14:01:14 +02:00
Georgi Gerganov	3753a2b2a8	ggml : add ggml-common.h	2024-03-15 14:01:14 +02:00
Georgi Gerganov	592dd25615	ggml : designate enum vals for integer types (llama/6050)	2024-03-15 14:01:14 +02:00
Georgi Gerganov	c8709d4604	metal : build metallib + fix embed path (llama/6015) * metal : build metallib + fix embed path ggml-ci * metal : fix embed build + update library load logic ggml-ci * metal : fix embeded library build ggml-ci * ci : fix iOS builds to use embedded library	2024-03-15 14:01:14 +02:00
slaren	8932c2d6ce	llama : add pipeline parallelism support (llama/6017) * llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs ggml-ci * server : add -ub, --ubatch-size parameter * fix server embedding test * llama : fix Mamba inference for pipeline parallelism Tested to work correctly with both `main` and `parallel` examples. * llama : limit max batch size to n_batch * add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism default increase to 4 (from 2) changing this value may improve performance for some systems, but increases memory usage * fix hip build * fix sycl build (disable cpy_tensor_async) * fix hip build * llama : limit n_batch and n_ubatch to n_ctx during context creation * llama : fix norm backend * batched-bench : sync after decode * swiftui : sync after decode * ggml : allow ggml_get_rows to use multiple threads if they are available * check n_ubatch >= n_tokens with non-casual attention * llama : do not limit n_batch to n_ctx with non-casual attn * server : construct batch with size of llama_n_batch * ggml_backend_cpu_graph_compute : fix return value when alloc fails * llama : better n_batch and n_ubatch comment * fix merge * small fix * reduce default n_batch to 2048 --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-15 14:01:13 +02:00
AidanBeltonS	2bddfdd7c8	Update get version (llama/6025)	2024-03-15 14:01:13 +02:00
Georgi Gerganov	46e3c3f112	ggml : reuse quantum structs across backends (llama/5943) * ggml : reuse quant blocks across backends ggml-ci * ggml : define helper constants only for CUDA and SYCL ggml-ci * ggml : define helper quantum constants for SYCL ggml-ci	2024-03-15 14:01:13 +02:00
Georgi Gerganov	ef24ae0c7d	ggml : fix UB in IQ2_S and IQ3_S (llama/6012)	2024-03-15 14:01:13 +02:00
Georgi Gerganov	a753926f02	sycl : update IQ1_S kernels (WIP - not working!) (llama/5995) * sycl : try to fix after IQ1_S changes * sycl : iq1s_grid -> iq1s_grid_gpu * sycl : fix grid type	2024-03-15 14:01:13 +02:00
Kawrakow	9dc60fc02d	1.5 bit: we can do even better (llama/5999) * iq1_s: we can do even better Spent one of the 4 scale bits on a signs of a 0.125 shift. I.e., quants are now -1 + delta, delta, 1 + delta, where delta is +/- 0.125. CUDA works, same performance as before. PPL(LLaMA-v2-7B) is now 11.85! * iq1_s: make scalar and AVX2 work with the new version * iq1_s: make Neon work with new version. ~10% drop in performance, so will need some more work. * iq1_s: make Metal work with new version * iq1_s: very slightly faster dequantize on Metal * iq1_s: fix dequantize on the CPU --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-15 14:01:13 +02:00
Michael Podvitskiy	d73a63629e	ggml, ci : Windows ARM runner and build fixes (llama/5979) * windows arm ci * fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64 * fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned` * fix `error C2065: '__fp16': undeclared identifier`	2024-03-15 14:01:13 +02:00
Kawrakow	f79d0d4f74	Better 1.5 bit quantization (llama/5971) * Trying blocvks of 16 for IQ1_S - seems slightly better * iq1s_blocks16: Adjust scale fudge factor to 1.125 * iq1s_blocks16: going to blocks of 32 with 2048 lattice points, so same bpw. This is even better than blocks of 16. Should I try blocks of 64? But to keep the same bpw, when I go to 4096 lattice points, I need to remove blocks alltogether and just have superblocks of 256 weights. * iq1s_blocks16: Use 2<x^2> as sigma2 in weight adjustment iq1s_blocks16: scalar and AVX2 dot products * iq1s_blocks16: CUDA dot product * iq1s_blocks16: Metal works, Neon does not Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s). Not seeing the bug in the Neon implementation for now. * iq1s_blocks16: fixed Neon * iq1s_blocks16: very slightly faster TG on Metal Still pathetic at 37 t/s * iq1s_blocks16: speedup Metal by packing codebook into uint32_t's * Formatting * iq1s_blocks16: uint32_t codebook is also better in CUDA TG-128 is now 204 t/s up from 194 t/s. PP-512 is 5890 t/s, so significantly better than other quants * iq1s_blocks16: slightly faster Neon dot product * iq1s_blocks16: faster AVX2 dot product * iq1s_blocks16: adjust to ggml-common.h --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-15 14:01:12 +02:00
Abhilash Majumder	4f88940ff6	Add q3_s and q1_s (llama/5886) * Add q3_s and q1_s * fix compilation * fix build * fix build * fix build * enable ops * rm macro * increase grid space	2024-03-15 14:01:12 +02:00
Georgi Gerganov	7bdb1de9ec	metal : move mm_id indices to shared mem (llama/5982)	2024-03-15 14:01:12 +02:00

1 2 3 4 5 ...

1184 Commits