whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-12-21 13:37:47 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	8fb5c6a409	backend : add eval callback (llama/4935) * backend : add eval callback ggml-ci * backend : group nodes in a single compute when user don't need them * backend : clean-up the implementation ggml-ci * simple : do not perform tensor data copy if not needed * simple : fix * simple : no need for ggml_is_contiguous + fix bool parse * llama : fix callback placement in llama_context_params * backend : avoid double-ask callback calls * simple : restore examples, imatrix will serve as a demo	2024-01-17 21:21:10 +02:00
Georgi Gerganov	2fe5fbfcc2	metal : create autorelease pool during library build (llama/4970) * metal : create autorelease pool during library build ggml-ci * test : simplify ggml-ci	2024-01-17 21:21:10 +02:00
Kawrakow	01637e1a4c	ggml : importance matrix support for legacy quants (llama/4969) * imatrix: adding support for legacy quants * imatrix: guard Q4_0/Q5_0 against ffn_down craziness --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-17 21:21:10 +02:00
Alex Azarov	1b349eb1f9	metal : log `recommendedMaxWorkingSetSize` on iOS 16+ (llama/4936) * metal: Log `recommendedMaxWorkingSetSize` on iOS 16+ * Only log on iOS and macOS, ignoring tvOS and other platforms * Check for Xcode version before using recommendedMaxWorkingSetSize --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-17 21:21:10 +02:00
Justine Tunney	138eaebead	ggml : introduce GGML_CALL function annotation (llama/4850) This change makes it possible to build ggml-cuda.cu and ggml-metal.m as independent dynamic shared objects, that may be conditionally linked at runtime in a multiplatform binary. It introduces a GGML_CALL annotation that documents which functions have a cyclic call relationship, between the application code and GPU modules. This change does nothing, unless the build defines -DGGML_MULTIPLATFORM which causes back-references and function pointers to conform to MS ABI which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms	2024-01-17 21:21:09 +02:00
Georgi Gerganov	61b9192f27	cuda : fix dequantize kernel names (llama/4938)	2024-01-17 21:21:09 +02:00
Kawrakow	161b51d91a	CUDA: faster dequantize kernels for Q4_0 and Q4_1 (llama/4938) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-17 21:21:09 +02:00
Kawrakow	f904b31a7d	Add ability to use importance matrix for all k-quants (llama/4930) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-17 21:21:09 +02:00
Benjamin Heiniger	f6614155e4	talk-llama : optional wake-up command and audio confirmation (#1765) * talk-llama: add optional wake-word detection from command * talk-llama: add optional audio confirmation before generating answer * talk-llama: fix small formatting issue in output * talk-llama.cpp: fix Windows build	2024-01-16 15:52:01 +02:00
Przemysław Pawełczyk	f5f159c320	server : fix building and simplify lib deps on Windows (#1772) * make : fix server example building on MSYS2 environments (Windows) It was not working since commit `eff3570f78` when server was introduced. * cmake : simplify server example lib deps on Windows server uses httplib::Server, not httplib::SSLServer, so there is no need to mention cryptographic libraries in target_link_libraries. Winsock (ws2_32) suffices here. Also use plain library names like we use in other places.	2024-01-15 15:48:13 +02:00
Georgi Gerganov	6ebba525f1	talk-llama : sync llama.cpp	2024-01-14 18:08:20 +02:00
Georgi Gerganov	2a5874441d	talk-llama : llama.cpp	2024-01-14 11:06:28 +02:00
Georgi Gerganov	d08445c9ad	sync : ggml	2024-01-14 10:55:18 +02:00
Alex Azarov	4a945696cb	metal : correctly set SIMD support flags on iOS (llama/4923) * Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad * log a little bit more info on iOS	2024-01-14 10:54:09 +02:00
Kawrakow	dabc964d83	2-bit quantizations (llama/4897) * imatrix: load * imatrix: WIP * imatrix: Add Q2_K quantization * imatrix: also guard against Q2_K_S quantization without importance matrix * imatrix: guard even more against low-bit quantization misuse --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-14 10:54:09 +02:00
Georgi Gerganov	654baf693d	scripts : sync-ggml-am.sh add option to skip commits	2024-01-14 10:53:19 +02:00
Georgi Gerganov	f001a3b7b6	talk-llama : sync llama.cpp	2024-01-14 00:13:17 +02:00
Georgi Gerganov	c615f2c335	sync : ggml	2024-01-14 00:12:17 +02:00
Georgi Gerganov	d839dd0242	examples : adapt to metal API	2024-01-14 00:11:45 +02:00
Johannes Gäßler	435847891c	ggml: cache sin/cos for RoPE (llama/4908)	2024-01-14 00:11:45 +02:00
Georgi Gerganov	182f290808	metal : remove old API (llama/4919) ggml-ci	2024-01-14 00:11:45 +02:00
Georgi Gerganov	447dfc11fc	metal : disable log for loaded kernels (llama/4794)	2024-01-14 00:11:45 +02:00
texmex76	9aa9f3b84e	gguf : fix potential infinite for-loop (llama/4600) Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>	2024-01-14 00:11:44 +02:00
Georgi Gerganov	396ebd1e80	metal : refactor kernel loading code (llama/4794) * metal : detect more GPU families * metal : refactor kernel loading * metal : set kernel family requirements * metal : fix kernel init + fix compile options * metal : take into account simdgroup reduction support * metal : print only skipped kernels * metal : fix check for simdgroup reduction support * metal : check for Metal 3 * metal : free allocations * metal : normalize encoder:setComputePipelineStatus calls ggml-ci * metal : fix Metal3 family check ggml-ci * metal : check for simdgroup matrix mul. feature ggml-ci	2024-01-14 00:11:44 +02:00
Johannes Gäßler	12490f4398	CUDA: faster q8_0 -> f16 dequantization (llama/4895)	2024-01-14 00:11:44 +02:00
RhinoDevel	db078a9ba8	talk-llama : add optional CLI arg to set the bot name (#1764)	2024-01-13 20:51:35 +02:00
james wolf	a13a7da5ad	examples : add python example for transcription (#1744) * rebase and add simple python interface * moved python files to examples/python	2024-01-13 19:37:18 +02:00
Georgi Gerganov	519f8e8684	whisper : load the model into multiple buffers of max size 1GB (#1763)	2024-01-13 17:47:40 +02:00
Georgi Gerganov	40ae0962f4	talk-llama : sync llama.cpp	2024-01-12 22:04:51 +02:00
Georgi Gerganov	1560288048	sync : ggml	2024-01-12 21:56:50 +02:00
slaren	1ad6fafd91	backend_sched : fix assignments ggml-ci	2024-01-12 21:55:42 +02:00
slaren	70840aed5f	llama : ggml-backend integration (llama/4766) * llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (llama/4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (llama/4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-01-12 21:55:42 +02:00
Johannes Gäßler	b24d18feb9	CUDA: fix softmax compile for old CUDA versions (llama/4862)	2024-01-12 21:55:41 +02:00
Kawrakow	3fa98f4395	Importance Matrix calculation (llama/4861) * imatrix: 1st version * imatrix: WIP * Cleanup * Update examples/imatrix/imatrix.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-12 21:55:41 +02:00
Sơn Phan Trung	d05b7ee90e	models : make all scripts to be POSIX Compliant (#1725) * download-coreml-model: make it POSIX-compliant * download-ggml-model: posix compliant (2nd) * minor edit * forgot to add newline * generate-coreml-interface: far more straightforward * generate-coreml-model: done with the posix thingy * typo * Update download-ggml-model.sh * fix * fix typo * another fix * Update download-coreml-model.sh * Update download-ggml-model.sh * Update download-coreml-model.sh	2024-01-12 14:11:04 +02:00
Georgi Gerganov	6dcee35129	ggml : fix 32-bit ARM compat for IQ2_XS (#1758) * ggml : fix 32-bit ARM compat * ggml : fix fix * ggml : fix fix fix	2024-01-12 14:02:30 +02:00
Boris Bliznioukov	5cb345f5e9	go : add SetInitialPrompt method to bindings (#1753)	2024-01-12 13:44:50 +02:00
George Hindle	fbcb52d3cd	server : add more parameters to server api (#1754) * feat(server): add more parameters to server api * fix(server): reset params to original parsed values for each request	2024-01-12 13:42:52 +02:00
Georgi Gerganov	6b01e3fedd	whisper : fix segment length with params.no_timestamps == true	2024-01-12 13:37:38 +02:00
George Hindle	f7908f9bb8	params : don't compute timestamps when not printing them (#1755)	2024-01-12 13:24:38 +02:00
Georgi Gerganov	00b7a4be02	talk-llama : sync llama.cpp	2024-01-11 22:10:10 +02:00
Georgi Gerganov	04b0a768b8	swift : remove local ggml.h reference	2024-01-11 22:00:12 +02:00
Georgi Gerganov	87670425f2	swift : track ggml release branch	2024-01-11 21:57:40 +02:00
Georgi Gerganov	32e71a1861	sync : ggml	2024-01-11 21:54:17 +02:00
Georgi Gerganov	9c857cf280	sync : llama.cpp	2024-01-11 21:50:01 +02:00
Kawrakow	97b12212dd	ggml : SOTA 2-bit quants (add IQ2_XS) (llama/4856) * iq2_xs: basics * iq2_xs: this should have been in the basics * iq2_xs: CUDA and scalar CPU works * iq2_xs: WIP Metal * iq2_xs: Metal now works * iq2_xs: working, but dog slow, ARM_NEON dot product * iq2_xs: better ARM_NEON dot product We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU. * iq2_xs: AVX2 dot product - 19.5 t/s * iq2_xs: faster AVX2 dit product 21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version. * iq2_xs: had forgotten to delete iq2-data.h * Add llama enum for IQ2_XS --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:50:01 +02:00
Paul Tsochantaris	9fa34d79ec	metal : put encoder debug group behind a define (llama/4873)	2024-01-11 21:50:01 +02:00
Georgi Gerganov	a0a64a19dd	metal : improve dequantize precision to match CPU (llama/4836) ggml-ci	2024-01-11 21:50:01 +02:00
Georgi Gerganov	bbc23611fa	ggml : fix vld1q_s8_x4 32-bit compat (llama/4828) * ggml : fix vld1q_s8_x4 32-bit compat ggml-ci * ggml : fix 32-bit ARM compat (cont) ggml-ci	2024-01-11 21:50:01 +02:00
Johannes Gäßler	e9783a1fb4	CUDA: faster softmax via shared memory + fp16 math (llama/4742)	2024-01-11 21:50:01 +02:00

1085 Commits