whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-03-10 14:34:01 +00:00

Author	SHA1	Message	Date
Benjamin Heiniger	f6614155e4	talk-llama : optional wake-up command and audio confirmation (#1765) * talk-llama: add optional wake-word detection from command * talk-llama: add optional audio confirmation before generating answer * talk-llama: fix small formatting issue in output * talk-llama.cpp: fix Windows build	2024-01-16 15:52:01 +02:00
Przemysław Pawełczyk	f5f159c320	server : fix building and simplify lib deps on Windows (#1772) * make : fix server example building on MSYS2 environments (Windows) It was not working since commit eff3570f78742dfd56024328ed93d4f442434280 when server was introduced. * cmake : simplify server example lib deps on Windows server uses httplib::Server, not httplib::SSLServer, so there is no need to mention cryptographic libraries in target_link_libraries. Winsock (ws2_32) suffices here. Also use plain library names like we use in other places.	2024-01-15 15:48:13 +02:00
Georgi Gerganov	6ebba525f1	talk-llama : sync llama.cpp	2024-01-14 18:08:20 +02:00
Georgi Gerganov	2a5874441d	talk-llama : llama.cpp	2024-01-14 11:06:28 +02:00
Georgi Gerganov	d08445c9ad	sync : ggml	2024-01-14 10:55:18 +02:00
Alex Azarov	4a945696cb	metal : correctly set SIMD support flags on iOS (llama/4923) * Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad * log a little bit more info on iOS	2024-01-14 10:54:09 +02:00
Kawrakow	dabc964d83	2-bit quantizations (llama/4897) * imatrix: load * imatrix: WIP * imatrix: Add Q2_K quantization * imatrix: also guard against Q2_K_S quantization without importance matrix * imatrix: guard even more against low-bit quantization misuse --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-14 10:54:09 +02:00
Georgi Gerganov	654baf693d	scripts : sync-ggml-am.sh add option to skip commits	2024-01-14 10:53:19 +02:00
Georgi Gerganov	f001a3b7b6	talk-llama : sync llama.cpp	2024-01-14 00:13:17 +02:00
Georgi Gerganov	c615f2c335	sync : ggml	2024-01-14 00:12:17 +02:00
Georgi Gerganov	d839dd0242	examples : adapt to metal API	2024-01-14 00:11:45 +02:00
Johannes Gäßler	435847891c	ggml: cache sin/cos for RoPE (llama/4908)	2024-01-14 00:11:45 +02:00
Georgi Gerganov	182f290808	metal : remove old API (llama/4919) ggml-ci	2024-01-14 00:11:45 +02:00
Georgi Gerganov	447dfc11fc	metal : disable log for loaded kernels (llama/4794)	2024-01-14 00:11:45 +02:00
texmex76	9aa9f3b84e	gguf : fix potential infinite for-loop (llama/4600) Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>	2024-01-14 00:11:44 +02:00
Georgi Gerganov	396ebd1e80	metal : refactor kernel loading code (llama/4794) * metal : detect more GPU families * metal : refactor kernel loading * metal : set kernel family requirements * metal : fix kernel init + fix compile options * metal : take into account simdgroup reduction support * metal : print only skipped kernels * metal : fix check for simdgroup reduction support * metal : check for Metal 3 * metal : free allocations * metal : normalize encoder:setComputePipelineStatus calls ggml-ci * metal : fix Metal3 family check ggml-ci * metal : check for simdgroup matrix mul. feature ggml-ci	2024-01-14 00:11:44 +02:00
Johannes Gäßler	12490f4398	CUDA: faster q8_0 -> f16 dequantization (llama/4895)	2024-01-14 00:11:44 +02:00
RhinoDevel	db078a9ba8	talk-llama : add optional CLI arg to set the bot name (#1764)	2024-01-13 20:51:35 +02:00
james wolf	a13a7da5ad	examples : add python example for transcription (#1744) * rebase and add simple python interface * moved python files to examples/python	2024-01-13 19:37:18 +02:00
Georgi Gerganov	519f8e8684	whisper : load the model into multiple buffers of max size 1GB (#1763)	2024-01-13 17:47:40 +02:00
Georgi Gerganov	40ae0962f4	talk-llama : sync llama.cpp	2024-01-12 22:04:51 +02:00
Georgi Gerganov	1560288048	sync : ggml	2024-01-12 21:56:50 +02:00
slaren	1ad6fafd91	backend_sched : fix assignments ggml-ci	2024-01-12 21:55:42 +02:00
slaren	70840aed5f	llama : ggml-backend integration (llama/4766) * llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (llama/4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (llama/4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-01-12 21:55:42 +02:00
Johannes Gäßler	b24d18feb9	CUDA: fix softmax compile for old CUDA versions (llama/4862)	2024-01-12 21:55:41 +02:00
Kawrakow	3fa98f4395	Importance Matrix calculation (llama/4861) * imatrix: 1st version * imatrix: WIP * Cleanup * Update examples/imatrix/imatrix.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-12 21:55:41 +02:00
Sơn Phan Trung	d05b7ee90e	models : make all scripts to be POSIX Compliant (#1725) * download-coreml-model: make it POSIX-compliant * download-ggml-model: posix compliant (2nd) * minor edit * forgot to add newline * generate-coreml-interface: far more straightforward * generate-coreml-model: done with the posix thingy * typo * Update download-ggml-model.sh * fix * fix typo * another fix * Update download-coreml-model.sh * Update download-ggml-model.sh * Update download-coreml-model.sh	2024-01-12 14:11:04 +02:00
Georgi Gerganov	6dcee35129	ggml : fix 32-bit ARM compat for IQ2_XS (#1758) * ggml : fix 32-bit ARM compat * ggml : fix fix * ggml : fix fix fix	2024-01-12 14:02:30 +02:00
Boris Bliznioukov	5cb345f5e9	go : add SetInitialPrompt method to bindings (#1753)	2024-01-12 13:44:50 +02:00
George Hindle	fbcb52d3cd	server : add more parameters to server api (#1754) * feat(server): add more parameters to server api * fix(server): reset params to original parsed values for each request	2024-01-12 13:42:52 +02:00
Georgi Gerganov	6b01e3fedd	whisper : fix segment length with params.no_timestamps == true	2024-01-12 13:37:38 +02:00
George Hindle	f7908f9bb8	params : don't compute timestamps when not printing them (#1755)	2024-01-12 13:24:38 +02:00
Georgi Gerganov	00b7a4be02	talk-llama : sync llama.cpp	2024-01-11 22:10:10 +02:00
Georgi Gerganov	04b0a768b8	swift : remove local ggml.h reference	2024-01-11 22:00:12 +02:00
Georgi Gerganov	87670425f2	swift : track ggml release branch	2024-01-11 21:57:40 +02:00
Georgi Gerganov	32e71a1861	sync : ggml	2024-01-11 21:54:17 +02:00
Georgi Gerganov	9c857cf280	sync : llama.cpp	2024-01-11 21:50:01 +02:00
Kawrakow	97b12212dd	ggml : SOTA 2-bit quants (add IQ2_XS) (llama/4856) * iq2_xs: basics * iq2_xs: this should have been in the basics * iq2_xs: CUDA and scalar CPU works * iq2_xs: WIP Metal * iq2_xs: Metal now works * iq2_xs: working, but dog slow, ARM_NEON dot product * iq2_xs: better ARM_NEON dot product We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU. * iq2_xs: AVX2 dot product - 19.5 t/s * iq2_xs: faster AVX2 dit product 21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version. * iq2_xs: had forgotten to delete iq2-data.h * Add llama enum for IQ2_XS --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:50:01 +02:00
Paul Tsochantaris	9fa34d79ec	metal : put encoder debug group behind a define (llama/4873)	2024-01-11 21:50:01 +02:00
Georgi Gerganov	a0a64a19dd	metal : improve dequantize precision to match CPU (llama/4836) ggml-ci	2024-01-11 21:50:01 +02:00
Georgi Gerganov	bbc23611fa	ggml : fix vld1q_s8_x4 32-bit compat (llama/4828) * ggml : fix vld1q_s8_x4 32-bit compat ggml-ci * ggml : fix 32-bit ARM compat (cont) ggml-ci	2024-01-11 21:50:01 +02:00
Johannes Gäßler	e9783a1fb4	CUDA: faster softmax via shared memory + fp16 math (llama/4742)	2024-01-11 21:50:01 +02:00
Georgi Gerganov	9e0cc28792	metal : fix deprecation warning (ggml/690)	2024-01-11 21:50:00 +02:00
Timothy Cronin	73072a7c73	ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693)	2024-01-11 21:50:00 +02:00
Jack Mousseau	a8ba1262ff	metal : wrap each operation in debug group (ggml/690)	2024-01-11 21:50:00 +02:00
leejet	e66a9a7806	ggml : change GGML_MAX_NAME at compile time (ggml/682) * change GGML_MAX_NAME to 128 * allow controlling the value of GGML_MAX_NAME through external macro definitions	2024-01-11 21:50:00 +02:00
Halalaluyafail3	338442d773	Fix execlp call (ggml/689) NULL can be an integer constant expression with the value zero, in this case the behavior would be undefined because of an incorrect type being passed to the variable arguments.	2024-01-11 21:50:00 +02:00
Kawrakow	10651bddf6	SOTA 2-bit quants (llama/4773) * iq2_xxs: basics * iq2_xxs: scalar and AVX2 dot products Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX implementation becomes very awkward. The alternative would have been to use Q8_0 instead. Perhaps I'll change later, for now this is what we have. * iq2_xxs: ARM_NEON dot product Somehow strangely slow (112 ms/token). * iq2_xxs: WIP Metal Dequantize works, something is still wrong with the dot product. * iq2_xxs: Metal dot product now works We have PP-512 = 475 t/s TG-128 = 47.3 t/s Not the greatest performance, but not complete garbage either. * iq2_xxs: slighty faster dot product TG-128 is now 48.4 t/s * iq2_xxs: slighty faster dot product TG-128 is now 50.9 t/s * iq2_xxs: even faster Metal dot product TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared memory has a bigger impact than the grid values being in shared memory. * iq2_xxs: dequantize CUDA kernel - fix conflict with master * iq2_xxs: quantized CUDA dot product (MMVQ) We get TG-128 = 153.1 t/s * iq2_xxs: slightly faster CUDA dot product TG-128 is now at 155.1 t/s. * iq2_xxs: add to llama ftype enum * iq2_xxs: fix MoE on Metal * Fix missing MMQ ops when on hipBLAS I had put the ggml_supports_mmq call at the wrong place. * Fix bug in qequantize_row_iq2_xxs The 0.25f factor was missing. Great detective work by @ggerganov! * Fixing tests * PR suggestion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:50:00 +02:00
Johannes Gäßler	53d4d0b30d	CUDA: fixed redundant value dequantization (llama/4809)	2024-01-11 21:50:00 +02:00
Konstantin Zhuravlyov	2865e4710b	ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (llama/4787)	2024-01-11 21:50:00 +02:00

927 Commits