whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-12-21 21:47:47 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	3170841ed9	talk-llama : sync llama.cpp	2024-02-25 20:00:10 +02:00
Georgi Gerganov	7a6e385c1b	sync : ggml	2024-02-25 19:59:34 +02:00
Georgi Gerganov	578e47e70c	sync : llama.cpp (ggml/0)	2024-02-25 19:58:46 +02:00
Georgi Gerganov	fac5b43830	code : normalize enum names (llama/5697) * coda : normalize enum names ggml-ci * code : cont * code : cont	2024-02-25 19:58:46 +02:00
Kawrakow	9e7c5212a1	IQ3_S: a much better alternative to Q3_K (llama/5676) * iq4_nl: squash commits for easier rebase * Basics (quantize, dequantize) * CUDA dequantize and dot product * Slightly faster CUDA dot product (120 t/s) * Switch to 6-bit scales * Scalar dot product * AVX2 dot product * ARM_NEON dot product * Works on metal, but still slow * Slightly better Metal dot product * Another small Metal improvement * Metal dot product is getting there * Faster CUDA dot product * Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided * Report the actual bpw * Add _xs mix that is 4.05 bpw for non-MoE models * Remove IQ4_XS for now, slightly adjust kvalues_iq4nl * AVX2 dot product uses Q8_0 instead of Q8_K * Add to test-backend-ops * Minor fix * Also use use Q5_K for attn_output in MoE models * Fixes after merging latest master * Switching to blocks of 32 * AVX2 for blocks of 32 * Scaler dot product for blocks of 32 * ARM_NEON dot product for blocks of 32 * Metal kernels for blocks of 32 * Slightly faster Metal kernels * Resurrecting iq3_xs After all the experimentation, nothing was better than this. 
* Minor PPL improvement via a block scale fudge factor * Minor improvement via 3 neighbours * iq3_xs: working scalar and AVX2 dot products * iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s) * iq3_xs: working Metal implementation * Adding IQ3_M - IQ3_XS mix with mostly Q4_K * iiq3_xs: a 3.4375 bpw variant * iq3_xs: make CUDA work for new version * iq3_xs: make scalar and AVX2 work for new version * iq3_s: make ARM_NEON work with new version * iq3_xs: make new version work on metal Performance is very similar to Q3_K_S * iq3_xs: tiny Metal speed improvement * iq3_xs: tiny Metal speed improvement * Fix stupid warning * Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS * iq3_xs: rename to iq3_s * iq3_s: make tests pass * Move Q3_K_XS mix to 3.25 bpw * Attempt to fix failing tests * Another attempt to fix the Windows builds * Attempt to fix ROCm * ROCm again * iq3_s: partial fix for QK_K = 64 * iq3_s: make it work on metal for QK_K = 64 Pleasent surprise: the coding was super-block size independent, so all it took was to delete some QK_K == 256 guards. * Will this fix ROCm? --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-25 19:58:46 +02:00
UEXTM.com	1cb64f7368	Introduce backend GUIDs (ggml/743) * Introduce backend GUIDs Initial proposed implementation of backend GUIDs (Discussed in https://github.com/ggerganov/ggml/pull/741) Hardcoded CPU backend GUID (for now) Change ggml_backend_is_cpu logic to use GUID * Remove redundant functions Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion * Add spaces to match style Co-authored-by: slaren <slarengh@gmail.com> * Fix brace style to match Co-authored-by: slaren <slarengh@gmail.com> * Add void to () in function signature Co-authored-by: slaren <slarengh@gmail.com> * Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid * add guids to all backends ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-25 19:58:45 +02:00
Tamotsu Takahashi	f18738f247	talk, talk-llama : pass text_to_speak as a file (#1865) * talk-llama: pass file instead of arg it is too hard to quote text in a portable way * talk-llama: pass heard_ok as a file * talk-llama: let eleven-labs.py accept options Options: -v voice, -s savefile, -p (--play) * talk-llama: check installed commands in "speak" Pass "-q" to eleven-labs.py to skip checking whether elevenlabs is installed * talk-llama: pass voice_id again in order to sync talk with talk-llama * talk: sync with talk-llama Passing text_to_speak as a file is safer and more portable cf. https://stackoverflow.com/a/59036879/45375 * talk and talk-llama: get all installed voices in speak.ps1 * talk and talk-llama: get voices from api * talk and talk-llama: add more options to eleven-labs.py and remove DEFAULT_VOICE because it is deprecated (https://www.reddit.com/r/ElevenLabs/comments/1830abt/what_happened_to_bella/) ``` usage: eleven-labs.py [-q] [-l] [-h] [-n NAME | -v NUMBER] [-f KEY=VAL] [-s FILE | -p] [TEXTFILE] options: -q, --quick skip checking the required library action: TEXTFILE read the text file (default: stdin) -l, --list show the list of voices and exit -h, --help show this help and exit voice selection: -n NAME, --name NAME get a voice object by name (default: Arnold) -v NUMBER, --voice NUMBER get a voice object by number (see --list) -f KEY=VAL, --filter KEY=VAL filter voices by labels (default: "use case=narration") this option can be used multiple times filtering will be disabled if the first -f has no "=" (e.g. -f "any") output: -s FILE, --save FILE save the TTS to a file (default: audio.mp3) -p, --play play the TTS with ffplay ``` * examples: add speak_with_file() as suggested in the review * talk and talk-llama: ignore to_speak.txt	2024-02-24 09:24:47 +02:00
Abhilash Majumder	a0ddd8392c	whisper : add SYCL support (#1863) * add changes from llama upstream * add sycl abstraction * add sycl build * update cmake * add sycl build config * fix bug * fix bug * refactor build * fix bug * update build * call build * use sycl header * add examples * add target * fix typecast in quant.c * readd fp16 and readme * fix quant typecast * add sample * add readme * remove cxx file check	2024-02-23 09:22:24 +02:00
Georgi Gerganov	a2506909b1	talk-llama : sync llama.cpp	2024-02-22 23:30:53 +02:00
Georgi Gerganov	7b1ff212d9	sync : ggml	2024-02-22 23:25:38 +02:00
Georgi Gerganov	e5d06cfc0f	ggml : always define ggml_fp16_t as uint16_t (llama/5666) * ggml : always define ggml_fp16_t as uint16_t ggml-ci * ggml : cont ggml-ci * ggml : cont * ggml : cont ggml-ci * ggml : cont ggml-ci * cuda : no longer ggml headers last ggml-ci * ggml : fix q6_K FP16 -> FP32 conversion ggml-ci * ggml : more FP16 -> FP32 conversion fixes ggml-ci	2024-02-22 23:25:33 +02:00
Georgi Gerganov	31891db2e3	ci : fix whitespace	2024-02-22 20:20:34 +02:00
Georgi Gerganov	5fdb27ff80	ggml : 32-bit arm compat (#1891) * ggml : 32-bit arm compat * ggml : add ggml_vqtbl1q_s8 impl * ggml : cont	2024-02-22 18:31:40 +02:00
Georgi Gerganov	6b16927d18	sync : ggml	2024-02-22 15:15:38 +02:00
Georgi Gerganov	ce411498f6	sync : llama.cpp (ggml/0) ggml-ci	2024-02-22 15:12:36 +02:00
Meng, Hengyu	208de95ac7	conext add name (llama/5624) * [SYCL] conext add name * name should start with SYCL*	2024-02-22 15:12:36 +02:00
AidanBeltonS	c2ce39c795	Update ggml_sycl_op_mul_mat_vec_q (llama/5502) * Update ggml_sycl_op_mul_mat_vec_q * Apply suggestions from code review Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * revert suggestion on macro * fix bug * Add quant type GGML_TYPE_IQ1_S to unsupported * fix format --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-02-22 15:12:36 +02:00
0cc4m	8daa534818	Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init()	2024-02-22 15:12:36 +02:00
0cc4m	9fca69b410	Add check for VK_KHR_portability_enumeration for MoltenVK support	2024-02-22 15:12:36 +02:00
Mathijs de Bruin	b26c645420	Add preprocessor checks for Apple devices. Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files	2024-02-22 15:12:36 +02:00
Mathijs de Bruin	1879ec556e	Resolve ErrorIncompatibleDriver with Vulkan on MacOS. Refs: - https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f - https://github.com/SaschaWillems/Vulkan/issues/954 - https://github.com/haasn/libplacebo/issues/128 - https://github.com/KhronosGroup/Vulkan-Samples/issues/476	2024-02-22 15:12:35 +02:00
Mathijs de Bruin	c6e53cfc46	Allow for Vulkan build with Accelerate. Closes #5304	2024-02-22 15:12:35 +02:00
slaren	b19f2fb815	cuda : ignore peer access already enabled errors (llama/5597) * cuda : ignore peer access already enabled errors * fix hip	2024-02-22 15:12:35 +02:00
Siddharth Ramakrishnan	a6b0950916	ggml : compute forward no longer pass src tensors (ggml/729) * refactored compute forward to not pass in the src tensors each time * fix merge issues with flags * missed one place in the last commit to fix the is_param / flags issue * minor spacing fix * fixed some variable assignments so all tests locally are passing * new change after merge fix --------- Co-authored-by: siddharthvader <siddharth@coinlist.co>	2024-02-22 15:12:35 +02:00
bssrdf	d352dbd163	ggml : fix conv_2d batch mode (ggml/737) Co-authored-by: bssrdf <bssrdf@gmail.com>	2024-02-22 15:12:32 +02:00
st-gr	eb23f4ef16	openvino : fix convert-whisper-to-openvino.py (#1890) Fix issue: Conversion from Whisper to OpenVino failed #1870 convert-whisper-to-openvino.py stopped working with OpenVINO version 2023.0.0-10926-b4452d56304-releases/2023/0 . Error was: TypeError: load(): incompatible function arguments. The following argument types are supported: 1. (self: openvino._pyopenvino.FrontEnd, path: object) -> ov::frontend::InputModel Tested successfully with a large-v3 conversion. Co-authored-by: Stefan Grundmann <grundmanns@sandiego.gov>	2024-02-22 15:11:35 +02:00
Davidson Francis	c56344b509	main : fix file existence check in main.cpp (#1889) In commit `dda4b0e` of PR #1872, I've introduced a check for the existence of files before loading the model. However, I haven't considered the case where whisper.cpp might read from stdin as well, and in such cases, the checks should ignore the "-" argument as it does not represent a regular file. Additionally, this commit removes the usage of 'stat()' in favor of the recently introduced function 'is_file_exist()' in common.cpp from PR #1871. Apologies for the bug introduced in the previous PR and any inconvenience it may have caused.	2024-02-22 15:01:08 +02:00
Georgi Gerganov	59119f4f20	talk-llama : sync llama.cpp	2024-02-20 12:09:57 +02:00
LBlue	276615d708	make : fix CUBLAS link with WSL (#1878)	2024-02-20 12:05:38 +02:00
Georgi Gerganov	b602819b6e	sync : ggml	2024-02-19 15:54:25 +02:00
Georgi Gerganov	c2c606f05b	ggml : resolve merge conflicts (ggml/0) ggml-ci	2024-02-19 15:53:25 +02:00
Georgi Gerganov	83afebe872	common : add IQ1_S (ggml/0) ggml-ci	2024-02-19 15:53:25 +02:00
Georgi Gerganov	a4d8f9d559	ci : enable -Werror for CUDA builds (llama/5579) * cmake : pass -Werror through -Xcompiler ggml-ci * make, cmake : enable CUDA errors on warnings ggml-ci	2024-02-19 15:53:24 +02:00
slaren	5ec1e0edfa	cuda, metal : fix nans in soft_max (llama/5574) * cuda : fix nans in soft_max * metal : fix nans in soft_max --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-19 15:53:24 +02:00
bmwl	30a11b1ab8	ggml : android and old glibc NUMA incompatibility bugfixes (llama/5557) * #ifdef out some code NUMA blocks for Android due to lack of support * added in some __ANDROID__ if def gates around numa code and forced GLIBC prior to 2.29 to use a syscall for getcpu instead of the wrapper * Changed gates on numa platform specific stuff to __gnu_linux__ to skip any platforms without glibc * harmonizing #if defined blocks for numa code to __gnu_linux__ since that's the only model that's being followed anyways --------- Co-authored-by: root <root@nenya.lothlorien.ca>	2024-02-19 15:53:24 +02:00
Georgi Gerganov	f04e6b87d7	ggml : restore vec dot stride arg names (llama/5453)	2024-02-19 15:53:24 +02:00
Georgi Gerganov	0c33928b55	ci : fix wikitext url + compile warnings (llama/5569) ggml-ci	2024-02-19 15:53:24 +02:00
Georgi Gerganov	0775374750	metal : fix unused warnings (llama/0)	2024-02-19 15:53:24 +02:00
Herman Semenov	7d90bb035b	ggml, common, examples, tests : fixed type arguments in printf (llama/5528)	2024-02-19 15:53:24 +02:00
Kawrakow	2c1ad21ba8	1.5 bit quantization (llama/5453) * iq1_s: WIP basics * iq1_s: CUDA is working * iq1_s: scalar CPU dot product * iq1_s: WIP AVX2 dot product - something is not right * Fix tests * Fix shadow warnings * Fix after merge with latest master * iq1_s: AVX2 finally works * iq1_s: ARM_NEON dot product. Works, but not very fast * iq1_s: better grid * iq1_s: use IQ2_XXS for attn_output At a cost of 0.04 extra bpw this gives a big improvement in PPL. * iq1_s: Metal basics Dequantize works, but not dot product * iq1_s: Metal works, but quite slow As usual, Apple Silicon does not like the code I write. * iq1_s: Tests * iq1_s: slightly faster dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-19 15:53:23 +02:00
Georgi Gerganov	eca5ff9868	ggml : add ALiBi support for ggml_soft_max_ext (llama/5488)	2024-02-19 15:53:23 +02:00
Ananta Bastola	1b25d2fa0a	ci : add an option to fail on compile warning (llama/3952) * feat(ci): add an option to fail on compile warning * Update CMakeLists.txt * minor : fix compile warnings ggml-ci * ggml : fix unreachable code warnings ggml-ci * ci : disable fatal warnings for windows, ios and tvos * ggml : fix strncpy warning * ci : disable fatal warnings for MPI build * ci : add fatal warnings to ggml-ci ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-19 15:53:23 +02:00
Georgi Gerganov	74a6acc999	cmake : fix VULKAN and ROCm builds (llama/5525) * cmake : fix VULKAN and ROCm builds * cmake : fix (cont) * vulkan : fix compile warnings ggml-ci * cmake : fix ggml-ci * cmake : minor ggml-ci	2024-02-19 15:53:23 +02:00
bmwl	a4ed8a0821	ggml : add numa options (llama/5377) * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverted Makefile * Fixed include * Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables * removed trailing whitespace * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverting Makefile * Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet * Removing MIRROR_MODE code for this PR * Removing last bit of MIRROR_MODE code for this PR * Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static * Fixed lingering init_llama_backend() bool calls in tests and examples * Remote enum llama_numa_strategies * Revert bad merge with dynatemp flags * add missing enum ggml_numa_strategies declaration and revert sync problem with master * add missing enum ggml_numa_strategies declaration * fixed ggml_init_numa variable * Update ggml.h Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges * split numa init out from llama_backend_init and created llama_numa_init. 
Updated all code paths and samples * Fix up some boolean vs enum comparisons * Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype * Update ggml.h Align enum values Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c Remove whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c align paremeters Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/common.cpp Remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * unified ggml_numa_strategy enum and fixed text alignment in server.cpp example * Update ggml.c simplified return for platforms without NUMA support Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * removed redundant else from cli argument processing of --numa * whitespace --------- Co-authored-by: root <root@nenya.lothlorien.ca> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-02-19 15:53:23 +02:00
slaren	9f675e021c	cuda : print message when initialization fails (llama/5512) * cuda : print message when initialization fails * use CUDA_NAME both times	2024-02-19 15:53:23 +02:00
Neuman Vong	a38efcb9fd	vulkan: Find optimal memory type but with fallback (llama/5381) * @0cc4m feedback * More feedback @0cc4m	2024-02-19 15:53:22 +02:00
AT	31591649a0	Early return for zero size calls to get_tensor. (llama/5482) * Early return for zero size calls to get_tensor. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Update ggml-kompute.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-kompute.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Add an early return to the get/set tensor when the size is null. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Early return after the assertions. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Since we do the early return in the generic backend now no reason to do so here as well. Signed-off-by: Adam Treat <treat.adam@gmail.com> --------- Signed-off-by: Adam Treat <treat.adam@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-19 15:53:22 +02:00
Kawrakow	4f5c46a84f	ggml-quants : fix compiler warnings (shadow variable) (llama/5472) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-19 15:53:22 +02:00
Abhilash Majumder	462ffc58db	ggml-sycl: Replace 3d ops with macro (llama/5458) * use macro * use macro * fix format	2024-02-19 15:53:21 +02:00
Georgi Gerganov	65faae0b6a	build : update CBLAS flags + fix unused var warning (#0)	2024-02-19 14:44:46 +02:00

1131 Commits