whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-05-31 06:20:58 +00:00

Author	SHA1	Message	Date
Pedro Probst	adee3f9c1f	node : add flash_attn param (#2170)	2024-05-20 09:08:48 +03:00
Tamotsu Takahashi	4798be1f9a	ci: Update build.yml to suppress warnings about node.js versions (#2166) * Update actions to suppress warnings about old node.js https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/ * Update actions/upload-artifact, specify android cmdline-tools-version * Use java 20 gradle 8.1 complains against 21 https://docs.gradle.org/current/userguide/compatibility.html	2024-05-19 11:49:26 +03:00
Georgi Gerganov	08981d1bac	release : v1.6.0 v1.6.0	2024-05-15 09:59:48 +03:00
Georgi Gerganov	7094ea5e75	whisper : use flash attention (#2152) * whisper : use flash attention in the encoder * whisper : add kv_pad * whisper : remove extra backend instance (huh?) * whisper : use FA for cross-attention * whisper : use FA for self-attention * whisper : simplify encoder FA * whisper : add flash_attn runtime parameter * scripts : add bench log * scripts : add M1 Pro bench log	2024-05-15 09:38:19 +03:00
petterreinholdtsen	9d5771ae43	talk-llama : reject runs without required arguments (#2153) * Extended talk-llama example to reject runs without required arguments. Print warning and exit if models are not specified on the command line. * Update examples/talk-llama/talk-llama.cpp * Update examples/talk-llama/talk-llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-14 21:32:41 +03:00
Georgi Gerganov	f56b8305c4	sync : ggml	2024-05-14 19:16:32 +03:00
Georgi Gerganov	1056ad762c	metal : support FA without mask + add asserts (llama/7278) * ggml : fa without mask + add asserts ggml-ci * metal : support non-contiguous KV ggml-ci	2024-05-14 19:16:29 +03:00
Radoslav Gerganov	c451080c8b	ggml : add RPC backend (llama/6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos	2024-05-14 19:16:29 +03:00
Neo Zhang	8e7c22fbdb	rm wait() (llama/7233)	2024-05-14 19:16:29 +03:00
Johannes Gäßler	e57e95eb0d	CUDA: add FP32 FlashAttention vector kernel (llama/7188) * CUDA: add FP32 FlashAttention vector kernel * fixup! CUDA: add FP32 FlashAttention vector kernel * fixup! fixup! CUDA: add FP32 FlashAttention vector kernel * fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel	2024-05-14 19:16:29 +03:00
Georgi Gerganov	130f43e4b8	scripts : sync ggml-rpc	2024-05-14 19:15:35 +03:00
thewh1teagle	d8356a1cc2	whisper : fix model path encoding in windows (#2086) * fix: model path encoding in windows * fix: convert model path to wide string only for MSVC compiler	2024-05-14 09:43:41 +03:00
Georgi Gerganov	4ef8d9f44e	server : return utf-8 (#2138)	2024-05-13 15:33:46 +03:00
Pedro Probst	3928dbd206	node : add audio_ctx and audio buffer params (#2123) * node : add audio_ctx param * node : support passing audio buffer directly * node : parse audio_ctx in index.js --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-13 15:22:23 +03:00
aldorof	2ced6f0742	cmake : fix HIP/ROCm build (#2102)	2024-05-13 15:18:43 +03:00
valVk	30f73109b8	node : add additional params (#2000) * Add additional params to addon.node * Add comma_in_time as parameter * Fix tests	2024-05-13 15:15:43 +03:00
Mark Karpelès	17fa62d3d3	js : remove un-needed request header from fetchRemote (#2119)	2024-05-13 15:13:19 +03:00
Georgi Gerganov	1da5edcde0	cmake : fix metal embed sources path (#2110)	2024-05-13 15:09:59 +03:00
Daniel Ziegenberg	0bb05b113d	main : dont print timings with --no-prints (#2108) Signed-off-by: Daniel Ziegenberg <daniel@ziegenberg.at>	2024-05-13 15:00:19 +03:00
Daniel Ziegenberg	f141b2b938	main : add options for temperature control (#2088) Add two options: ``` -tp, --temperature N [0.00 ] The sampling temperature, between 0 and 1 -tpi, --temperature-inc N [0.20 ] The increment of temperature, between 0 and 1 ``` The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit. Signed-off-by: Daniel Ziegenberg <daniel@ziegenberg.at>	2024-05-13 14:59:44 +03:00
Georgi Gerganov	2b434c449e	whisper : switch back to F32 mask (#0)	2024-05-13 14:43:43 +03:00
zhangjixiong	e93081f83f	whisper.android : update example, add field to print timestamp (#2072)	2024-05-13 14:30:03 +03:00
Xingchen Song(宋星辰)	b6bbce4ae9	cmake : fix json INTERFACE library (#2069)	2024-05-13 14:29:39 +03:00
mashizora	7705dc52da	main : fix double quote escaping in csv output (#2090)	2024-05-13 11:55:32 +03:00
Georgi Gerganov	e6acaf9d91	metal : tune soft_max number of threads (#0)	2024-05-13 11:02:26 +03:00
Georgi Gerganov	2c81e6fd51	whisper : remove old flash attn code (#0)	2024-05-13 11:02:26 +03:00
Georgi Gerganov	9506267ce5	ggml : try fix ppc64 (#0)	2024-05-13 11:02:26 +03:00
Georgi Gerganov	fbeb80b5f0	ggml : remove oboslete alibi code (skipme) (#0)	2024-05-13 11:02:26 +03:00
Georgi Gerganov	3fa7d29876	talk-llama : sync llama.cpp	2024-05-13 11:02:26 +03:00
Georgi Gerganov	fe179ae0cc	sync : ggml	2024-05-13 11:02:26 +03:00
Hong Bo PENG	40aeeeecc4	ggml : optimize for ppc64le using VSX intrinsics (ggml/784) * optimize for ppc64le using VSX intrinsics * 1. code clean up by removing comments about overflow concern. 2. fix typo in suffix of scaling. * Continue to fix typo in suffix of scaling for QK_K <> 256 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-13 11:02:26 +03:00
Georgi Gerganov	5a863fbe18	metal : fix indent (ggml/0)	2024-05-13 11:02:26 +03:00
Georgi Gerganov	91c646c61d	ggml : restore sigmoid decl order (ggml/0)	2024-05-13 11:02:26 +03:00
Georgi Gerganov	accada542a	ggml : resolve merge (ggml/0) ggml-ci	2024-05-13 11:02:26 +03:00
Georgi Gerganov	e54329da7b	ggml : full ALiBi support (llama/7192) * ggml : full ALiBi support * ggml : update ggml_soft_max_ext() CUDA, SYCL * ggml : ggml_flash_attn_ext() support ALiBi (CPU) * ggml : ggml_flash_attn_ext() support ALiBi (Metal) * ggml : fix warning * ggml : ggml_flash_attn_ext() support ALiBi (CUDA) ggml-ci * ggml : fix assert message * vulkan : add dev notes * ggml : require mask when using ALiBi ggml-ci * convert : fix convert for refact models	2024-05-13 11:02:26 +03:00
Georgi Gerganov	284fac39fb	metal : fix flash attention kernel requirements (llama/7169) * metal : fix flash attention kernel requirements ggml-ci * metal : fix ggml_metal_supports_op ggml-ci	2024-05-13 11:02:26 +03:00
Ouadie EL FAROUKI	fe454b8d9e	Minor arithmetic improvement to mmvq wrapper kernel (llama/7172)	2024-05-13 11:02:26 +03:00
0cc4m	c114b75aee	Vulkan Bugfixes and Improvements (llama/7084) * Modify mat mat mul shader for mul_mat_id, modify mat vec mul shaders for single call batch operation * Further work towards MoE, disabled for now * Disable MoE code (not ready yet), fix a number of bugs in shaders and Vulkan code * Add softmax with f16 mask and pos buffer support * Disable mul_mat_id shaders for now * Fix flake8 * Fix validation errors caused by empty buffers on larger batch sizes	2024-05-13 11:02:26 +03:00
Johannes Gäßler	4be936b88b	CUDA: generalize FP16 fattn vec kernel (llama/7061) * CUDA: generalize FP16 fattn vec kernel * disable unsupported head sizes for AMD in test * try AMD fix * fix batch size 2-8 * partially revert changes	2024-05-13 11:02:26 +03:00
Albert Jin	26c550f772	opencl : alignment size converted from bits to bytes (llama/7090) * opencl alignment size should be converted from bits to bytes Reference: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#CL_DEVICE_MEM_BASE_ADDR_ALIGN > Alignment requirement (in bits) for sub-buffer offsets. * Update ggml-opencl.cpp for readability using division instead of shift Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-05-13 11:02:26 +03:00
agray3	24f0aa460b	Introduction of CUDA Graphs to LLama.cpp (llama/6766) * DRAFT: Introduction of CUDA Graphs to LLama.cpp * FIx issues raised in comments * Tidied to now only use CUDA runtime (not mixed with driver calls) * disable for multi-gpu and batch size > 1 * Disable CUDA graphs for old GPU arch and with env var * added missing CUDA_CHECKs * Addressed comments * further addressed comments * limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake * Added more comprehensive graph node checking * With mechanism to fall back if graph capture fails * Revert "With mechanism to fall back if graph capture fails" This reverts commit eb9f15fb6fcb81384f732c4601a5b25c016a5143. * Fall back if graph capture fails and address other comments * - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS - rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS - updated Makefile build to enable CUDA graphs - removed graph capture failure checking in ggml_cuda_error using a global variable to track this is not thread safe, but I am also not safistied with checking an error by string if this is necessary to workaround some issues with graph capture with eg. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context - fixed several resource leaks - fixed issue with zero node graphs - changed fixed size arrays to vectors - removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed - removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row - changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX - code style fixes - things to look into - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes * fix build without cuda graphs * remove outdated comment * replace minimum cc value with a constant --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-05-13 11:02:26 +03:00
Gilad S	69efc39d5c	metal : use `vm_allocate` instead of `posix_memalign` on macOS (llama/7078) * fix: use `malloc` instead of `posix_memalign` in `ggml-metal.m` to make it not crash Electron proccesses * fix: typo * fix: use `vm_allocate` instead of `posix_memalign` * fix: don't call `newBufferWithBytesNoCopy` with `NULL` when `ggml_metal_host_malloc` returns `NULL` * fix: use `vm_allocate` only on macOS	2024-05-13 11:02:26 +03:00
Justine Tunney	a2ad810118	ggml : introduce bfloat16 support (llama/6412) * Introduce bfloat16 support Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as their canonical floating point format. ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌──┴───┐┌─┴───┐ 0b0000000000000000 brain16 This encoding has the same number of exponent bits as float32. That makes conversion relatively straightforward, even in the absence of hardware support. For example, converting brain16 to binary32 means simply shifting 16 bits to the left. ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌──┴───┐┌─┴───────────────────┐ 0b00000000000000000000000000000000 IEEE binary32 The issue is that converting bf16 to fp16 can result in information loss. Only 13% of bf16 numbers can be precisely represented in fp16 which in practice ends up being 99.71% of Mistral 7b v0.2's weights however there is currently no way other than fp32 to get the others ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌─┴─┐┌─┴──────┐ 0b0000000000000000 IEEE binary16 This change fixes that, by adding a bf16 data type to GGML. Support for CPU inference has been implemented along with optimizations for the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2 improves somewhere around -0.0024 to -0.0046 compared to using fp16 * Remove GGML code that's not needed * Minimize the GGML API surface area for BF16 * Remove bf16 luts * Make the GGML header look nicer * Fix documentation * Apply ggerganov's fixes for test-backend-ops * Add BF16 code for new ggml_validate_row_data() function	2024-05-13 11:02:26 +03:00
Georgi Gerganov	1ae1a9cd56	metal : fix unused warning	2024-05-13 11:02:26 +03:00
William Tambellini	b5521fea19	Add an option to build without CUDA VMM (llama/7067) Add an option to build ggml cuda without CUDA VMM resolves https://github.com/ggerganov/llama.cpp/issues/6889 https://forums.developer.nvidia.com/t/potential-nvshmem-allocated-memory-performance-issue/275416/4	2024-05-13 11:02:26 +03:00
Xuan Son Nguyen	9b84195225	gguf-split: add --no-tensor-first-split (llama/7072)	2024-05-13 11:02:26 +03:00
Johannes Gäßler	11c1df0436	CUDA: CUDART < 11.7 workaround for __hmax, __hmax2 (llama/7019)	2024-05-13 11:02:26 +03:00
Kevin Gibbons	c754494fdd	switch to using localizedDescription (llama/7010)	2024-05-13 11:02:26 +03:00
Georgi Gerganov	1bce67999d	metal : remove deprecated error code (llama/7008)	2024-05-13 11:02:26 +03:00
Kevin Gibbons	6c39ea46b6	metal : log more info on error (llama/6987)	2024-05-13 11:02:26 +03:00
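
Note on commit a2ad810118 (ggml : introduce bfloat16 support): the message says that converting brain16 to binary32 "means simply shifting 16 bits to the left". A minimal C++ sketch of that bit manipulation follows. The function names are hypothetical and are not the ggml API; the narrowing direction shown here simply truncates, which is an assumption made for brevity and may differ from how ggml rounds. The sketch also assumes that float is a 32-bit IEEE 754 binary32.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: float is IEEE 754 binary32 on the target platform.
static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");

// Hypothetical helper (not a ggml function): widen bf16 bits to float.
// The 16 stored bits become the upper half of the binary32 pattern,
// and the lower 16 mantissa bits are filled with zeros.
inline float bf16_bits_to_float(std::uint16_t bits) {
    const std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
    float out;
    std::memcpy(&out, &wide, sizeof out);  // bit copy, avoids aliasing issues
    return out;
}

// Hypothetical helper (not a ggml function): narrow float to bf16 bits
// by dropping the low 16 mantissa bits. This truncates instead of
// rounding and does not treat NaN specially.
inline std::uint16_t float_to_bf16_bits_truncate(float value) {
    std::uint32_t wide;
    std::memcpy(&wide, &value, sizeof wide);
    return static_cast<std::uint16_t>(wide >> 16);
}
```

Widening is exact because every bf16 value has a binary32 representation; only the narrowing direction loses information.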


1288 Commits