whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-06-20 15:40:28 +00:00

Author	SHA1	Message	Date
slaren	bf5fc81a8a	ggml : fix another case of quants nans (llama/7387)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	2b07dc3186	ggml: implement quantized KV cache for FA (llama/7372)	2024-06-16 18:19:48 +03:00
slaren	951c463d39	cuda : clear error after buffer allocation failure (llama/7376)	2024-06-16 18:19:48 +03:00
fraxy-v	7f257b210f	Capture CUDA logging output (llama/7298) * logging: output capture in cuda module * fix compile error * fix: vsnprintf terminates with 0, string use not correct * post review * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
Georgi Gerganov	705fe30a02	android : use "ci-android" branch for CI (llama/7341) * android : use "ci-android" branch for CI * ggml : disable SIMD exp and silu for 32-bit ARM ggml-ci * android : do not fetch, use add_subdirectory instead * cmake : provide binary dir	2024-06-16 18:19:48 +03:00
Johannes Gäßler	45b5b95e29	CUDA: deduplicate FlashAttention code (llama/7352)	2024-06-16 18:19:48 +03:00
Engininja2	f2c47d1e6a	cuda : add half2 __shfl_xor() for ROCm 5.5 (llama/7263)	2024-06-16 18:19:48 +03:00
0cc4m	b4bb9b9036	Update and fix Vulkan soft_max and argsort implementations (llama/7237) * Update and fix Vulkan softmax implementation * Update and fix Vulkan argsort implementation	2024-06-16 18:19:48 +03:00
slaren	2bc6483299	ggml : fix quants nans when all the group weights are very close to zero (llama/7313)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	ec52f900e4	CUDA: faster large batch FA without tensor cores (llama/7314)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	77d708fabb	rpc : set SO_REUSEADDR for the server socket (llama/7320) ref: #7293	2024-06-16 18:19:48 +03:00
Herman Semenov	c00149c861	ggml-quants, llama : removed excess checks (llama/7274)	2024-06-16 18:19:48 +03:00
Justine Tunney	574661f2e6	ggml : rewrite silu and softmax for cpu (llama/7154) This change upstreams llamafile's vectorized expf() functions. This lets us compute softmax and silu more accurately than the short[65536] lookup table that GGML previously used to make this operation go faster. We can support aarch64 and sse2+ with the worst case rounding error of 2ulp. It makes make -j8 tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf go 1.5x faster for SSE2+FMA, 1.9x faster for AVX2+FMA and 2.1x on AVX512	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	7bd69349bf	rpc : add command line arg for specifying backend memory ref: #7293	2024-06-16 18:19:48 +03:00
Max Krasnyansky	488ad99c13	Add support for properly optimized Windows ARM64 builds with LLVM and MSVC (llama/7191) * logging: add proper checks for clang to avoid errors and warnings with VA_ARGS * build: add CMake Presets and toolchain files for Windows ARM64 * matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings * ci: add support for optimized Windows ARM64 builds with MSVC and LLVM * matmul-int8: fixed typos in q8_0_q8_0 matmuls Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * matmul-int8: remove unnecessary casts in q8_0_q8_0 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
kunnis	7178cceeaa	ggml : use dynamic thread scheduling for matrix multiplication (llama/6915) * Just reordering some structs. * Adding in the calls to mm_pause * Passing around the state * Renaming and moving a bunch of variables around. * Extracting the logic to its own function. * Moving some variable definitions into the chunk function. * Moving some variables around * moving src1_cont inside * Moving row_size * adding the current_chunk * Reorg the code. * Formatting to match the orig patch * starting to setup the chunking variables * Starting the buildup of the loop * The yield shouldn't be necessary. * adding the looping structure based on the chunk configuration. * Add in the re-chunking code. * Making it much more likely to rechunk. * disable resizing if numa is enabled. * Updating comments with what we've learned. * Fix formatting * Couple more formatting fixes. * More style fixes. * Fix Warnings * Going with unused because there's conditional logic that needs it. * Update ggml.c * Update ggml.c ---------	2024-06-16 18:19:48 +03:00
agray3	8d55ccdb8c	Avoid unnecessarily disabling CUDA graphs (llama/7302) As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts. This fixes the issue by avoiding the consecutive update counter from incrementing unnecessarily for tokens in which cuda graphs are disabled due to batch size > 1.	2024-06-16 18:19:48 +03:00
slaren	37a72cb170	ggml : tag ggml_tensor::backend as deprecated (llama/7290)	2024-06-16 18:19:48 +03:00
AidanBeltonS	bf9b69284f	Add missing " (llama/7303)	2024-06-16 18:19:48 +03:00
John Balis	c4de1e19df	ggml : add `ggml_upscale_ext` (ggml/814) * initial commit with CPU implementation of upscale to shape and test, cuda implementation next * experimental commit to see if dst shape is correct * test version * test * removed unnecessary params * refactor * fixed tests * ggml : metal impl + cleanup + sycl dev warnings * patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior * metal : fix upscale op to support nb00 + style --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
Georgi Gerganov	5b7073cae1	scripts : update sync	2024-06-16 12:41:42 +03:00
Borislav Stanimirov	b29b3b2924	whisper : use ggml-cuda in mel calc, set appropriate device (#2236 ) * whisper : use ggml-cuda in mel calc, set appropriate device * whisper : forbid cuda mel calc on devices with compute < 600, workaround for #2230	2024-06-13 13:16:07 +03:00
Georgi Gerganov	420b6abc54	cuda : fix HIPBLAS build (#2234 )	2024-06-11 19:14:38 +03:00
Georgi Gerganov	99804b0f3e	cuda : fix bounds check for src0 rows in MMVQ kernel (#2231 ) * cuda : fix bounds check for src0 rows in MMVQ kernel * Update ggml-cuda/mmvq.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-06-11 17:39:01 +03:00
Georgi Gerganov	c55964c956	ci : fix CUDA builds (#2232 )	2024-06-11 17:21:30 +03:00
Borislav Stanimirov	20c542c713	whisper : auto-grow working areas for mel_calc_cuda (#2227 ) * whisper : auto-grow working areas for mel_calc_cuda, fixes #2226 * whisper : only calculate mel spectrogram on GPU if audio is <= 5 min	2024-06-10 21:51:32 +03:00
Georgi Gerganov	c2bdb960cd	whisper : free whisper_mel instances (#2220 )	2024-06-10 11:00:15 +03:00
Georgi Gerganov	87acd6d629	whisper : whisper_state/backend fixes (#2217 ) * whisper : fixes * ci : WHISPER_CUBLAS -> WHISPER_CUDA	2024-06-06 18:51:36 +03:00
Borislav Stanimirov	f842d31171	whisper : calculate mel spectrogram directly into a ggml_tensor (#2208 ) * whisper : calculate mel spectrogram directly into a ggml_tensor * whisper : remove unused temp buffer from state * whisper : fix not initializing wstate.embd_enc	2024-06-06 16:20:46 +03:00
Borislav Stanimirov	ffef323c4c	whisper : add CUDA-specific computation mel spectrograms (#2206 ) * whisper : use polymorphic class to calculate mel spectrogram * whisper : add cuda-specific mel spectrogram calculation * whisper : conditionally compile cufftGetErrorString to avoid warnings * build : add new files to makefile * ruby : add new files to conf script * build : fix typo in makefile * whisper : suppress cub warning for deprecated C++ std in whisper-mel-cuda	2024-06-04 09:32:23 +03:00
Borislav Stanimirov	af5833e298	whisper : remove `speed_up` and `phase_vocoder` functions (#2198 ) whisper : fix cast warning * whisper : remove phase_vocoder functions, ref #2195 * whisper : remove speed_up from whisper_full_params, closes #2195	2024-05-31 11:37:29 +03:00
Martin Delille	b87494bb8f	readme : add conan badge (#2196 ) * Add conan badge * Fix markdown formatting	2024-05-30 15:43:28 +03:00
Carlos Zoido	ad130431aa	readme : add install instructions for Conan (#2189 )	2024-05-30 15:06:15 +03:00
Borislav Stanimirov	e130b66642	whisper: use global cache for sin/cos vals and Hann window (#2194 ) - also rename Hanning to Hann as it's named after Julius von Hann as per Wikipedia	2024-05-29 19:09:21 +03:00
Georgi Gerganov	c7b6988678	release : v1.6.2 v1.6.2	2024-05-27 10:35:09 +03:00
Georgi Gerganov	05042a782d	Revert "whisper : remove extra backend instance (huh?)" (#2182 ) This reverts commit `4caa64b73e`.	2024-05-27 10:20:25 +03:00
Daniel Valdivia	a7dc2aab16	server : fix typo (#2181 ) A simple comment typo, PR can be dismissed	2024-05-25 10:46:22 +03:00
Todd	22d46b7ba4	ruby : update bindings (#2154 ) * update library files * update whispercpp * not needed for gem	2024-05-22 23:02:52 +03:00
Georgi Gerganov	c10db6ea28	release : v1.6.1 v1.6.1	2024-05-21 18:44:37 +03:00
William Tambellini	1b51fdf170	examples : add support for decoding input with ffmpeg (Linux) (#2133 ) - search for ffmpeg libs/headers at cmake time - added ffmpeg-transcode.cpp into libcommon if ffmpeg on - hooked ffmpeg transcoding in common read_wav(...) - passed test: ./main -m ggml-base.en.bin -f samples/jfk.mp3	2024-05-21 18:31:41 +03:00
Pedro Probst	adee3f9c1f	node : add flash_attn param (#2170 )	2024-05-20 09:08:48 +03:00
Tamotsu Takahashi	4798be1f9a	ci: Update build.yml to suppress warnings about node.js versions (#2166 ) * Update actions to suppress warnings about old node.js https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/ * Update actions/upload-artifact, specify android cmdline-tools-version * Use java 20 gradle 8.1 complains against 21 https://docs.gradle.org/current/userguide/compatibility.html	2024-05-19 11:49:26 +03:00
Georgi Gerganov	08981d1bac	release : v1.6.0 v1.6.0	2024-05-15 09:59:48 +03:00
Georgi Gerganov	7094ea5e75	whisper : use flash attention (#2152 ) * whisper : use flash attention in the encoder * whisper : add kv_pad * whisper : remove extra backend instance (huh?) * whisper : use FA for cross-attention * whisper : use FA for self-attention * whisper : simplify encoder FA * whisper : add flash_attn runtime parameter * scripts : add bench log * scripts : add M1 Pro bench log	2024-05-15 09:38:19 +03:00
petterreinholdtsen	9d5771ae43	talk-llama : reject runs without required arguments (#2153 ) * Extended talk-llama example to reject runs without required arguments. Print warning and exit if models are not specified on the command line. * Update examples/talk-llama/talk-llama.cpp * Update examples/talk-llama/talk-llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-14 21:32:41 +03:00
Georgi Gerganov	f56b8305c4	sync : ggml	2024-05-14 19:16:32 +03:00
Georgi Gerganov	1056ad762c	metal : support FA without mask + add asserts (llama/7278) * ggml : fa without mask + add asserts ggml-ci * metal : support non-contiguous KV ggml-ci	2024-05-14 19:16:29 +03:00
Radoslav Gerganov	c451080c8b	ggml : add RPC backend (llama/6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos	2024-05-14 19:16:29 +03:00
Neo Zhang	8e7c22fbdb	rm wait() (llama/7233)	2024-05-14 19:16:29 +03:00
Johannes Gäßler	e57e95eb0d	CUDA: add FP32 FlashAttention vector kernel (llama/7188) * CUDA: add FP32 FlashAttention vector kernel * fixup! CUDA: add FP32 FlashAttention vector kernel * fixup! fixup! CUDA: add FP32 FlashAttention vector kernel * fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel	2024-05-14 19:16:29 +03:00

1628 Commits