whisper : add batched decoding (#1486)

* whisper : add whisper_batch * whisper : move kv_self to whisper_state * whisper : full batched decoding support * whisper : fix memory leak in whisper_batch * whisper : fix mem leak again + remove oboslete function * whisper : clear kv cache when using whisper_decode API * whisper : speed-up sampling * whisper : fix decoders initializer * bench : add batch size 5 bench * whisper : add comment about the KV cache size * whisper : add check for max number of decoders * whisper : avoid starting sampling threads with bs=1 * whisper : enable beam-search by default * cuda : sync llama.cpp fixes
2025-06-14 12:58:10 +00:00 · 2023-11-15 16:12:52 +02:00
parent d4231649e6
commit b6c5f49b78
7 changed files with 836 additions and 572 deletions
--- a/ggml-cuda.h
+++ b/ggml-cuda.h
@ -17,7 +17,12 @@ extern "C" {

 #define GGML_CUDA_MAX_DEVICES       16

+// Always success. To check if CUDA is actually loaded, use `ggml_cublas_loaded`.
 GGML_API void   ggml_init_cublas(void);
+
+// Returns `true` if there are available CUDA devices and cublas loads successfully; otherwise, it returns `false`.
+GGML_API bool   ggml_cublas_loaded(void);
+
 GGML_API void * ggml_cuda_host_malloc(size_t size);
 GGML_API void   ggml_cuda_host_free(void * ptr);