whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-05-28 04:54:13 +00:00

Gaurav Garg ae6a9bb9a5 CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)

- Query the number of active blocks per SM with the cudaOccupancyMaxActiveBlocksPerMultiprocessor API, and use this value to determine the optimal parallel_blocks value.
- Prefer the vector flash attention kernels over the MMA kernel for BS=1 (batch size 1).

Fixes Issue: #12182

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

2025-03-27 11:06:03 +02:00
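
The occupancy-driven choice of parallel_blocks described in the commit message above can be illustrated with a small host-side sketch. This is not the launch code from the repository: the function name choose_parallel_blocks, its parameters, and the "fullest waves" criterion are all assumptions made for illustration. In real CUDA code, blocks_per_sm would be the value reported by cudaOccupancyMaxActiveBlocksPerMultiprocessor for the kernel, and num_sms would come from the device properties.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (illustration only, not the repository's code).
// Picks how many blocks to split the work of one launch across so that
// the grid fills the GPU as evenly as possible.
//
//   blocks_per_sm : active blocks per SM, as the CUDA occupancy API would report
//   num_sms       : number of streaming multiprocessors on the device
//   base_blocks   : blocks in the grid when parallel_blocks == 1
//   max_parallel  : upper bound on the split factor
int choose_parallel_blocks(int blocks_per_sm, int num_sms,
                           int base_blocks, int max_parallel) {
    assert(blocks_per_sm > 0 && num_sms > 0);
    assert(base_blocks > 0 && max_parallel > 0);

    // Number of blocks that can be resident on the whole GPU at once.
    const int capacity = blocks_per_sm * num_sms;

    int    best     = 1;
    double best_eff = 0.0;
    for (int pb = 1; pb <= max_parallel; ++pb) {
        const int total = base_blocks * pb;
        // A "wave" is one full round of resident blocks.
        const int waves = (total + capacity - 1) / capacity;
        // Fraction of block slots doing useful work across all waves.
        const double eff = double(total) / (double(waves) * capacity);
        // Strict improvement only, so ties keep the smaller split factor.
        if (eff > best_eff + 1e-9) {
            best_eff = eff;
            best     = pb;
        }
    }
    return best;
}
```

For example, with 2 active blocks per SM on a 4-SM device (8 resident blocks) and a base grid of 2 blocks, the sketch selects a split factor of 4, which exactly fills one wave. Keeping the smallest factor on ties is a design choice of this sketch: a larger split means more partial results to combine afterwards.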

cmake

cmake: Comment out GGML_BIN_DIR for now (ggml/1139)

2025-03-27 11:06:03 +02:00

include

llama: Add support for RWKV v7 architecture (llama/12412)

2025-03-27 11:06:03 +02:00

src

CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)

2025-03-27 11:06:03 +02:00

.gitignore

whisper : reorganize source code + improve CMake (#2256)

2024-06-26 19:34:09 +03:00

CMakeLists.txt

SYCL: using graphs is configurable by environment variable and compile option (llama/12371)

2025-03-27 11:06:03 +02:00