whisper.cpp/ggml
agray3 042e95d92f Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816)
* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-11-01 10:19:05 +02:00
..
cmake whisper : reorganize source code + improve CMake (#2256) 2024-06-26 19:34:09 +03:00
include rpc : add backend registry / device interfaces (llama/9812) 2024-11-01 10:19:05 +02:00
src Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816) 2024-11-01 10:19:05 +02:00
.gitignore whisper : reorganize source code + improve CMake (#2256) 2024-06-26 19:34:09 +03:00
CMakeLists.txt cmake : do not hide GGML options + rename option (llama/9465) 2024-09-24 19:45:08 +03:00
ggml_vk_generate_shaders.py whisper : reorganize source code + improve CMake (#2256) 2024-06-26 19:34:09 +03:00