8 Commits

Author SHA1 Message Date
Johannes Gäßler
a16137d13d CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
5582039d0a CUDA: quantized KV support for FA vec (llama/7527)
* CUDA: quantized KV support for FA vec

* try CI fix

* fix commented-out kernel variants

* add q8_0 q4_0 tests

* fix nwarps > batch size

* split fattn compile via extern templates

* fix flake8

* fix metal tests

* fix cmake

* make generate_cu_files.py executable

* add autogenerated .cu files

* fix AMD

* error if type_v != FP16 and not flash_attn

* remove obsolete code
2024-06-16 18:19:48 +03:00
Johannes Gäßler
96b8419b27 CUDA: fix FA out-of-bounds reads (llama/7479) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
3c63f4cf35 CUDA: fix FA out-of-bounds writes (llama/7465) 2024-06-16 18:19:48 +03:00
Georgi Gerganov
5848dfd9c8 cuda : fix compile warning (llama/7454) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
29ab5d0326 CUDA: remove incorrect precision check (llama/7454) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
45b5b95e29 CUDA: deduplicate FlashAttention code (llama/7352) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
ec52f900e4 CUDA: faster large batch FA without tensor cores (llama/7314) 2024-06-16 18:19:48 +03:00