Commit Graph

3 Commits

Author SHA1 Message Date
Johannes Gäßler
7483d2b61c CUDA: use tensor cores for MMQ (llama/7676)
* CUDA: int8 tensor cores for MMQ (legacy quants)

* fix out-of-bounds writes

* __builtin_assume -> GGML_CUDA_ASSUME

* fix writeback returning too early
2024-06-16 18:19:48 +03:00
Johannes Gäßler
a16137d13d CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
5582039d0a CUDA: quantized KV support for FA vec (llama/7527)
* CUDA: quantized KV support for FA vec

* try CI fix

* fix commented-out kernel variants

* add q8_0 q4_0 tests

* fix nwarps > batch size

* split fattn compile via extern templates

* fix flake8

* fix metal tests

* fix cmake

* make generate_cu_files.py executable

* add autogenerated .cu files

* fix AMD

* error if type_v != FP16 and not flash_attn

* remove obsolete code
2024-06-16 18:19:48 +03:00