Johannes Gäßler
a16137d13d
CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681)
2024-06-16 18:19:48 +03:00
Johannes Gäßler
5582039d0a
CUDA: quantized KV support for FA vec (llama/7527)
...
* CUDA: quantized KV support for FA vec
* try CI fix
* fix commented-out kernel variants
* add q8_0 q4_0 tests
* fix nwarps > batch size
* split fattn compile via extern templates
* fix flake8
* fix metal tests
* fix cmake
* make generate_cu_files.py executable
* add autogenerated .cu files
* fix AMD
* error if type_v != FP16 and not flash_attn
* remove obsolete code
2024-06-16 18:19:48 +03:00
Georgi Gerganov
9a16c643e2
ggml : fix loongson compile warnings (llama/7537)
...
* ggml : fix loongson compile warnings
ggml-ci
* Fix loongarch quantize test failure.
Fix unexpected error introduced while rebasing the code.
* tests : disable json test due to lack of python on the CI node
ggml-ci
---------
Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>
2024-06-16 18:19:48 +03:00
Chris Elrod
10a8a23100
faster avx512 exp implementation (llama/7551)
...
* faster avx512 exp implementation
* x->r
* improve accuracy, handle special cases
* remove `e`
2024-06-16 18:19:48 +03:00
junchao-loongson
29cfeef77f
ggml : fix loongarch build (O2 issue) (llama/7636)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
e66e9ea25b
metal : remove invalid asserts (llama/7617)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
276779a849
metal : add missing asserts (llama/7617)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
1f35ce61c1
ggml : fix YARN + add tests + add asserts (llama/7617)
...
* tests : add rope tests
ggml-ci
* ggml : fixes (hopefully)
ggml-ci
* tests : add non-cont tests
ggml-ci
* cuda : add asserts for rope/norm + fix DS2
ggml-ci
* ggml : assert contiguousness
* tests : reduce RoPE tests
ggml-ci
2024-06-16 18:19:48 +03:00
Georgi Gerganov
4b19cc3ed4
cuda : non-cont concat support (llama/7610)
...
* tests : add non-cont concat tests
* cuda : non-cont concat support
ggml-ci
2024-06-16 18:19:48 +03:00
Radoslav Gerganov
a535d348dd
llama-bench : add support for the RPC backend (llama/7435)
2024-06-16 18:19:48 +03:00
slaren
8f5dc729d9
ggml : use atomic_flag for critical section (llama/7598)
...
* ggml : use atomic_flag for critical section
* add windows shims
2024-06-16 18:19:48 +03:00
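Note on the atomic_flag commit above: a minimal sketch of a critical section built on an atomic flag, written in C++ for illustration. The names critical_section_start, critical_section_try_start, critical_section_end and g_lock are hypothetical and are not taken from ggml.c; the commit's actual code and its Windows shims may differ.

```cpp
#include <atomic>
#include <thread>

// Hypothetical global lock flag; clear means "unlocked".
static std::atomic_flag g_lock = ATOMIC_FLAG_INIT;

// Try to enter the critical section once; returns true on success.
inline bool critical_section_try_start() {
    // test_and_set returns the previous value: false means we took the lock.
    return !g_lock.test_and_set(std::memory_order_acquire);
}

// Spin until the critical section is entered.
inline void critical_section_start() {
    while (!critical_section_try_start()) {
        std::this_thread::yield();  // let the current holder make progress
    }
}

// Leave the critical section so another thread can enter.
inline void critical_section_end() {
    g_lock.clear(std::memory_order_release);
}
```

The acquire/release pair orders writes made inside the section before the next entry by another thread. A spin lock like this only suits very short sections.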
Georgi Gerganov
02fc147a0b
examples : adapt to new ggml_concat (ggml/0)
2024-06-16 18:19:48 +03:00
zhouwg
109148ac84
ggml : fix typo in ggml.c (llama/7603)
2024-06-16 18:19:48 +03:00
Meng, Hengyu
3563473d2c
Align GEMM dispatch (llama/7566)
...
* align GEMM dispatch
2024-06-16 18:19:48 +03:00
Georgi Gerganov
046834198d
sycl : fix assert (llama/7563)
2024-06-16 18:19:48 +03:00
k.h.lai
0a2ad9de06
vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE (llama/7552)
2024-06-16 18:19:48 +03:00
Radoslav Gerganov
39b0640b09
rpc : resource management rework (llama/7562)
...
* rpc : resource management rework
* address review comments
2024-06-16 18:19:48 +03:00
Neo Zhang
8dca71de64
fix ggml_sycl_mul_mat_id() to match the change of api (llama/7436)
...
* fix mul_mat_id to match the change of api
* rm comment
* rm unused or duplicated code, rename per review comment
2024-06-16 18:19:48 +03:00
Georgi Gerganov
812787cbc5
ggml : generalize GGML_OP_CONCAT (llama/7563)
...
* ggml : generalize GGML_OP_CONCAT (WIP)
ggml-ci
* tests : add dim != 2 tests
* metal : generalize concat kernel
* tests : naming
* cuda : generalize concat kernel
ggml-ci
* sycl : add warning and assert
* ggml : fix op params handling
* metal : bugfix kernel
ggml-ci
* ggml : reimplement CPU and Metal
* cuda : add asserts
ggml-ci
* ggml : fix ptrs
ggml-ci
2024-06-16 18:19:48 +03:00
Djip007
68ef10805e
update HIP_UMA #7399 (llama/7414)
...
* update HIP_UMA #7399
add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled.
- get x2 on prompt eval and x1.5 on token gen with rocm6.0 on ryzen 7940HX iGPU (780M/gfx1103)
* simplify code, more consistent style
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
agray3
96fdb90f5f
Allow multiple copy function pointers for CUDA graph kernel param updates (llama/7565)
...
CUDA graphs require parameter updates to kernels associated with
GGML_OP_CPY nodes. Previously the implementation only checked for a
single CUDA kernel in such nodes, but this caused a bug in cases where
2 such kernels exist. This fixes the issue by using a vector to allow
multiple function pointers to be stored and checked against.
Fixes #7942
2024-06-16 18:19:48 +03:00
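Note on the CUDA graph commit above: a minimal sketch of the idea it describes, keeping several copy-kernel function pointers in a vector and checking a node's kernel against all of them. Everything here is a hypothetical stand-in (graph_state, register_cpy_fn, is_cpy_kernel and the dummy kernels); the real code works with CUDA kernel addresses and graph node parameters.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-ins for kernel entry points.
using kernel_fn = int (*)();
inline int cpy_kernel_a() { return 1; }
inline int cpy_kernel_b() { return 2; }
inline int other_kernel() { return 3; }

struct graph_state {
    // Every copy kernel seen so far, not just a single pointer.
    std::vector<kernel_fn> cpy_fn_ptrs;
};

// Remember a copy kernel, ignoring duplicates.
inline void register_cpy_fn(graph_state & st, kernel_fn fn) {
    if (std::find(st.cpy_fn_ptrs.begin(), st.cpy_fn_ptrs.end(), fn) == st.cpy_fn_ptrs.end()) {
        st.cpy_fn_ptrs.push_back(fn);
    }
}

// True if the node's kernel is one of the known copy kernels,
// i.e. its parameters would need updating before a graph launch.
inline bool is_cpy_kernel(const graph_state & st, kernel_fn fn) {
    return std::find(st.cpy_fn_ptrs.begin(), st.cpy_fn_ptrs.end(), fn) != st.cpy_fn_ptrs.end();
}
```

With a single stored pointer, the second copy kernel would never match and its parameters would go stale; the vector lookup covers both.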
AidanBeltonS
e98f9ac554
Fix q_xxs using mul_mat_q (llama/7459)
2024-06-16 18:19:48 +03:00
AidanBeltonS
02d481595b
Add freq factors (llama/7495)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
7091c7ab5a
metal : add GGML_OP_REPEAT kernels (llama/7557)
...
ggml-ci
2024-06-16 18:19:48 +03:00
Georgi Gerganov
d70ccb75f5
metal : disable FA kernel for HS=256 (llama/7556)
...
ggml-ci
2024-06-16 18:19:48 +03:00
Georgi Gerganov
5ee048eb67
ggml : restore ggml_rope_xpos_inplace (ggml/0)
...
ggml-ci
2024-06-16 18:19:48 +03:00
Masaya, Kato
37ed71c964
ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (llama/7433)
...
* Add SVE support for q4_0_q8_0 q8_0_q8_0
* remove ifdef
2024-06-16 18:19:48 +03:00
Georgi Gerganov
8cd7a3df37
ggml : silence UB sanitizer error during iq2_xxs quantization (llama/0)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
04a3279320
ggml : remove ggml_flash_attn and ggml_flash_ff (llama/7463)
...
ggml-ci
2024-06-16 18:19:48 +03:00
Georgi Gerganov
45ddda8e0c
ggml : drop support for QK_K=64 (llama/7473)
...
* ggml : drop support for QK_K=64
ggml-ci
* opencl : restore QK_K=256 define
2024-06-16 18:19:48 +03:00
0cc4m
c41317fd66
Update vulkan rope implementation to support frequency factors (llama/7475)
2024-06-16 18:19:48 +03:00
Johannes Gäßler
96b8419b27
CUDA: fix FA out-of-bounds reads (llama/7479)
2024-06-16 18:19:48 +03:00
Johannes Gäßler
3c63f4cf35
CUDA: fix FA out-of-bounds writes (llama/7465)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
5848dfd9c8
cuda : fix compile warning (llama/7454)
2024-06-16 18:19:48 +03:00
Johannes Gäßler
29ab5d0326
CUDA: remove incorrect precision check (llama/7454)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
c4d6958b3e
cuda : fix rope + add tests (llama/7452)
...
* cuda : fix rope pos data
ggml-ci
* ggml : drop mode & 1 == 1 support for ggml_rope
ggml-ci
* ggml : support freq_factors for f16 rope (CPU)
ggml-ci
* tests : add rope tests using frequency factors
ggml-ci
2024-06-16 18:19:48 +03:00
liuwei-git
c9dcb75118
llama : add phi3 128K model support (llama/7225)
...
* add phi3 128k support in convert-hf-to-gguf
* add phi3 128k support in cuda
* address build warnings on llama.cpp
* adjust index value in cuda long rope freq factors
* add long rope support in ggml cpu backend
* make freq factors only depend on ctx size
* remove unused rope scaling type 'su' from gguf converter
* fix lint warnings on convert-hf-to-gguf.py
* set to the short freq factor when context size is smaller than the trained context size
* add one line of comments
* metal : support rope freq_factors
* ggml : update ggml_rope_ext API to support freq. factors
* backends : add dev messages to support rope freq. factors
* minor : style
* tests : update to use new rope API
* backends : fix pragma semicolons
* minor : cleanup
* llama : move rope factors from KV header to tensors
* llama : remove tmp assert
* cuda : fix compile warning
* convert : read/write n_head_kv
* llama : fix uninitialized tensors
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
Georgi Gerganov
bbdbc3fc62
metal : handle F16 inf values, fix FA partial offload (llama/7434)
...
ggml-ci
2024-06-16 18:19:48 +03:00
Johannes Gäßler
28c207a541
CUDA: fix unused warning in mmq.cu (llama/7442)
2024-06-16 18:19:48 +03:00
Johannes Gäßler
c23f830983
CUDA: deduplicate mmq code (llama/7397)
2024-06-16 18:19:48 +03:00
Radoslav Gerganov
caeeb32b41
rpc : track allocated buffers (llama/7411)
...
* rpc : track allocated buffers
ref: #7407
* rpc : pack rpc_tensor tightly
2024-06-16 18:19:48 +03:00
AidanBeltonS
584cc1177a
Update SYCL upscale operation (llama/7321)
...
* Update SYCL upscale operation
* Formatting
* Remove messages
2024-06-16 18:19:48 +03:00
Herman Semenov
cc1ae10989
ggml-opencl, llama: using reserve() if count already known (llama/7272)
2024-06-16 18:19:48 +03:00
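Note on the reserve() commit above: a small sketch of why reserving helps when the element count is known up front. The helper count_reallocations is hypothetical and exists only to make the effect observable; it is not code from the commit.

```cpp
#include <cstddef>
#include <vector>

// Push n elements and count how often the vector's capacity changed,
// which corresponds to a reallocation of its storage.
inline std::size_t count_reallocations(std::size_t n, bool use_reserve) {
    std::vector<int> v;
    if (use_reserve) {
        v.reserve(n);  // one allocation up front
    }
    std::size_t changes = 0;
    std::size_t cap = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != cap) {
            ++changes;
            cap = v.capacity();
        }
    }
    return changes;
}
```

After reserve(n), the standard guarantees that no reallocation occurs until the size exceeds the reserved capacity, so the counted changes stay at zero.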
junchao-loongson
eb26f55b40
ggml : add loongarch lsx and lasx support (llama/6454)
...
* add loongarch lsx and lasx optimized code
* Add loongarch compilation support to makefile
* revert stb_image.h
* opt bytes_from_nibbles_32 and sum_i16_pairs_float
* fix undeclared
* format code
* update
* update 2
---------
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
2024-06-16 18:19:48 +03:00
Srihari-mcw
eb2b086584
Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (llama/7258)
2024-06-16 18:19:48 +03:00
0cc4m
67919cfe11
Vulkan Embedding Fix (llama/7360)
...
* Fix empty Vulkan host buffers
Add fp32 fp16 matmul shader
Fix matmul shader alignment
* Remove deprecated tensor->backend uses
* Fix Vulkan validation errors on embedding models with no offloaded layers
* Fix Vulkan llava segfault when not offloading layers
2024-06-16 18:19:48 +03:00
slaren
bf5fc81a8a
ggml : fix another case of quants nans (llama/7387)
2024-06-16 18:19:48 +03:00
Johannes Gäßler
2b07dc3186
ggml: implement quantized KV cache for FA (llama/7372)
2024-06-16 18:19:48 +03:00
slaren
951c463d39
cuda : clear error after buffer allocation failure (llama/7376)
2024-06-16 18:19:48 +03:00
fraxy-v
7f257b210f
Capture CUDA logging output (llama/7298)
...
* logging: output capture in cuda module
* fix compile error
* fix: vsnprintf terminates with 0, string use not correct
* post review
* Update llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* Update llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
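Note on the logging commit above: one of its items concerns vsnprintf's terminating 0 and string handling. A hedged sketch of the usual two-pass pattern follows; format_message is a hypothetical helper, not the function from the commit.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Format printf-style arguments into a std::string.
inline std::string format_message(const char * fmt, ...) {
    va_list args;
    va_list args_copy;
    va_start(args, fmt);
    va_copy(args_copy, args);  // a va_list cannot be reused after one pass

    // First pass: returns the length excluding the terminating '\0'.
    const int len = std::vsnprintf(nullptr, 0, fmt, args);
    va_end(args);
    if (len < 0) {
        va_end(args_copy);
        return std::string();
    }

    // Second pass: the buffer needs len + 1 bytes for the terminator.
    std::vector<char> buf(static_cast<std::size_t>(len) + 1);
    std::vsnprintf(buf.data(), buf.size(), fmt, args_copy);
    va_end(args_copy);

    // Build the string from len bytes so the '\0' is not part of it.
    return std::string(buf.data(), static_cast<std::size_t>(len));
}
```

Sizing the buffer as len + 1 while constructing the string from len characters avoids both truncating the last character and embedding a stray terminator.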