Johannes Gäßler
c262dc80e2
CPU/CUDA: fix (GQA) mul mat back, add CUDA support (llama/11380)
2025-02-03 22:00:57 +02:00
Jeff Bolz
7183a1eb72
vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (llama/11166)
...
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl
Shaders are based on cpy.cu.
* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
* ggml: copy q->f32 assumes some contiguity in the destination
2025-02-03 22:00:57 +02:00
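For context on what these copy shaders have to do, the following is a minimal CPU-side sketch of f32 -> q4_0 block quantization, assuming the usual ggml layout of 32 values per block (one scale plus 16 bytes of packed 4-bit values). It is an illustration of the block format, not the Vulkan shader code from the commit.

    // f32 -> q4_0 block quantization sketch (scale kept as float for simplicity;
    // ggml stores it as fp16). One block covers QK4_0 = 32 input values.
    #include <math.h>
    #include <stdint.h>

    #define QK4_0 32

    typedef struct {
        float   d;              // scale
        uint8_t qs[QK4_0 / 2];  // packed 4-bit quants
    } block_q4_0_sketch;

    static void quantize_block_q4_0(const float *x, block_q4_0_sketch *y) {
        float amax = 0.0f, max = 0.0f;
        for (int i = 0; i < QK4_0; ++i) {
            if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
        }
        const float d  = max / -8.0f;                    // map to signed range [-8, 7]
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y->d = d;
        for (int i = 0; i < QK4_0 / 2; ++i) {
            const uint8_t q0 = (uint8_t) fminf(15.0f, x[i]             * id + 8.5f);
            const uint8_t q1 = (uint8_t) fminf(15.0f, x[i + QK4_0 / 2] * id + 8.5f);
            y->qs[i] = q0 | (q1 << 4);                   // two 4-bit values per byte
        }
    }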
Johannes Gäßler
de49024e49
CUDA: backwards pass for misc. ops, add tests (llama/11257)
...
* CUDA: backwards pass for misc. ops, add tests
* remove restrict from pointers
2025-02-03 22:00:57 +02:00
fj-y-saito
db6383094c
ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot (llama/11227)
...
* Add SVE support for q4_K_q8_K
* Update ggml/src/ggml-cpu/ggml-cpu-quants.c
change to use K_SCALE_SIZE
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-03 22:00:57 +02:00
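The q4_K_q8_K kernel itself is considerably more involved; the sketch below only illustrates the SVE widening dot-product pattern (svdot_s32) that such int8 kernels build on. The function and the scalar tail are assumptions for the example, not code from the commit; build with SVE enabled, e.g. -march=armv8-a+sve.

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    // generic int8 dot product using SVE widening dot-product accumulation
    static int32_t dot_s8_sve(const int8_t *a, const int8_t *b, size_t n) {
        svint32_t acc = svdup_n_s32(0);
        const size_t step = svcntb();                    // bytes per SVE vector
        size_t i = 0;
        for (; i + step <= n; i += step) {
            const svbool_t pg = svptrue_b8();
            const svint8_t va = svld1_s8(pg, a + i);
            const svint8_t vb = svld1_s8(pg, b + i);
            acc = svdot_s32(acc, va, vb);                // 4-way widening multiply-add
        }
        int64_t sum = svaddv_s32(svptrue_b32(), acc);    // horizontal reduction
        for (; i < n; ++i) sum += (int64_t) a[i] * b[i]; // scalar tail
        return (int32_t) sum;
    }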
Johannes Gäßler
54a2ee648f
RoPE: fix back, CUDA support for back + noncont. (llama/11240)
...
* RoPE: fix back, CUDA support for back + noncont.
* fix comments reg. non-cont. RoPE support [no-ci]
2025-02-03 22:00:57 +02:00
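As a reminder of why the backward pass mirrors the forward one: RoPE rotates feature pairs by a position-dependent angle, and the gradient of a rotation is the rotation by the negated angle. A minimal per-pair sketch (simplified; not the ggml kernel):

    #include <math.h>

    // rotate one (x0, x1) pair by theta
    static void rope_pair(float theta, float x0, float x1, float *y0, float *y1) {
        const float c = cosf(theta), s = sinf(theta);
        *y0 = x0 * c - x1 * s;
        *y1 = x0 * s + x1 * c;
    }

    // backward pass: apply the inverse (transpose) rotation, i.e. forward with -theta
    static void rope_pair_back(float theta, float g0, float g1, float *d0, float *d1) {
        rope_pair(-theta, g0, g1, d0, d1);
    }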
issixx
f12559d590
ggml-cpu : fix ggml_graph_compute_thread not terminating on abort (ggml/1065)
...
Some threads kept looping and failed to terminate properly after an abort during CPU execution.
Co-authored-by: issi <issi@gmail.com>
2025-02-03 22:00:57 +02:00
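A simplified illustration of the failure mode and the fix: worker threads have to re-check a shared abort flag inside their compute loop, otherwise they keep looping after an abort. The names below are hypothetical, not the actual ggml threadpool code.

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool g_abort = false;   // set by the thread that detects the abort

    static void worker_loop(int n_nodes, void (*compute_node)(int)) {
        for (int i = 0; i < n_nodes; ++i) {
            if (atomic_load_explicit(&g_abort, memory_order_relaxed)) {
                return;                   // terminate instead of continuing the loop
            }
            compute_node(i);
        }
    }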
Molly Sophia
06209f6683
llama: add support for QRWKV6 model architecture (llama/11001)
...
* WIP: Add support for RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV: Some graph simplification
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Add support for RWKV6Qwen2 with cpu and cuda GLA
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix some typos
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* code format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix wkv test & add gla test
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix cuda warning
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update README.md
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update ggml/src/ggml-cuda/gla.cu
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix fused lerp weights loading with RWKV6
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* better sanity check skipping for QRWKV6 in llama-quant
thanks @compilade
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-14 10:38:01 +02:00
amritahs-ibm
124eec1664
llamafile : ppc64le MMA INT8 implementation (llama/10912)
...
This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 datatype.
It results in a 10% - 70% improvement in
total speed (i.e. all tokens / total time) across
various batch sizes.
The patch was tested with the Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2025-01-14 10:38:01 +02:00
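A minimal sketch of the POWER10 MMA builtin workflow that such int8 GEMM kernels are built from, assuming GCC/Clang with -mcpu=power10; it only illustrates the accumulator pattern, not the upstreamed llamafile kernel.

    #include <altivec.h>

    typedef vector unsigned char vec_t;

    // accumulate k rank-4 int8 outer-product updates into one 4x4 i32 tile
    static void mma_i8_tile(const vec_t *a, const vec_t *b, int k, vector signed int out[4]) {
        __vector_quad acc;
        __builtin_mma_xxsetaccz(&acc);                  // zero the accumulator
        for (int i = 0; i < k; ++i) {
            __builtin_mma_xvi8ger4pp(&acc, a[i], b[i]); // int8 GER, accumulate in int32
        }
        __builtin_mma_disassemble_acc(out, &acc);       // spill the 4x4 i32 tile
    }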
Diego Devesa
09fabffdf5
ggml-backend : only offload from host buffers (fix) (llama/11124)
2025-01-14 10:38:01 +02:00
Srihari-mcw
3fcba3e58b
ggml : fixes for AVXVNNI instruction set with MSVC and Clang (llama/11027)
...
* Fixes for clang AVX VNNI
* enable AVX VNNI and Alder Lake build for MSVC
* Apply suggestions from code review
---------
Co-authored-by: slaren <slarengh@gmail.com>
2025-01-04 10:45:01 +02:00
Djip007
bcf937c216
ggml : more performance with llamafile tinyblas on x86_64 (llama/10714)
...
* more performance with llamafile tinyblas on x86_64:
- add bf16 support
- change dispatch strategy (thanks:
https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth with a simpler, more cache-friendly tinyblas dispatch
* tinyblas dynamic dispatching
* sgemm: add M blocks
* git 2.47 uses short commit ids of length 9; show-progress is not part of GNU Wget2
* remove unstable test
2025-01-04 10:45:01 +02:00
Diego Devesa
b8d90953d7
ggml : use wstring for backend search paths (llama/10960)
...
ggml-ci
2025-01-04 10:45:01 +02:00
Diego Devesa
60a422147b
ggml : fix arm enabled features check (llama/10961)
2025-01-04 10:45:01 +02:00
Diego Devesa
3387415bad
ggml : fix const usage in SSE path (llama/10962)
2025-01-04 10:45:01 +02:00
Adrien Gallouët
6d502f33dc
ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() (llama/10874)
...
* ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml-cpu: format code
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-01-04 10:45:01 +02:00
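The intrinsics that replace the inline assembly look roughly like the sketch below: a generic int8 dot product using the SDOT instruction via vdotq_s32 (requires __ARM_FEATURE_DOTPROD). It is an illustration, not the actual ggml_gemv_q4_0_4x8_q8_0 kernel.

    #include <arm_neon.h>
    #include <stdint.h>

    // n is assumed to be a multiple of 16
    static int32_t dot_s8_neon(const int8_t *a, const int8_t *b, int n) {
        int32x4_t acc = vdupq_n_s32(0);
        for (int i = 0; i < n; i += 16) {
            const int8x16_t va = vld1q_s8(a + i);
            const int8x16_t vb = vld1q_s8(b + i);
            acc = vdotq_s32(acc, va, vb);   // SDOT: 4-way widening multiply-add
        }
        return vaddvq_s32(acc);             // horizontal sum
    }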
Diego Devesa
1462d92588
ggml : add test for SVE and disable when it fails (llama/10906)
2025-01-04 10:45:01 +02:00
Adrien Gallouët
7ba1a41f47
ggml: fix arm build with gcc (llama/10895)
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-01-04 10:45:01 +02:00
Diego Devesa
5ea088636f
ggml : fix arm build (llama/10890)
...
* ggml: GGML_NATIVE uses -mcpu=native on ARM
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml: Show detected features with GGML_NATIVE
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* remove msvc support, add GGML_CPU_ARM_ARCH option
* disable llamafile in android example
* march -> mcpu, skip adding feature macros
ggml-ci
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Adrien Gallouët <angt@huggingface.co>
2025-01-04 10:45:01 +02:00
Georgi Gerganov
6576af00d7
files : remove old sources
2024-12-18 12:52:16 +02:00
Georgi Gerganov
479499dc0e
ggml : update ggml_backend_cpu_device_supports_op (llama/10867)
...
* ggml : fix cpy op for IQ-quants to use reference impl
ggml-ci
* ggml : disable tests involving i-matrix quantization
* ggml : update ggml_backend_cpu_device_supports_op
ggml-ci
2024-12-18 12:52:16 +02:00
HimariO
e22d38e4f2
llama : add Qwen2VL support + multimodal RoPE (llama/10361)
...
* Barebones Qwen2VL LLM converter
* Add Qwen2VL cli entrypoint
* [WIP] add qwen2vl arch
* Verify m-rope output
* Add vl-rope/2d-rope support for qwen2vl ViT
* update qwen2vl cli tool
* update 5D tensor op workaround
* [WIP] qwen2vl vision model
* make batch and clip utils compatible with qwen2vl
* [WIP] create inference workflow, gguf convert script but fix
* correcting vision-rope behavior, add the missing last layer back to ViT
* add arg parser to qwen2vl_surgery
* replace variable size array with vector
* cuda-gdb cmake preset
* add fp32 mrope, vision rope kernel
* add fp16 support for qwen2vl and m-rope
* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
* fix rope op mode switching, outdated func args
* update `llama_hparams`
* update to keep up stream changes
* resolve linter, test errors
* add makefile entry, update special image padding token
* add mrope unit test, fix few compiler warnings
* rename `mrope` related function, params
* minor updates on debug util, bug fixes
* add `m-rope` testcase to `test-backend-ops`
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix trailing whitespace
* store `llama_hparams.rope_sections` with fixed size array
* update position id tensor size check in GGML_OP_ROPE
* minor updates
* update `ggml_backend_*_supports_op` of unsupported backends
* remove old `rope_section` compare operator
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-18 12:52:16 +02:00
Karol Kontny
e6eed605cf
ggml : Fix compilation issues on ARM platform when building without fp16 (llama/10811)
2024-12-18 12:52:16 +02:00
Diego Devesa
1193e494a9
remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (llama/10797)
...
other windows build fixes
2024-12-18 12:52:16 +02:00
Georgi Gerganov
d0a050b51f
ggml : disable iq4_nl interleave size 8 (llama/10709)
...
ggml-ci
2024-12-18 12:52:16 +02:00
Djip007
e990d1b791
ggml : refactor online repacking (llama/10446)
...
* rename ggml-cpu-aarch64.c to .cpp
* reformat extra cpu backend.
- clean Q4_0_N_M and IQ4_0_N_M
- remove from "file" tensor type
- allow only with dynamic repack
- extract cpu extra bufts and convert to C++
- hbm
- "aarch64"
- more generic use of extra buffer
- generalise extra_supports_op
- new API for "cpu-accel":
- amx
- aarch64
* clang-format
* Clean Q4_0_N_M ref
Enable restrict on C++
* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack
* added/corrected control on tensor size for Q4 repacking.
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add debug logs on repacks.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-18 12:52:16 +02:00
Georgi Gerganov
94e7da1ff2
cmake : fix "amd64" processor string (#2638)
2024-12-17 18:34:32 +02:00
Diego Devesa
a815940e0e
ggml : add predefined list of CPU backend variants to build (llama/10626)
...
* ggml : add predefined list of CPU backend variants to build
* update CPU dockerfiles
2024-12-08 20:14:35 +02:00
Diego Devesa
904e307bce
ggml-cpu : fix HWCAP2_I8MM value (llama/10646)
2024-12-08 20:14:35 +02:00
PAB
b7c64a4352
ggml: add GGML_SET Metal kernel + i32 CPU kernel (ggml/1037)
...
* implemented cpu kernel
* add i32 test cases in test-backend-ops
* typedef `ggml_metal_kargs_set`
* implemented `kernel_set`
* memcpy
2024-12-08 20:14:35 +02:00
PAB
7895d39508
ggml : add GGML_PAD_REFLECT_1D operation (ggml/1034)
...
* ggml_pad_reflect_1d defined in header
* implemented on CPU
* called the forward pass
* impl Metal kernel
* added Metal kernel
* added OP_PAD_REFLECT_1D in test-backend-ops.cpp
* add test-pad-reflect-1d test case
* test case support multiple backend
2024-12-08 20:14:35 +02:00
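For reference, the semantics of 1-D reflect padding on a single row can be sketched as follows (simplified; p0/p1 are the left/right pad sizes and are assumed to be smaller than the row length n):

    // reflect-pad one row of n floats into a buffer of p0 + n + p1 floats
    static void pad_reflect_1d_row(const float *src, int n, int p0, int p1, float *dst) {
        for (int i = 0; i < p0; ++i) dst[i]          = src[p0 - i];     // left mirror
        for (int i = 0; i < n;  ++i) dst[p0 + i]     = src[i];          // copy
        for (int i = 0; i < p1; ++i) dst[p0 + n + i] = src[n - 2 - i];  // right mirror
    }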
Georgi Gerganov
22616f00f9
files : remove make artifacts
2024-12-08 20:14:35 +02:00
Diego Devesa
3daeacad24
ggml : move AMX to the CPU backend (llama/10570)
...
ggml : automatic selection of best CPU backend (llama/10606)
2024-12-08 20:14:35 +02:00
Adrien Gallouët
4b7e059e15
ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() (llama/10567)
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-12-08 20:14:35 +02:00
Georgi Gerganov
3623bd58f2
ggml : fix I8MM Q4_1 scaling factor conversion (llama/10562)
...
ggml-ci
2024-12-08 20:14:35 +02:00
Shupei Fan
cb847c20a7
ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (llama/10580)
2024-12-08 20:14:35 +02:00
Georgi Gerganov
276b08d8f0
ggml : remove redundant copyright notice + update authors
2024-12-08 20:14:35 +02:00
Georgi Gerganov
4ca1e72fe0
ggml : fix row condition for i8mm kernels (llama/10561)
...
ggml-ci
2024-12-08 20:14:35 +02:00
Georgi Gerganov
16a66f103f
cmake : fix ARM feature detection (llama/10543)
...
ggml-ci
2024-12-08 20:14:35 +02:00
Shupei Fan
330273901f
ggml-cpu: support IQ4_NL_4_4 by runtime repack (llama/10541)
...
* ggml-cpu: support IQ4_NL_4_4 by runtime repack
* ggml-cpu: add __ARM_FEATURE_DOTPROD guard
2024-12-08 20:14:35 +02:00
Charles Xu
e7afb2b991
ggml-cpu: cmake add arm64 cpu feature check for macos (llama/10487)
...
* ggml-cpu: cmake add arm64 cpu feature check for macos
* use vmmlaq_s32 for compile option i8mm check
2024-12-08 20:14:35 +02:00
Diego Devesa
77e3e4a090
ggml : add support for dynamic loading of backends (llama/10469)
...
* ggml : add support for dynamic loading of backends
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-08 20:14:35 +02:00
Diego Devesa
8b1c1c30a7
ggml : do not use ARM features not included in the build (llama/10457)
2024-12-08 20:14:35 +02:00
haopeng
95e8901e71
add cmake rvv support (llama/10411)
2024-12-08 20:14:35 +02:00
FirstTimeEZ
45cf1634dc
ggml : fix undefined reference to 'getcpu' (llama/10354)
...
https://github.com/ggerganov/llama.cpp/issues/10352
2024-11-20 21:00:08 +02:00
Georgi Gerganov
d4fcdf602b
llamafile : fix include path (llama/0)
...
ggml-ci
2024-11-20 21:00:08 +02:00
Dan Johansson
ee437cde59
ggml : optimize Q4_0 into Q4_0_X_Y repack (llama/10324)
2024-11-20 21:00:08 +02:00
Srihari-mcw
c1506d38cf
Fix issues with clang-cl builds when using AVX512 flags (llama/10314)
2024-11-20 21:00:08 +02:00
Johannes Gäßler
c9541741e6
ggml: new optimization interface (ggml/988)
...
* ggml: new optimization interface
remove test2.c, test3.c
store adamw params in tensor
move grads from tensor to graph
* avoid segfault upon API misuse
* add ggml-opt.h to public headers
* remove dependence of ggml-opt.cpp on ggml-cpu.h
2024-11-20 21:00:08 +02:00
Georgi Gerganov
401fbea326
sync : leftovers (ggml/0)
...
ggml-ci
2024-11-20 21:00:08 +02:00
Eve
3216efef2e
AVX BF16 and single scale quant optimizations (llama/10212)
...
* use 128-bit loads (I've tried 256->128 to death and it's slower)
* double accumulator
* avx bf16 vec dot
* +3% q4_0 inference
* +7% tg +5% pp compared to master
* slower f16c version, kept for reference
* 256-bit version, also slow. I tried :)
* revert f16
* faster with madd
* split to functions
* Q8_0 and IQ4_NL, 5-7% faster
* fix potential overflow (performance reduced)
* 16 bit add for q4_0 only
* merge
2024-11-20 21:00:08 +02:00
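A simplified sketch of the bf16 dot-product idea at AVX2 level: widen bf16 to f32 by shifting the 16-bit values into the high half of a 32-bit lane, then FMA-accumulate. This is an illustration under assumptions (plain AVX2+FMA, n a multiple of 8), not the committed kernel.

    #include <immintrin.h>
    #include <stdint.h>

    // widen 8 bf16 values to 8 f32 values (bf16 is the upper half of an f32)
    static inline __m256 bf16_to_f32_8(const uint16_t *p) {
        const __m128i u16 = _mm_loadu_si128((const __m128i *) p);
        const __m256i u32 = _mm256_cvtepu16_epi32(u16);
        return _mm256_castsi256_ps(_mm256_slli_epi32(u32, 16));
    }

    static float dot_bf16_avx2(const uint16_t *a, const uint16_t *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            acc = _mm256_fmadd_ps(bf16_to_f32_8(a + i), bf16_to_f32_8(b + i), acc);
        }
        // horizontal sum of the 8 lanes
        __m128 t = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
        t = _mm_add_ps(t, _mm_movehl_ps(t, t));
        t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 0x55));
        return _mm_cvtss_f32(t);
    }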