9df53b357e
ggml : sync remnants (skip) (#0)
2024-12-08 22:48:25 +02:00
b2115b4d9b
scripts : remove amx from sync
2024-12-08 22:48:14 +02:00
0164427dd5
ci : disable freeBSD builds [no ci]
2024-12-08 20:14:35 +02:00
627b11c78a
readme : update build instructions
2024-12-08 20:14:35 +02:00
472464453d
ci : disable CUDA and Android builds
2024-12-08 20:14:35 +02:00
11dddfbc9e
ci : disable Obj-C build + fixes
2024-12-08 20:14:35 +02:00
384e214cc7
make : shim cmake
2024-12-08 20:14:35 +02:00
f2c680f893
talk-llama : sync llama.cpp
2024-12-08 20:14:35 +02:00
fbe66da0e5
sync : ggml
2024-12-08 20:14:35 +02:00
a815940e0e
ggml : add predefined list of CPU backend variants to build (llama/10626)
...
* ggml : add predefined list of CPU backend variants to build
* update CPU dockerfiles
2024-12-08 20:14:35 +02:00
904e307bce
ggml-cpu : fix HWCAP2_I8MM value (llama/10646)
2024-12-08 20:14:35 +02:00
491ec076b4
vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (llama/10642)
2024-12-08 20:14:35 +02:00
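The "fast divide" (mul+shift) idea named in the commit above can be sketched as follows. This is a minimal illustration of one well-known scheme (Granlund-Montgomery style division by an invariant integer), not the commit's shader code. The function names `init_fastdiv_values` and `fastdiv` and the 32-bit operand range are assumptions made for this example.

```python
# Sketch only: division by a fixed divisor d using a precomputed
# multiplier and shift. Names and the 32-bit range are assumptions.

def init_fastdiv_values(d):
    """Precompute (mp, shift) for an unsigned 32-bit divisor d >= 1."""
    if not 1 <= d < 2**32:
        raise ValueError("divisor must fit in an unsigned 32-bit integer")
    shift = 0
    while (1 << shift) < d:  # shift = ceil(log2(d))
        shift += 1
    mp = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return mp, shift

def fastdiv(n, mp, shift):
    """Return n // d for 0 <= n < 2**32 using one multiply and one shift."""
    hi = (n * mp) >> 32       # high 32 bits of the 64-bit product
    return (hi + n) >> shift  # Python integers do not overflow here

mp, shift = init_fastdiv_values(3)
print(fastdiv(10, mp, shift))  # 3
```

The multiplier and shift are computed once per divisor on the host. Each division then costs a multiply-high, an add and a shift, which is typically cheaper on a GPU than an integer divide.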
966433fdf2
SYCL : Move to compile time oneMKL interface backend selection for NVIDIA backend (llama/10584)
...
* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend
Move to compile time selection to backend to avoid latency at run time.
Add it to all mkl gemm calls and only for NVIDIA backend.
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* Formatting
* Address PR comments to increase readability
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2024-12-08 20:14:35 +02:00
6f1ba9d82d
Avoid using __fp16 on ARM with old nvcc (llama/10616)
2024-12-08 20:14:35 +02:00
015ecd0001
vulkan: optimize and reenable split_k (llama/10637)
...
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k
when there aren't enough workgroups to fill the shaders.
2024-12-08 20:14:35 +02:00
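For readers unfamiliar with split_k, the general technique can be sketched as below. It splits the shared K dimension of a matrix multiply into slices, computes one partial product per slice (each of which could go to a separate workgroup), and then sums the partials in a reduce step. This is a generic pure-Python illustration under those assumptions, not the Vulkan shader from the commit, and the function name `matmul_split_k` is hypothetical.

```python
# Sketch only: split-K matrix multiply with a separate reduce step.

def matmul_split_k(a, b, split_k):
    """Multiply a (m x k) by b (k x n), splitting k into split_k slices."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + split_k - 1) // split_k  # slice length, rounded up
    partials = []
    for s in range(split_k):
        k0, k1 = s * chunk, min((s + 1) * chunk, k)
        # Partial product over rows k0..k1-1 of b (empty slices give zeros).
        partials.append([
            [sum(a[i][kk] * b[kk][j] for kk in range(k0, k1)) for j in range(n)]
            for i in range(m)
        ])
    # Reduce step: element-wise sum of all partial results.
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_split_k(a, b, 2))  # [[58, 64], [139, 154]]
```

Splitting along K helps when the output matrix is too small to produce enough independent tiles on its own, which matches the condition described in the commit message.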
b7c64a4352
ggml: add `GGML_SET` Metal kernel + i32 CPU kernel (ggml/1037)
...
* implemented cpu kernel
* add i32 test cases in test-backend-ops
* typedef `ggml_metal_kargs_set`
* implemented `kernel_set`
* memcpy
2024-12-08 20:14:35 +02:00
7895d39508
ggml : add `GGML_PAD_REFLECT_1D` operation (ggml/1034)
...
* ggml_pad_reflect_1d defined in header
* implemented on CPU
* called the forward pass
* impl Metal kernel
* added Metal kernel
* added OP_PAD_REFLECT_1D in test-backend-ops.cpp
* add test-pad-reflect-1d test case
* test case support multiple backend
2024-12-08 20:14:35 +02:00
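As a reference for what a 1D reflect padding computes, here is a minimal sketch on a single row. It assumes the operation follows the usual "reflect" convention, where the edge element itself is not repeated (the same convention as `numpy.pad` with `mode="reflect"`). The function name `pad_reflect_1d` and its argument order are chosen for this example and are not taken from the ggml header.

```python
# Sketch only: reflect padding of one row, edge element not repeated.

def pad_reflect_1d(row, p0, p1):
    """Pad row with p0 reflected values on the left and p1 on the right."""
    n = len(row)
    if not (0 <= p0 < n and 0 <= p1 < n):
        raise ValueError("each padding must be smaller than the row length")
    left = [row[i] for i in range(p0, 0, -1)]   # row[p0], ..., row[1]
    right = [row[n - 2 - i] for i in range(p1)]  # row[n-2], row[n-3], ...
    return left + list(row) + right

print(pad_reflect_1d([1, 2, 3, 4], 2, 1))  # [3, 2, 1, 2, 3, 4, 3]
```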
22616f00f9
files : remove make artifacts
2024-12-08 20:14:35 +02:00
02c6fcbc2c
common : fix compile warning
...
ggml-ci
2024-12-08 20:14:35 +02:00
3daeacad24
ggml : move AMX to the CPU backend (llama/10570)
...
ggml : automatic selection of best CPU backend (llama/10606)
2024-12-08 20:14:35 +02:00
4d73962da4
metal : small-batch mat-mul kernels (llama/10581)
...
* metal : small-batch mat-mul kernels
ggml-ci
* metal : add rest of types
ggml-ci
* metal : final adjustments
ggml-ci
* metal : add comments
ggml-ci
2024-12-08 20:14:35 +02:00
068812650e
SYCL: Fix and switch to GGML_LOG system instead of fprintf (llama/10579)
...
* Switched to GGML_LOG
* Fix missing semicolon
2024-12-08 20:14:35 +02:00
4b7e059e15
ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() (llama/10567)
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-12-08 20:14:35 +02:00
30e35d7271
vulkan: Dynamic subgroup size support for Q6_K mat_vec (llama/10536)
...
* subgroup 64 version with subgroup add. 15% faster
scalable version
tested for subgroup sizes 16-128
* check for subgroup multiple of 16 and greater than 16
* subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45)
* force 16 sequential threads per block
* make 16 subgroup size a constant
2024-12-08 20:14:35 +02:00
3623bd58f2
ggml : fix I8MM Q4_1 scaling factor conversion (llama/10562)
...
ggml-ci
2024-12-08 20:14:35 +02:00
cb847c20a7
ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (llama/10580)
2024-12-08 20:14:35 +02:00
964b154a2a
sycl : offload of get_rows set to 0 (llama/10432)
2024-12-08 20:14:35 +02:00
d7c2a04bce
sycl : Reroute permuted mul_mats through oneMKL (llama/10408)
...
This PR fixes the failing MUL_MAT tests for the sycl backend.
2024-12-08 20:14:35 +02:00
2bb4ca9cba
CANN: RoPE operator optimization (llama/10563)
...
* [cann] RoPE operator optimization
* [CANN]Code Formatting
---------
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-12-08 20:14:35 +02:00
a753a82462
vulkan: get the first command buffer submitted sooner (llama/10499)
...
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.
With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
2024-12-08 20:14:35 +02:00
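The ramp-up described in the commit above can be illustrated with a small batching sketch: the first submits carry fewer graph nodes so the GPU gets work early, and later submits use the steady-state size. Only the 100 nodes/submit figure comes from the commit message. The initial sizes 20 and 50 and the function name `plan_submits` are assumptions made for this illustration.

```python
# Sketch only: choose (start, end) node ranges for each command buffer submit.
# The warm-up sizes (20, 50) are hypothetical; 100 is from the commit text.

def plan_submits(num_nodes, warmup=(20, 50), steady=100):
    """Return half-open node ranges, one per submit."""
    batches = []
    start = 0
    submit_count = 0
    while start < num_nodes:
        # Use the warm-up sizes first, then the steady-state size.
        size = warmup[submit_count] if submit_count < len(warmup) else steady
        end = min(start + size, num_nodes)
        batches.append((start, end))
        start = end
        submit_count += 1
    return batches

print(plan_submits(300))
# [(0, 20), (20, 70), (70, 170), (170, 270), (270, 300)]
```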
276b08d8f0
ggml : remove redundant copyright notice + update authors
2024-12-08 20:14:35 +02:00
4ca1e72fe0
ggml : fix row condition for i8mm kernels (llama/10561)
...
ggml-ci
2024-12-08 20:14:35 +02:00
16a66f103f
cmake : fix ARM feature detection (llama/10543)
...
ggml-ci
2024-12-08 20:14:35 +02:00
330273901f
ggml-cpu: support IQ4_NL_4_4 by runtime repack (llama/10541)
...
* ggml-cpu: support IQ4_NL_4_4 by runtime repack
* ggml-cpu: add __ARM_FEATURE_DOTPROD guard
2024-12-08 20:14:35 +02:00
42099a9342
kompute : improve backend to pass test_backend_ops (llama/10542)
...
* kompute: op_unary: reject unsupported parameters
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: softmax: implement ALiBi support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: rope: implement neox and phi3 support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_q4_k permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_[q4_0|q4_1|q8_0] permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_f16 permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_q6_k permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
---------
Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-12-08 20:14:35 +02:00
90dd5fca9c
CANN: Fix SOC_TYPE compile bug (llama/10519)
...
* CANN: Fix the build failure on Ascend310P in two cases:
1) SOC_TYPE is specified manually
2) Some unusual compile environments
* Update the CANN backend News content: support F16 and F32 data type models for the Ascend 310P NPU.
* Fix CANN compile failure: the assert in the Ascend kernel function is not supported on some CANN versions
2024-12-08 20:14:35 +02:00
2490f2a7f8
CANN: ROPE operator optimization (llama/10540)
...
* [cann] ROPE operator optimization
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-12-08 20:14:35 +02:00
230e985633
Add some minimal optimizations for CDNA (llama/10498)
...
* Add some minimal optimizations for CDNA
* ggml_cuda: set launch bounds also for GCN as it helps there too
2024-12-08 20:14:35 +02:00
ae24083f23
metal : fix group_norm support condition (llama/0)
2024-12-08 20:14:35 +02:00
6463e36369
vulkan: define all quant data structures in types.comp (llama/10440)
2024-12-08 20:14:35 +02:00
b3301f7d82
vulkan: Handle GPUs with less shared memory (llama/10468)
...
There have been reports of failure to compile on systems with <= 32KB
of shared memory (e.g. #10037). This change makes the large tile size
fall back to a smaller size if necessary, and makes mul_mat_id fall
back to CPU if there's only 16KB of shared memory.
2024-12-08 20:14:35 +02:00
ab5d4d93ec
vulkan: further optimize q5_k mul_mat_vec (llama/10479)
2024-12-08 20:14:35 +02:00
2d6e9dd723
vulkan: skip integer div/mod in get_offsets for batch_idx==0 (llama/10506)
2024-12-08 20:14:35 +02:00
2f16e51553
vulkan: optimize Q2_K and Q3_K mul_mat_vec (llama/10459)
2024-12-08 20:14:35 +02:00
0f0994902f
mtgpu: Add MUSA_DOCKER_ARCH in Dockerfiles && update cmake and make (llama/10516)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-12-08 20:14:35 +02:00
5e1fcc1780
vulkan: fix group_norm (llama/10496)
...
Fix bad calculation of the end of the range. Add a backend test that
covers the bad case (taken from stable diffusion).
Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439.
2024-12-08 20:14:35 +02:00
48f421de23
cmake : enable warnings in llama (llama/10474)
...
* cmake : enable warnings in llama
ggml-ci
* cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS
* cmake : get_flags -> ggml_get_flags
* speculative-simple : fix warnings
* cmake : reuse ggml_get_flags
ggml-ci
* speculative-simple : fix compile warning
ggml-ci
2024-12-08 20:14:35 +02:00
e7afb2b991
ggml-cpu: cmake add arm64 cpu feature check for macos (llama/10487)
...
* ggml-cpu: cmake add arm64 cpu feature check for macos
* use vmmlaq_s32 for compile option i8mm check
2024-12-08 20:14:35 +02:00
9a5ef7b169
CANN: Improve the Inferencing Performance for Ascend NPU Device (llama/10454)
...
* improve inferencing performance for ascend npu.
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
* some modification after review
* some modifications after review
* restore some modifications
* restore some modifications
---------
Co-authored-by: shanshan shen <shanshanshen333@gmail.com>
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
2024-12-08 20:14:35 +02:00
453cc0fcf1
CANN: RoPE and CONCAT operator optimization (llama/10488)
...
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-12-08 20:14:35 +02:00