Jeff Bolz
b74b68212a
vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (llama/10206)
2024-12-18 12:52:16 +02:00
Georgi Gerganov
94e7da1ff2
cmake : fix "amd64" processor string (#2638)
2024-12-17 18:34:32 +02:00
gn64
c4aed6831e
vulkan : fix soft_max.comp division by zero (#2633)
...
This change prevents a division by zero error when p.KY is 0.
2024-12-16 12:34:38 +02:00
Georgi Gerganov
7d134e3737
ggml : remove old files (skip) (#0)
2024-12-08 23:04:26 +02:00
Georgi Gerganov
9df53b357e
ggml : sync remnants (skip) (#0)
2024-12-08 22:48:25 +02:00
Diego Devesa
a815940e0e
ggml : add predefined list of CPU backend variants to build (llama/10626)
...
* ggml : add predefined list of CPU backend variants to build
* update CPU dockerfiles
2024-12-08 20:14:35 +02:00
Diego Devesa
904e307bce
ggml-cpu : fix HWCAP2_I8MM value (llama/10646)
2024-12-08 20:14:35 +02:00
Jeff Bolz
491ec076b4
vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (llama/10642)
2024-12-08 20:14:35 +02:00
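The "fast divide" named in the commit title above replaces an integer division by a fixed divisor with one multiply and one shift. A minimal sketch of that general technique, assuming unsigned 32-bit operands; the function names and the exact constants are illustrative and are not taken from the shader code:

```python
def init_fastdiv_values(d):
    """Precompute (multiplier, shift) for a divisor d with 1 <= d < 2**32.

    Names and layout are hypothetical; this follows the classic
    "division by invariant integers" construction.
    """
    if not 1 <= d < 2**32:
        raise ValueError("divisor must fit in an unsigned 32-bit integer")
    shift = 0
    while (1 << shift) < d:  # smallest shift with 2**shift >= d
        shift += 1
    multiplier = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return multiplier, shift


def fastdiv(n, multiplier, shift):
    """Compute n // d for 0 <= n < 2**32 using a multiply and a shift."""
    hi = (n * multiplier) >> 32  # high 32 bits of the 64-bit product
    return (hi + n) >> shift


multiplier, shift = init_fastdiv_values(7)
quotient = fastdiv(100, multiplier, shift)  # same result as 100 // 7
```

The precomputation runs once per divisor on the host, so the per-element cost in a kernel is only the multiply, add, and shift.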
Nicolò Scipione
966433fdf2
SYCL : Move to compile time oneMKL interface backend selection for NVIDIA backend (llama/10584)
...
* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend
Move to compile-time backend selection to avoid latency at run time.
Add it to all mkl gemm calls and only for NVIDIA backend.
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* Formatting
* Address PR comments to increase readability
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2024-12-08 20:14:35 +02:00
Frankie Robertson
6f1ba9d82d
Avoid using __fp16 on ARM with old nvcc (llama/10616)
2024-12-08 20:14:35 +02:00
Jeff Bolz
015ecd0001
vulkan: optimize and reenable split_k (llama/10637)
...
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k
when there aren't enough workgroups to fill the shaders.
2024-12-08 20:14:35 +02:00
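For readers unfamiliar with split_k as used in the commit above: the shared K dimension of a matrix multiply is cut into slices, each slice produces a partial result independently, and a reduce step sums the partials. A minimal CPU sketch of the idea in plain Python; the function name and the list-of-lists layout are assumptions for illustration and do not mirror the Vulkan shaders:

```python
def matmul_split_k(a, b, split_k):
    """Multiply a (m x k) by b (k x n), splitting the k dimension into split_k slices."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // split_k)  # ceil(k / split_k)

    # Phase 1: each slice of K yields an independent partial product.
    partials = []
    for s in range(split_k):
        k0, k1 = s * chunk, min((s + 1) * chunk, k)
        partials.append([
            [sum(a[i][kk] * b[kk][j] for kk in range(k0, k1)) for j in range(n)]
            for i in range(m)
        ])

    # Phase 2: the reduce step sums the partial products element-wise.
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 3]]
c = matmul_split_k(a, b, split_k=2)
```

Splitting K creates more independent work items when the output matrix alone is too small to occupy the device, at the cost of the extra reduce pass.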
PAB
b7c64a4352
ggml: add GGML_SET Metal kernel + i32 CPU kernel (ggml/1037)
...
* implemented cpu kernel
* add i32 test cases in test-backend-ops
* typedef `ggml_metal_kargs_set`
* implemented `kernel_set`
* memcpy
2024-12-08 20:14:35 +02:00
PAB
7895d39508
ggml : add GGML_PAD_REFLECT_1D operation (ggml/1034)
...
* ggml_pad_reflect_1d defined in header
* implemented on CPU
* called the forward pass
* impl Metal kernel
* added Metal kernel
* added OP_PAD_REFLECT_1D in test-backend-ops.cpp
* add test-pad-reflect-1d test case
* test case supports multiple backends
2024-12-08 20:14:35 +02:00
Georgi Gerganov
22616f00f9
files : remove make artifacts
2024-12-08 20:14:35 +02:00
Diego Devesa
3daeacad24
ggml : move AMX to the CPU backend (llama/10570)
...
ggml : automatic selection of best CPU backend (llama/10606)
2024-12-08 20:14:35 +02:00
Georgi Gerganov
4d73962da4
metal : small-batch mat-mul kernels (llama/10581)
...
* metal : small-batch mat-mul kernels
ggml-ci
* metal : add rest of types
ggml-ci
* metal : final adjustments
ggml-ci
* metal : add comments
ggml-ci
2024-12-08 20:14:35 +02:00
Akarshan Biswas
068812650e
SYCL: Fix and switch to GGML_LOG system instead of fprintf (llama/10579)
...
* Switched to GGML_LOG
* Fix missing semicolon
2024-12-08 20:14:35 +02:00
Adrien Gallouët
4b7e059e15
ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() (llama/10567)
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-12-08 20:14:35 +02:00
Eve
30e35d7271
vulkan: Dynamic subgroup size support for Q6_K mat_vec (llama/10536)
...
* subgroup 64 version with subgroup add. 15% faster
scalable version
tested for subgroup sizes 16-128
* check for subgroup multiple of 16 and greater than 16
* subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45)
* force 16 sequential threads per block
* make 16 subgroup size a constant
2024-12-08 20:14:35 +02:00
Georgi Gerganov
3623bd58f2
ggml : fix I8MM Q4_1 scaling factor conversion (llama/10562)
...
ggml-ci
2024-12-08 20:14:35 +02:00
Shupei Fan
cb847c20a7
ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (llama/10580)
2024-12-08 20:14:35 +02:00
Alberto Cabrera Pérez
964b154a2a
sycl : offload of get_rows set to 0 (llama/10432)
2024-12-08 20:14:35 +02:00
Alberto Cabrera Pérez
d7c2a04bce
sycl : Reroute permuted mul_mats through oneMKL (llama/10408)
...
This PR fixes the failing MUL_MAT tests for the sycl backend.
2024-12-08 20:14:35 +02:00
Chenguang Li
2bb4ca9cba
CANN: RoPE operator optimization (llama/10563)
...
* [cann] RoPE operator optimization
* [CANN]Code Formatting
---------
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-12-08 20:14:35 +02:00
Jeff Bolz
a753a82462
vulkan: get the first command buffer submitted sooner (llama/10499)
...
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.
With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
2024-12-08 20:14:35 +02:00
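The ramp-up described in the commit above can be sketched as a schedule of submit sizes. Only the 100 nodes/submit ceiling comes from the commit message; the starting size of 5 and the doubling rule are assumptions chosen for illustration:

```python
def submit_schedule(num_nodes, first_batch=5, max_batch=100):
    """Return the number of graph nodes recorded before each submit.

    first_batch and the doubling growth are hypothetical; max_batch=100
    matches the "100 nodes/submit" figure quoted in the commit message.
    """
    sizes = []
    batch = first_batch
    remaining = num_nodes
    while remaining > 0:
        size = min(batch, remaining)
        sizes.append(size)
        remaining -= size
        batch = min(batch * 2, max_batch)  # ramp up toward the steady-state size
    return sizes


schedule = submit_schedule(250)
```

A small first batch lets the GPU start working while the CPU is still recording the rest of the graph.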
Georgi Gerganov
276b08d8f0
ggml : remove redundant copyright notice + update authors
2024-12-08 20:14:35 +02:00
Georgi Gerganov
4ca1e72fe0
ggml : fix row condition for i8mm kernels (llama/10561)
...
ggml-ci
2024-12-08 20:14:35 +02:00
Georgi Gerganov
16a66f103f
cmake : fix ARM feature detection (llama/10543)
...
ggml-ci
2024-12-08 20:14:35 +02:00
Shupei Fan
330273901f
ggml-cpu: support IQ4_NL_4_4 by runtime repack (llama/10541)
...
* ggml-cpu: support IQ4_NL_4_4 by runtime repack
* ggml-cpu: add __ARM_FEATURE_DOTPROD guard
2024-12-08 20:14:35 +02:00
Sergio López
42099a9342
kompute : improve backend to pass test_backend_ops (llama/10542)
...
* kompute: op_unary: reject unsupported parameters
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: softmax: implement ALiBi support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: rope: implement neox and phi3 support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_q4_k permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_[q4_0|q4_1|q8_0] permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_f16 permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_q6_k permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
---------
Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-12-08 20:14:35 +02:00
leo-pony
90dd5fca9c
CANN: Fix SOC_TYPE compile bug (llama/10519)
...
* CANN: Fix build failure on Ascend310P in two cases:
1) Manually specified SOC_TYPE
2) Some unusual compile environments
* Update the CANN backend news: support F16 and F32 data type models for Ascend 310P NPU.
* Fix CANN compile failure: the assert in the Ascend kernel function is not supported on some CANN versions
2024-12-08 20:14:35 +02:00
Chenguang Li
2490f2a7f8
CANN: ROPE operator optimization (llama/10540)
...
* [cann] ROPE operator optimization
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-12-08 20:14:35 +02:00
uvos
230e985633
Add some minimal optimizations for CDNA (llama/10498)
...
* Add some minimal optimizations for CDNA
* ggml_cuda: set launch bounds also for GCN as it helps there too
2024-12-08 20:14:35 +02:00
Georgi Gerganov
ae24083f23
metal : fix group_norm support condition (llama/0)
2024-12-08 20:14:35 +02:00
Jeff Bolz
6463e36369
vulkan: define all quant data structures in types.comp (llama/10440)
2024-12-08 20:14:35 +02:00
Jeff Bolz
b3301f7d82
vulkan: Handle GPUs with less shared memory (llama/10468)
...
There have been reports of failure to compile on systems with <= 32KB
of shared memory (e.g. #10037). This change makes the large tile size
fall back to a smaller size if necessary, and makes mul_mat_id fall
back to CPU if there's only 16KB of shared memory.
2024-12-08 20:14:35 +02:00
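The fallback logic in the commit above can be sketched as a selection over tile sizes ordered from largest to smallest. The per-tile byte requirements below are invented placeholders, not the values the backend uses; only the 16KB threshold for mul_mat_id comes from the commit message:

```python
KIB = 1024

# Hypothetical shared-memory requirements per tile size, largest first.
TILE_CANDIDATES = [("large", 48 * KIB), ("medium", 24 * KIB), ("small", 12 * KIB)]


def pick_tile_size(shared_mem_bytes):
    """Return the largest tile whose (assumed) requirement fits, else None."""
    for name, required in TILE_CANDIDATES:
        if required <= shared_mem_bytes:
            return name
    return None


def mul_mat_id_on_gpu(shared_mem_bytes):
    """Per the commit message, mul_mat_id falls back to CPU with only 16KB."""
    return shared_mem_bytes > 16 * KIB


tile = pick_tile_size(32 * KIB)
```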
Jeff Bolz
ab5d4d93ec
vulkan: further optimize q5_k mul_mat_vec (llama/10479)
2024-12-08 20:14:35 +02:00
Jeff Bolz
2d6e9dd723
vulkan: skip integer div/mod in get_offsets for batch_idx==0 (llama/10506)
2024-12-08 20:14:35 +02:00
Jeff Bolz
2f16e51553
vulkan: optimize Q2_K and Q3_K mul_mat_vec (llama/10459)
2024-12-08 20:14:35 +02:00
R0CKSTAR
0f0994902f
mtgpu: Add MUSA_DOCKER_ARCH in Dockerfiles && update cmake and make (llama/10516)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-12-08 20:14:35 +02:00
Jeff Bolz
5e1fcc1780
vulkan: fix group_norm (llama/10496)
...
Fix bad calculation of the end of the range. Add a backend test that
covers the bad case (taken from stable diffusion).
Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439.
2024-12-08 20:14:35 +02:00
Georgi Gerganov
48f421de23
cmake : enable warnings in llama (llama/10474)
...
* cmake : enable warnings in llama
ggml-ci
* cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS
* cmake : get_flags -> ggml_get_flags
* speculative-simple : fix warnings
* cmake : reuse ggml_get_flags
ggml-ci
* speculative-simple : fix compile warning
ggml-ci
2024-12-08 20:14:35 +02:00
Charles Xu
e7afb2b991
ggml-cpu: cmake add arm64 cpu feature check for macos (llama/10487)
...
* ggml-cpu: cmake add arm64 cpu feature check for macos
* use vmmlaq_s32 for compile option i8mm check
2024-12-08 20:14:35 +02:00
Shanshan Shen
9a5ef7b169
CANN: Improve the Inferencing Performance for Ascend NPU Device (llama/10454)
...
* improve inferencing performance for ascend npu.
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
* some modification after review
* some modifications after review
* restore some modifications
* restore some modifications
---------
Co-authored-by: shanshan shen <shanshanshen333@gmail.com>
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
2024-12-08 20:14:35 +02:00
Chenguang Li
453cc0fcf1
CANN: RoPE and CONCAT operator optimization (llama/10488)
...
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-12-08 20:14:35 +02:00
Junil Kim
78dfec6bc5
vulkan: Fix a vulkan-shaders-gen argument parsing error (llama/10484)
...
vulkan-shaders-gen was not parsing the --no-clean argument correctly.
The previous code only parsed arguments that take a value, and --no-clean
takes none, so it was never recognized. This commit correctly parses
arguments that don't have values.
2024-12-08 20:14:35 +02:00
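The parsing problem described above is common to hand-written argument loops that always consume a key and a value together. A minimal sketch of a loop that handles both forms, written in Python for illustration rather than in the tool's own language; apart from `--no-clean`, the option names in the usage line are placeholders:

```python
def parse_args(argv):
    """Parse '--key value' pairs and valueless '--flag' options into a dict.

    A valueless flag is stored with an empty string as its value.
    """
    args = {}
    i = 0
    while i < len(argv):
        key = argv[i]
        has_value = i + 1 < len(argv) and not argv[i + 1].startswith("--")
        if has_value:
            args[key] = argv[i + 1]
            i += 2  # consume the option and its value
        else:
            args[key] = ""
            i += 1  # consume only the flag
    return args


# "--input" and "--output" are hypothetical option names.
parsed = parse_args(["--input", "shaders", "--no-clean", "--output", "build"])
```

Checking whether the next token is itself an option is what lets a flag such as `--no-clean` appear anywhere in the list, including at the end.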
Georgi Gerganov
f6d518fc4c
metal : enable mat-vec kernels for bs <= 4 (llama/10491)
2024-12-08 20:14:35 +02:00
Diego Devesa
ac33379a35
llama : accept a list of devices to use to offload a model (llama/10497)
...
* llama : accept a list of devices to use to offload a model
* accept `--dev none` to completely disable offloading
* fix dev list with dl backends
* rename env parameter to LLAMA_ARG_DEVICE for consistency
2024-12-08 20:14:35 +02:00
Diego Devesa
77e3e4a090
ggml : add support for dynamic loading of backends (llama/10469)
...
* ggml : add support for dynamic loading of backends
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-08 20:14:35 +02:00
Georgi Gerganov
b840bb09be
metal : minor code formatting
2024-12-08 20:14:35 +02:00