Georgi Gerganov
ed733e85a1
scripts : update to new build system
2024-12-09 11:30:16 +02:00
Georgi Gerganov
5980b1ae77
devops : add cmake
2024-12-08 23:09:26 +02:00
Georgi Gerganov
0415a66044
devops : update make commands
2024-12-08 23:07:29 +02:00
Georgi Gerganov
7d134e3737
ggml : remove old files (skip) (#0)
2024-12-08 23:04:26 +02:00
Georgi Gerganov
9df53b357e
ggml : sync remnants (skip) (#0)
2024-12-08 22:48:25 +02:00
Georgi Gerganov
b2115b4d9b
scripts : remove amx from sync
2024-12-08 22:48:14 +02:00
Georgi Gerganov
0164427dd5
ci : disable freeBSD builds [no ci]
2024-12-08 20:14:35 +02:00
Georgi Gerganov
627b11c78a
readme : update build instructions
2024-12-08 20:14:35 +02:00
Georgi Gerganov
472464453d
ci : disable CUDA and Android builds
2024-12-08 20:14:35 +02:00
Georgi Gerganov
11dddfbc9e
ci : disable Obj-C build + fixes
2024-12-08 20:14:35 +02:00
Georgi Gerganov
384e214cc7
make : shim cmake
2024-12-08 20:14:35 +02:00
Georgi Gerganov
f2c680f893
talk-llama : sync llama.cpp
2024-12-08 20:14:35 +02:00
Georgi Gerganov
fbe66da0e5
sync : ggml
2024-12-08 20:14:35 +02:00
Diego Devesa
a815940e0e
ggml : add predefined list of CPU backend variants to build (llama/10626)
...
* ggml : add predefined list of CPU backend variants to build
* update CPU dockerfiles
2024-12-08 20:14:35 +02:00
Diego Devesa
904e307bce
ggml-cpu : fix HWCAP2_I8MM value (llama/10646)
2024-12-08 20:14:35 +02:00
Jeff Bolz
491ec076b4
vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (llama/10642)
2024-12-08 20:14:35 +02:00
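The "fast divide" named in the commit above replaces an integer division by a fixed divisor with a multiply and a shift. The sketch below shows one common formulation of that idea (Granlund-Montgomery style), assuming 32-bit unsigned operands. The function names `init_fastdiv_values` and `fastdiv` are illustrative here and are not taken from the commit itself.

```python
MASK32 = 0xFFFFFFFF  # largest 32-bit unsigned value


def init_fastdiv_values(d):
    """Precompute a (multiplier, shift) pair for a fixed 32-bit divisor d >= 1."""
    if not 1 <= d <= MASK32:
        raise ValueError("divisor must fit in an unsigned 32-bit integer and be >= 1")
    # shift = ceil(log2(d))
    shift = 0
    while (1 << shift) < d:
        shift += 1
    # Magic multiplier; it always fits in 32 bits for 1 <= d < 2**32.
    mp = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return mp, shift


def fastdiv(n, mp, shift):
    """Compute n // d using only a multiply, an add, and shifts."""
    hi = (n * mp) >> 32          # high 32 bits of the 64-bit product
    # Python integers do not overflow; fixed-width code needs a wider
    # intermediate here (or a restricted range for n).
    return (hi + n) >> shift


if __name__ == "__main__":
    mp, shift = init_fastdiv_values(7)
    print(fastdiv(100, mp, shift))
```

The precomputation runs once per divisor on the host, so the per-element cost in a kernel is a multiply-high, an add, and a shift rather than a hardware divide.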
Nicolò Scipione
966433fdf2
SYCL : Move to compile time oneMKL interface backend selection for NVIDIA backend (llama/10584)
...
* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend
Move to compile time selection to backend to avoid latency at run time.
Add it to all mkl gemm calls and only for NVIDIA backend.
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* Formatting
* Address PR comments to increase readability
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2024-12-08 20:14:35 +02:00
Frankie Robertson
6f1ba9d82d
Avoid using __fp16 on ARM with old nvcc (llama/10616)
2024-12-08 20:14:35 +02:00
Jeff Bolz
015ecd0001
vulkan: optimize and reenable split_k (llama/10637)
...
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k
when there aren't enough workgroups to fill the shaders.
2024-12-08 20:14:35 +02:00
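The split_k technique mentioned in the commit above divides the shared (K) dimension of a matrix multiplication into slices, computes one partial product per slice, and then sums the partials in a reduce step. The following is a minimal CPU sketch of that idea with plain Python lists. The function name `matmul_split_k` and the even slicing scheme are assumptions for illustration, not the Vulkan shader code.

```python
def matmul_split_k(a, b, split_k):
    """Multiply a (m x k) by b (k x n) by splitting the K dimension into split_k slices."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + split_k - 1) // split_k  # slice length, rounded up

    # Each slice could be handled by an independent group of workers.
    partials = []
    for s in range(split_k):
        k0, k1 = s * chunk, min((s + 1) * chunk, k)
        partials.append([
            [sum(a[i][kk] * b[kk][j] for kk in range(k0, k1)) for j in range(n)]
            for i in range(m)
        ])

    # Reduce step: sum the partial results element-wise.
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_split_k(a, b, split_k=2))
```

Splitting K exposes extra parallelism when the output matrix alone yields too few independent work items, at the cost of the additional reduce pass.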
PAB
b7c64a4352
ggml: add GGML_SET Metal kernel + i32 CPU kernel (ggml/1037)
...
* implemented cpu kernel
* add i32 test cases in test-backend-ops
* typedef `ggml_metal_kargs_set`
* implemented `kernel_set`
* memcpy
2024-12-08 20:14:35 +02:00
PAB
7895d39508
ggml : add GGML_PAD_REFLECT_1D operation (ggml/1034)
...
* ggml_pad_reflect_1d defined in header
* implemented on CPU
* called the forward pass
* impl Metal kernel
* added Metal kernel
* added OP_PAD_REFLECT_1D in test-backend-ops.cpp
* add test-pad-reflect-1d test case
* test case supports multiple backends
2024-12-08 20:14:35 +02:00
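As a rough illustration of what a 1D reflect padding computes, the sketch below pads a sequence by mirroring it around its first and last elements without repeating the edge values. This is the conventional "reflect" mode found in array libraries. That the ggml operation uses exactly these semantics is an assumption, and the name `pad_reflect_1d` with parameters `p0` and `p1` is illustrative.

```python
def pad_reflect_1d(x, p0, p1):
    """Pad list x with p0 reflected values on the left and p1 on the right.

    The edge element itself is not repeated, so both pads must be
    strictly smaller than len(x).
    """
    if not (0 <= p0 < len(x) and 0 <= p1 < len(x)):
        raise ValueError("pad sizes must be in the range [0, len(x))")
    left = x[p0:0:-1]            # x[p0], ..., x[1]
    right = x[-2:-p1 - 2:-1]     # x[-2], ..., x[-p1 - 1]
    return left + x + right


if __name__ == "__main__":
    print(pad_reflect_1d([1, 2, 3, 4], 2, 1))
```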
Georgi Gerganov
22616f00f9
files : remove make artifacts
2024-12-08 20:14:35 +02:00
Georgi Gerganov
02c6fcbc2c
common : fix compile warning
...
ggml-ci
2024-12-08 20:14:35 +02:00
Diego Devesa
3daeacad24
ggml : move AMX to the CPU backend (llama/10570)
...
ggml : automatic selection of best CPU backend (llama/10606)
2024-12-08 20:14:35 +02:00
Georgi Gerganov
4d73962da4
metal : small-batch mat-mul kernels (llama/10581)
...
* metal : small-batch mat-mul kernels
ggml-ci
* metal : add rest of types
ggml-ci
* metal : final adjustments
ggml-ci
* metal : add comments
ggml-ci
2024-12-08 20:14:35 +02:00
Akarshan Biswas
068812650e
SYCL: Fix and switch to GGML_LOG system instead of fprintf (llama/10579)
...
* Switched to GGML_LOG
* Fix missing semicolon
2024-12-08 20:14:35 +02:00
Adrien Gallouët
4b7e059e15
ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() (llama/10567)
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-12-08 20:14:35 +02:00
Eve
30e35d7271
vulkan: Dynamic subgroup size support for Q6_K mat_vec (llama/10536)
...
* subgroup 64 version with subgroup add. 15% faster
scalable version
tested for subgroup sizes 16-128
* check for subgroup multiple of 16 and greater than 16
* subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45)
* force 16 sequential threads per block
* make 16 subgroup size a constant
2024-12-08 20:14:35 +02:00
Georgi Gerganov
3623bd58f2
ggml : fix I8MM Q4_1 scaling factor conversion (llama/10562)
...
ggml-ci
2024-12-08 20:14:35 +02:00
Shupei Fan
cb847c20a7
ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (llama/10580)
2024-12-08 20:14:35 +02:00
Alberto Cabrera Pérez
964b154a2a
sycl : offload of get_rows set to 0 (llama/10432)
2024-12-08 20:14:35 +02:00
Alberto Cabrera Pérez
d7c2a04bce
sycl : Reroute permuted mul_mats through oneMKL (llama/10408)
...
This PR fixes the failing MUL_MAT tests for the sycl backend.
2024-12-08 20:14:35 +02:00
Chenguang Li
2bb4ca9cba
CANN: RoPE operator optimization (llama/10563)
...
* [cann] RoPE operator optimization
* [CANN]Code Formatting
---------
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-12-08 20:14:35 +02:00
Jeff Bolz
a753a82462
vulkan: get the first command buffer submitted sooner (llama/10499)
...
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.
With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
2024-12-08 20:14:35 +02:00
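The first part of the commit above, starting with a small batch of nodes and ramping up to 100 nodes per submit, can be sketched as a simple schedule. Only the cap of 100 comes from the commit message. The initial batch size of 10, the doubling growth, and the name `submit_batch_sizes` are assumptions chosen for illustration.

```python
def submit_batch_sizes(total_nodes, first=10, cap=100, growth=2):
    """Return the number of graph nodes recorded before each submit.

    first and growth are hypothetical values; cap mirrors the
    100 nodes/submit mentioned in the commit message.
    """
    sizes = []
    batch = first
    remaining = total_nodes
    while remaining > 0:
        n = min(batch, remaining)
        sizes.append(n)
        remaining -= n
        batch = min(batch * growth, cap)  # ramp up until the cap is reached
    return sizes


if __name__ == "__main__":
    print(submit_batch_sizes(250))
```

A small first batch lets the GPU begin work while the CPU is still recording later nodes, and the cap keeps per-submit overhead low once the pipeline is busy.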
Georgi Gerganov
276b08d8f0
ggml : remove redundant copyright notice + update authors
2024-12-08 20:14:35 +02:00
Georgi Gerganov
4ca1e72fe0
ggml : fix row condition for i8mm kernels (llama/10561)
...
ggml-ci
2024-12-08 20:14:35 +02:00
Georgi Gerganov
16a66f103f
cmake : fix ARM feature detection (llama/10543)
...
ggml-ci
2024-12-08 20:14:35 +02:00
Shupei Fan
330273901f
ggml-cpu: support IQ4_NL_4_4 by runtime repack (llama/10541)
...
* ggml-cpu: support IQ4_NL_4_4 by runtime repack
* ggml-cpu: add __ARM_FEATURE_DOTPROD guard
2024-12-08 20:14:35 +02:00
Sergio López
42099a9342
kompute : improve backend to pass test_backend_ops (llama/10542)
...
* kompute: op_unary: reject unsupported parameters
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: softmax: implement ALiBi support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: rope: implement neox and phi3 support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_q4_k permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_[q4_0|q4_1|q8_0] permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_f16 permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
* kompute: op_mul_mat_q6_k permuted support
Signed-off-by: Sergio Lopez <slp@redhat.com>
---------
Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-12-08 20:14:35 +02:00
leo-pony
90dd5fca9c
CANN: Fix SOC_TYPE compile bug (llama/10519)
...
* CANN: Fix build failures on Ascend310P in two cases:
1) SOC_TYPE is specified manually
2) Some unusual compile environments
* Update the CANN backend News content: Support F16 and F32 data type models for Ascend 310P NPU.
* Fix CANN compile failure: the assert in Ascend kernel functions is not supported on some CANN versions
2024-12-08 20:14:35 +02:00
Chenguang Li
2490f2a7f8
CANN: ROPE operator optimization (llama/10540)
...
* [cann] ROPE operator optimization
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-12-08 20:14:35 +02:00
uvos
230e985633
Add some minimal optimizations for CDNA (llama/10498)
...
* Add some minimal optimizations for CDNA
* ggml_cuda: set launch bounds also for GCN as it helps there too
2024-12-08 20:14:35 +02:00
Georgi Gerganov
ae24083f23
metal : fix group_norm support condition (llama/0)
2024-12-08 20:14:35 +02:00
Jeff Bolz
6463e36369
vulkan: define all quant data structures in types.comp (llama/10440)
2024-12-08 20:14:35 +02:00
Jeff Bolz
b3301f7d82
vulkan: Handle GPUs with less shared memory (llama/10468)
...
There have been reports of failure to compile on systems with <= 32KB
of shared memory (e.g. #10037). This change makes the large tile size
fall back to a smaller size if necessary, and makes mul_mat_id fall
back to CPU if there's only 16KB of shared memory.
2024-12-08 20:14:35 +02:00
Jeff Bolz
ab5d4d93ec
vulkan: further optimize q5_k mul_mat_vec (llama/10479)
2024-12-08 20:14:35 +02:00
Jeff Bolz
2d6e9dd723
vulkan: skip integer div/mod in get_offsets for batch_idx==0 (llama/10506)
2024-12-08 20:14:35 +02:00
Jeff Bolz
2f16e51553
vulkan: optimize Q2_K and Q3_K mul_mat_vec (llama/10459)
2024-12-08 20:14:35 +02:00
R0CKSTAR
0f0994902f
mtgpu: Add MUSA_DOCKER_ARCH in Dockerfiles && update cmake and make (llama/10516)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-12-08 20:14:35 +02:00
Jeff Bolz
5e1fcc1780
vulkan: fix group_norm (llama/10496)
...
Fix bad calculation of the end of the range. Add a backend test that
covers the bad case (taken from stable diffusion).
Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439.
2024-12-08 20:14:35 +02:00