Johannes Gäßler
bf88c94da9
CUDA: fix MMQ stream-k for --split-mode row (llama/8167)
2024-07-08 14:53:55 +03:00
John Balis
3eea171cab
feat: cuda implementation for ggml_conv_transpose_1d (ggml/854)
...
* conv transpose 1d passing test for 1d input and kernel
* working for different input and output channel counts, added test for variable stride
* initial draft appears to work with stride other than 1
* working with all old and new conv1d tests
* added a test for large tensors
* removed use cuda hardcoding
* restored test-conv-transpose.c
* removed unused arguments, and fixed bug where a test failure would cause subsequent tests to fail
* fixed accumulator bug
* added test to test-backend-ops
* fixed mistake
* addressed review
* fixed includes
* removed blank lines
* style and warning fixes
* return failure when test fails
* fix supports_op
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-07-08 14:53:55 +03:00
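For readers unfamiliar with the operation named in the commit above, here is a minimal CPU reference sketch of a 1D transposed convolution with a variable stride. It is not the CUDA kernel from the commit. The function name, the nested-list tensor layout, and the restriction to zero padding and dilation 1 are assumptions made for illustration.

```python
def conv_transpose_1d(x, w, stride=1):
    """Reference 1D transposed convolution (illustrative sketch only).

    x: input,  indexed as x[in_channel][position]
    w: kernel, indexed as w[in_channel][out_channel][tap]
    Assumes zero padding and dilation 1.
    """
    in_channels = len(x)
    in_len = len(x[0])
    out_channels = len(w[0])
    kernel_len = len(w[0][0])

    # Each input position scatters one scaled copy of the kernel into the
    # output, starting at offset position * stride.
    out_len = (in_len - 1) * stride + kernel_len
    y = [[0.0] * out_len for _ in range(out_channels)]

    for ic in range(in_channels):
        for i in range(in_len):
            for oc in range(out_channels):
                for k in range(kernel_len):
                    # Overlapping contributions accumulate in the output.
                    y[oc][i * stride + k] += x[ic][i] * w[ic][oc][k]
    return y


if __name__ == "__main__":
    signal = [[1.0, 2.0, 3.0]]   # one input channel, three samples
    kernel = [[[1.0, 1.0]]]      # one input channel, one output channel, two taps
    print(conv_transpose_1d(signal, kernel, stride=1))
    print(conv_transpose_1d(signal, kernel, stride=2))
```

With stride 1 the kernel copies overlap and are summed; with stride 2 and a two-tap kernel they sit side by side without overlap. The sum over overlapping writes is the accumulation step that a parallel implementation has to handle correctly.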
Georgi Gerganov
64a56ebf13
ci : disable java build
2024-07-08 14:26:59 +03:00
Emmanuel Schmidbauer
bec9836849
server : add inference path to make OAI API compatible ( #2270 )
2024-07-08 14:24:58 +03:00
Georgi Gerganov
c118733a29
sync : ggml + fix sync script
2024-06-26 23:20:19 +03:00
Georgi Gerganov
bb3dd45524
make : disable CUDA graphs
2024-06-26 23:20:13 +03:00
slaren
04e7fa6f4f
ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (llama/8140)
2024-06-26 23:18:11 +03:00
Georgi Gerganov
9f7f36d4c9
make : disable CUDA mel build
2024-06-26 22:25:25 +03:00
Georgi Gerganov
4a62efbb95
cmake : minor fixes
2024-06-26 21:42:39 +03:00
Georgi Gerganov
0a55a70b9b
make : fix missing -O3
...
same as https://github.com/ggerganov/llama.cpp/pull/8143
2024-06-26 21:21:12 +03:00
Georgi Gerganov
dc8cc2dd6f
whisper : disable CUDA mel + fix FFMPEG
2024-06-26 20:11:38 +03:00
Georgi Gerganov
3efedb9511
sync : ggml
2024-06-26 19:40:23 +03:00
Georgi Gerganov
e30c679928
whisper : reorganize source code + improve CMake ( #2256 )
...
* scripts : update sync [no ci]
* files : reorganize [no ci]
* sync : llama.cpp
* cmake : link math library
* cmake : build normal ggml library
* files : move headers to include
* objc : fix path to ggml-metal.h
* ci : fix WHISPER_CUDA -> GGML_CUDA
* scripts : sync LICENSE [no ci]
2024-06-26 19:34:09 +03:00
mky_coder
bf4cb4abad
whisper : optimize fft() function ( #2242 )
...
Co-authored-by: Mike Fan <60965742+mike-fzy@users.noreply.github.com>
2024-06-18 18:10:33 +03:00
Georgi Gerganov
e293f17d34
talk-llama : sync llama.cpp
2024-06-18 09:45:37 +03:00
Georgi Gerganov
5d950c4b8d
whisper : use ggml_backend_sched ( #2239 )
...
* whisper : use ggml_backend_sched (wip)
* use sched in whisper_allocr
* whisper : single backend in whisper_context
* whisper : remove whisper_state->backends_used
* whisper : remove whisper_context->backend
* whisper : reset scheduler after init
* whisper : fix external encoder (e.g. CoreML)
* whisper : cleanup
* whisper : handle null GPU buffer types + fix sycl
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-18 09:39:40 +03:00
Georgi Gerganov
820446e230
fix : remove extra files
2024-06-18 09:39:40 +03:00
Georgi Gerganov
54d5823ebe
scripts : sync ggml-blas
2024-06-18 09:39:40 +03:00
Georgi Gerganov
5181494e9f
build : update make / cmake
2024-06-18 09:39:40 +03:00
Georgi Gerganov
4a6e6e8b30
sync : ggml
2024-06-18 09:39:40 +03:00
slaren
de29b193f6
move BLAS to a separate backend (cont) (llama/6210)
...
ggml-ci
2024-06-18 09:39:40 +03:00
0cc4m
922971041b
Vulkan Shader Refactor, Memory Debugging Option (llama/7947)
...
* Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory
* Improve debug log code
* Add memory debug output option
* Fix flake8
* Fix unnecessarily high llama-3 VRAM use
2024-06-18 09:39:40 +03:00
Georgi Gerganov
63a767a134
scripts : stop sync whisper example from ggml
2024-06-18 09:39:40 +03:00
Georgi Gerganov
30841fa786
cmake : fix sycl build ( #0 )
2024-06-16 18:19:48 +03:00
Georgi Gerganov
3b1ac03828
ggml : remove OpenCL ( #0 )
2024-06-16 18:19:48 +03:00
Georgi Gerganov
990de617b5
sycl : sync ( #0 )
2024-06-16 18:19:48 +03:00
Georgi Gerganov
6975600b4b
cuda : enable CUDA graphs ( #0 )
2024-06-16 18:19:48 +03:00
Georgi Gerganov
061eeb9f61
talk-llama : sync llama.cpp
2024-06-16 18:19:48 +03:00
Georgi Gerganov
4942b1b428
cmake : fix CUDA build ( #0 )
2024-06-16 18:19:48 +03:00
Georgi Gerganov
3c7cc5c437
sync : ggml
...
ggml-ci
2024-06-16 18:19:48 +03:00
Hong Bo PENG
5cd42ee2cc
ggml : fix and optimize ppc64le (ggml/849)
...
* fix compile issues introduced by loongarch_asx
* restore quant changes to merge
* fix compile issues introduced by loongarch_asx
* further optimize by using vec_msum & vec_sum4s on ppc64le
2024-06-16 18:19:48 +03:00
Daniel Bevenius
ee718f3da6
ggml : remove duplicate include of ggml-common.h (ggml/853)
...
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-06-16 18:19:48 +03:00
Meng, Hengyu
63eac1f608
remove global variables (llama/7710)
...
* separate DPCT helpers outside
* replace global variables with context
* remove useless extra
* update mul_mat condition
* remove duplicate buft initialization
* remove duplicate extra and global work group size
* remove useless backend check
* remove duplicated extras
* use macro for group_size and remove cuda-related
2024-06-16 18:19:48 +03:00
Johannes Gäßler
b17ba2815b
CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)
...
* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
* try CI fix
* try CI fix
* try CI fix
* fix data race
* revert q2_K precision related changes
2024-06-16 18:19:48 +03:00
Georgi Gerganov
7a489af2f3
metal : utilize max shared memory for mul_mat_id (llama/7935)
2024-06-16 18:19:48 +03:00
Radoslav Gerganov
4a4ea13d6d
rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)
2024-06-16 18:19:48 +03:00
slaren
174a461fc6
move BLAS to a separate backend (llama/6210)
...
* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse same buffer when the same buffer type is used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type
This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
Johannes Gäßler
d8b7a24bc9
CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
acf3832c9c
tests : add non-cont unary tests (llama/7857)
...
* tests : add non-cont unary tests
* ggml : update unary asserts and "supports_op"
ggml-ci
2024-06-16 18:19:48 +03:00
Georgi Gerganov
d29ac44303
ggml : improve ggml_is_contiguous logic (llama/7856)
...
* ggml : improve ggml_is_contiguous logic
ggml-ci
* ggml : support more contiguous cases
ggml-ci
2024-06-16 18:19:48 +03:00
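As background for the contiguity commit above, the following is a generic sketch of a packed-layout contiguity check over shape and byte strides. It is not the ggml implementation. The function name and arguments are hypothetical, block-quantized types are ignored, and skipping the stride of size-1 dimensions is an assumption about what "more contiguous cases" could cover.

```python
def is_contiguous(shape, strides, elem_size):
    """Return True if the tensor is densely packed in memory.

    shape:     sizes per dimension, innermost dimension first
    strides:   byte strides per dimension, innermost dimension first
    elem_size: size of one element in bytes
    Illustrative sketch only; block-quantized layouts are not modelled.
    """
    expected = elem_size
    for n, stride in zip(shape, strides):
        # A dimension of size 1 is never stepped over, so its stride
        # cannot affect the memory layout and is ignored here.
        if n != 1 and stride != expected:
            return False
        expected *= n
    return True


if __name__ == "__main__":
    print(is_contiguous((4, 3), (4, 16), 4))            # packed 4x3 float32
    print(is_contiguous((3, 4), (16, 4), 4))            # transposed view
    print(is_contiguous((4, 1, 3), (4, 999, 16), 4))    # arbitrary stride on a size-1 dim
```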
k.h.lai
12638dfef0
vulkan: select only one device for single gpu with multiple drivers (llama/7582)
2024-06-16 18:19:48 +03:00
0cc4m
f100b3b523
Update Vulkan RoPE implementation (llama/7818)
...
* Update Vulkan RoPE implementation
* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception
Minor fixes
* Fix segfault when running out of VRAM
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
Johannes Gäßler
a99e213a82
CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)
2024-06-16 18:19:48 +03:00
Johannes Gäßler
7483d2b61c
CUDA: use tensor cores for MMQ (llama/7676)
...
* CUDA: int8 tensor cores for MMQ (legacy quants)
* fix out-of-bounds writes
* __builtin_assume -> GGML_CUDA_ASSUME
* fix writeback returning too early
2024-06-16 18:19:48 +03:00
Ben Ashbaugh
1fe5948227
use the correct SYCL context for host USM allocations (llama/7777)
...
Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>
2024-06-16 18:19:48 +03:00
Johannes Gäßler
760497e1ab
CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)
2024-06-16 18:19:48 +03:00
slaren
b172e7714c
vulkan : reuse parent extra for views (llama/7806)
...
* vulkan : reuse parent extra for views
* Fix validation error when multiple compute contexts are used in a graph
---------
Co-authored-by: 0cc4m <picard12@live.de>
2024-06-16 18:19:48 +03:00
pengxin99
dc01aadb18
fix wrong softmax r2r results (llama/7811)
2024-06-16 18:19:48 +03:00
Johannes Gäßler
e08c62149b
CUDA: refactor mmq, dmmv, mmvq (llama/7716)
...
* CUDA: refactor mmq, dmmv, mmvq
* fix out-of-bounds write
* struct for qk, qr, qi
* fix cmake build
* mmq_type_traits
2024-06-16 18:19:48 +03:00
Georgi Gerganov
abab4500fa
ggml : refactor rope norm/neox (llama/7634)
...
* ggml : unify rope norm/neox (CPU)
* ggml : fix compile warning
* ggml : remove GLM rope mode
ggml-ci
* metal : better rope implementation
ggml-ci
* cuda : better rope implementation
ggml-ci
* naming : n_orig_ctx -> n_ctx_orig
ggml-ci
* dev : add reminders to update backends
ggml-ci
* vulkan : fix ggml_rope_ext() usage
* cuda : fix array size + indents
ggml-ci
2024-06-16 18:19:48 +03:00