Georgi Gerganov
64a56ebf13
ci : disable java build
2024-07-08 14:26:59 +03:00
Emmanuel Schmidbauer
bec9836849
server : add inference path to make OAI API compatible (#2270)
2024-07-08 14:24:58 +03:00
Georgi Gerganov
c118733a29
sync : ggml + fix sync script
2024-06-26 23:20:19 +03:00
Georgi Gerganov
bb3dd45524
make : disable CUDA graphs
2024-06-26 23:20:13 +03:00
slaren
04e7fa6f4f
ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (llama/8140)
2024-06-26 23:18:11 +03:00
Georgi Gerganov
9f7f36d4c9
make : disable CUDA mel build
2024-06-26 22:25:25 +03:00
Georgi Gerganov
4a62efbb95
cmake : minor fixes
2024-06-26 21:42:39 +03:00
Georgi Gerganov
0a55a70b9b
make : fix missing -O3
...
same as https://github.com/ggerganov/llama.cpp/pull/8143
2024-06-26 21:21:12 +03:00
Georgi Gerganov
dc8cc2dd6f
whisper : disable CUDA mel + fix FFMPEG
2024-06-26 20:11:38 +03:00
Georgi Gerganov
3efedb9511
sync : ggml
2024-06-26 19:40:23 +03:00
Georgi Gerganov
e30c679928
whisper : reorganize source code + improve CMake (#2256)
...
* scripts : update sync [no ci]
* files : reorganize [no ci]
* sync : llama.cpp
* cmake : link math library
* cmake : build normal ggml library
* files : move headers to include
* objc : fix path to ggml-metal.h
* ci : fix WHISPER_CUDA -> GGML_CUDA
* scripts : sync LICENSE [no ci]
2024-06-26 19:34:09 +03:00
mky_coder
bf4cb4abad
whisper : optimize fft() function (#2242)
...
Co-authored-by: Mike Fan <60965742+mike-fzy@users.noreply.github.com>
2024-06-18 18:10:33 +03:00
Georgi Gerganov
e293f17d34
talk-llama : sync llama.cpp
2024-06-18 09:45:37 +03:00
Georgi Gerganov
5d950c4b8d
whisper : use ggml_backend_sched (#2239)
...
* whisper : use ggml_backend_sched (wip)
* use sched in whisper_allocr
* whisper : single backend in whisper_context
* whisper : remove whisper_state->backends_used
* whisper : remove whisper_context->backend
* whisper : reset scheduler after init
* whisper : fix external encoder (e.g. CoreML)
* whisper : cleanup
* whisper : handle null GPU buffer types + fix sycl
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-18 09:39:40 +03:00
Georgi Gerganov
820446e230
fix : remove extra files
2024-06-18 09:39:40 +03:00
Georgi Gerganov
54d5823ebe
scripts : sync ggml-blas
2024-06-18 09:39:40 +03:00
Georgi Gerganov
5181494e9f
build : update make / cmake
2024-06-18 09:39:40 +03:00
Georgi Gerganov
4a6e6e8b30
sync : ggml
2024-06-18 09:39:40 +03:00
slaren
de29b193f6
move BLAS to a separate backend (cont) (llama/6210)
...
ggml-ci
2024-06-18 09:39:40 +03:00
0cc4m
922971041b
Vulkan Shader Refactor, Memory Debugging Option (llama/7947)
...
* Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory
* Improve debug log code
* Add memory debug output option
* Fix flake8
* Fix unnecessarily high llama-3 VRAM use
2024-06-18 09:39:40 +03:00
Georgi Gerganov
63a767a134
scripts : stop sync whisper example from ggml
2024-06-18 09:39:40 +03:00
Georgi Gerganov
30841fa786
cmake : fix sycl build (#0)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
3b1ac03828
ggml : remove OpenCL (#0)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
990de617b5
sycl : sync (#0)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
6975600b4b
cuda : enable CUDA graphs (#0)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
061eeb9f61
talk-llama : sync llama.cpp
2024-06-16 18:19:48 +03:00
Georgi Gerganov
4942b1b428
cmake : fix CUDA build (#0)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
3c7cc5c437
sync : ggml
...
ggml-ci
2024-06-16 18:19:48 +03:00
Hong Bo PENG
5cd42ee2cc
ggml : fix and optimize ppc64le (ggml/849)
...
* fix compile issues introduced by loongarch_asx
* restore quant changes to merge
* fix compile issues introduced by loongarch_asx
* further optimize by using vec_msum & vec_sum4s on ppc64le
2024-06-16 18:19:48 +03:00
Daniel Bevenius
ee718f3da6
ggml : remove duplicate include of ggml-common.h (ggml/853)
...
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-06-16 18:19:48 +03:00
Meng, Hengyu
63eac1f608
remove global variables (llama/7710)
...
* separate DPCT helpers outside
* replace global variables with context
* remove useless extra
* update mul_mat condition
* remove duplicate buft initialization
* remove duplicate extra and global work group size
* remove useless backend check
* remove duplicated extras
* use macro for group_size and remove cuda-related
2024-06-16 18:19:48 +03:00
Johannes Gäßler
b17ba2815b
CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)
...
* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
* try CI fix
* try CI fix
* try CI fix
* fix data race
* revert q2_K precision related changes
2024-06-16 18:19:48 +03:00
Georgi Gerganov
7a489af2f3
metal : utilize max shared memory for mul_mat_id (llama/7935)
2024-06-16 18:19:48 +03:00
Radoslav Gerganov
4a4ea13d6d
rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)
2024-06-16 18:19:48 +03:00
slaren
174a461fc6
move BLAS to a separate backend (llama/6210)
...
* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse same buffer when the same buffer type is used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type
This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
Johannes Gäßler
d8b7a24bc9
CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)
2024-06-16 18:19:48 +03:00
Georgi Gerganov
acf3832c9c
tests : add non-cont unary tests (llama/7857)
...
* tests : add non-cont unary tests
* ggml : update unary asserts and "supports_op"
ggml-ci
2024-06-16 18:19:48 +03:00
Georgi Gerganov
d29ac44303
ggml : improve ggml_is_contiguous logic (llama/7856)
...
* ggml : improve ggml_is_contiguous logic
ggml-ci
* ggml : support more contiguous cases
ggml-ci
2024-06-16 18:19:48 +03:00
k.h.lai
12638dfef0
vulkan: select only one device for single gpu with multiple drivers (llama/7582)
2024-06-16 18:19:48 +03:00
0cc4m
f100b3b523
Update Vulkan RoPE implementation (llama/7818)
...
* Update Vulkan RoPE implementation
* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception
Minor fixes
* Fix segfault when running out of VRAM
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
Johannes Gäßler
a99e213a82
CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)
2024-06-16 18:19:48 +03:00
Johannes Gäßler
7483d2b61c
CUDA: use tensor cores for MMQ (llama/7676)
...
* CUDA: int8 tensor cores for MMQ (legacy quants)
* fix out-of-bounds writes
* __builtin_assume -> GGML_CUDA_ASSUME
* fix writeback returning too early
2024-06-16 18:19:48 +03:00
Ben Ashbaugh
1fe5948227
use the correct SYCL context for host USM allocations (llama/7777)
...
Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>
2024-06-16 18:19:48 +03:00
Johannes Gäßler
760497e1ab
CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)
2024-06-16 18:19:48 +03:00
slaren
b172e7714c
vulkan : reuse parent extra for views (llama/7806)
...
* vulkan : reuse parent extra for views
* Fix validation error when multiple compute contexts are used in a graph
---------
Co-authored-by: 0cc4m <picard12@live.de>
2024-06-16 18:19:48 +03:00
pengxin99
dc01aadb18
fix wrong softmax r2r result (llama/7811)
2024-06-16 18:19:48 +03:00
Johannes Gäßler
e08c62149b
CUDA: refactor mmq, dmmv, mmvq (llama/7716)
...
* CUDA: refactor mmq, dmmv, mmvq
* fix out-of-bounds write
* struct for qk, qr, qi
* fix cmake build
* mmq_type_traits
2024-06-16 18:19:48 +03:00
Georgi Gerganov
abab4500fa
ggml : refactor rope norm/neox (llama/7634)
...
* ggml : unify rope norm/neox (CPU)
* ggml : fix compile warning
* ggml : remove GLM rope mode
ggml-ci
* metal : better rope implementation
ggml-ci
* cuda : better rope implementation
ggml-ci
* naming : n_orig_ctx -> n_ctx_orig
ggml-ci
* dev : add reminders to update backends
ggml-ci
* vulkan : fix ggml_rope_ext() usage
* cuda : fix array size + indents
ggml-ci
2024-06-16 18:19:48 +03:00
agray3
e666315fa8
Allow number of nodes in CUDA graph to change (llama/7738)
...
Previously the code failed to cope when the number of nodes changed in an
existing CUDA graph. This fixes the issue by removing an unnecessary
conditional.
2024-06-16 18:19:48 +03:00
Georgi Gerganov
3f869af14c
ggml : remove OpenCL (llama/7735)
...
ggml-ci
2024-06-16 18:19:48 +03:00