Commit Graph

985 Commits

Author SHA1 Message Date
9e0cc28792 metal : fix deprecation warning (ggml/690) 2024-01-11 21:50:00 +02:00
73072a7c73 ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693) 2024-01-11 21:50:00 +02:00
a8ba1262ff metal : wrap each operation in debug group (ggml/690) 2024-01-11 21:50:00 +02:00
e66a9a7806 ggml : change GGML_MAX_NAME at compile time (ggml/682)
* change GGML_MAX_NAME to 128

* allow controlling the value of GGML_MAX_NAME through external macro definitions
2024-01-11 21:50:00 +02:00
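
Editor's note on the GGML_MAX_NAME commit above: letting the value be controlled through external macro definitions is typically done with a guarded default, as in this minimal sketch (an illustration of the pattern, not a verbatim quote of the ggml source):

    // Guarded default (sketch): the limit is 128 unless the build overrides it,
    // e.g. with -DGGML_MAX_NAME=256 on the compiler command line.
    #ifndef GGML_MAX_NAME
    #define GGML_MAX_NAME 128
    #endif

    struct example_tensor {
        char name[GGML_MAX_NAME];   // fixed-size name buffer sized by the macro
    };
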
338442d773 Fix execlp call (ggml/689)
NULL can be an integer constant expression with the value zero; in that case the behavior would be undefined because an incorrect type is passed to the variable arguments.
2024-01-11 21:50:00 +02:00
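
Editor's note on the execlp fix above: the sentinel terminating the variadic argument list must be a null pointer of type char *, so a bare NULL (which may expand to a plain integer 0) is cast explicitly. A minimal sketch of the pattern; the function name and command are hypothetical, not the call from the commit:

    #include <unistd.h>

    int run_player(const char * file) {
        // The variadic list must end with a (char *) null pointer; passing a bare
        // NULL that expands to the integer constant 0 is undefined behavior here.
        return execlp("play", "play", file, (char *) NULL);
    }
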
10651bddf6 SOTA 2-bit quants (llama/4773)
* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change it later; for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, but something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slightly faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slightly faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in quantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:50:00 +02:00
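
Editor's note on the iq2_xxs commit above, which restricts Q8_K quants to the -127...127 range: a minimal sketch of symmetric int8 quantization with that clamp (an assumption-level illustration, not the actual quantize_row_q8_K code):

    #include <math.h>
    #include <stdint.h>

    // Symmetric quantization of n floats to int8, clamped to [-127, 127]
    // so that -128 is never produced.
    static void quantize_symmetric_127(const float * x, int8_t * q, float * scale, int n) {
        float amax = 0.0f;
        for (int i = 0; i < n; ++i) {
            const float ax = fabsf(x[i]);
            if (ax > amax) amax = ax;
        }
        const float d  = amax / 127.0f;                 // largest value maps to +/-127
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        for (int i = 0; i < n; ++i) {
            int v = (int) roundf(x[i] * id);
            if (v >  127) v =  127;
            if (v < -127) v = -127;
            q[i] = (int8_t) v;
        }
        *scale = d;
    }
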
53d4d0b30d CUDA: fixed redundant value dequantization (llama/4809) 2024-01-11 21:50:00 +02:00
2865e4710b ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (llama/4787) 2024-01-11 21:50:00 +02:00
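Editor's note on the gfx11 commit above: __dp4a computes a dot product of four packed signed 8-bit lanes plus an accumulator, which is the operation the AMD builtin is used to provide. A plain-C reference of those semantics (a sketch for illustration, not the device code):

    #include <stdint.h>

    // Reference semantics of dp4a: sum of the four int8*int8 lane products, added to c.
    static int32_t dp4a_ref(int32_t a, int32_t b, int32_t c) {
        int32_t sum = c;
        for (int i = 0; i < 4; ++i) {
            const int8_t ai = (int8_t) (((uint32_t) a >> (8 * i)) & 0xff);
            const int8_t bi = (int8_t) (((uint32_t) b >> (8 * i)) & 0xff);
            sum += (int32_t) ai * (int32_t) bi;
        }
        return sum;
    }
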
c46a74a19d ggml : do not sched_yield when calling BLAS (llama/4761)
* ggml : do not sched_yield when calling BLAS

ggml-ci

* ggml : fix do_yield logic

ggml-ci

* ggml : simplify do_yield logic

ggml-ci
2024-01-11 21:50:00 +02:00
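
Editor's note on the sched_yield commit above: the title implies that waiting worker threads yield the CPU only when the current node is not handled by BLAS. A rough sketch of that shape (an assumption about the intent; the real ggml compute-thread logic is more involved):

    #include <sched.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    // Spin until all threads are ready for the next node; give the scheduler a
    // chance to run other work only on the non-BLAS path.
    static void wait_for_node(atomic_int * n_ready, int n_threads, bool node_uses_blas) {
        while (atomic_load(n_ready) < n_threads) {
            if (!node_uses_blas) {
                sched_yield();
            }
        }
    }
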
46dc49a6a1 ggml : include stdlib.h before intrin.h (llama/4736) 2024-01-11 21:49:59 +02:00
cc7f872131 swift : checkout ggml commit instead of branch (#1750) 2024-01-10 18:12:06 +02:00
bcc1658cd0 talk-llama : add optional Piper TTS support (#1749)
Add a commented-out command to optionally use Piper (https://github.com/rhasspy/piper) as the text-to-speech solution for the talk-llama example. Piper voices sound almost like real people, which is a big improvement over something like espeak.
2024-01-10 16:15:28 +02:00
c46886f599 server : add request path option (#1741) 2024-01-08 22:39:51 +00:00
29f78392c1 main : add cli option to disable system prints (#1740) 2024-01-08 16:41:28 +02:00
022756a872 server : fix server temperature + add temperature_inc (#1729)
* server : fix server temperature + add temperature_inc

* server : change dashes to underscores in parameter names
2024-01-07 13:35:14 +02:00
3b8c2dff57 talk-llama : sync latest llama.cpp 2024-01-06 17:22:57 +02:00
0b9af32a8b release : v1.5.4 v1.5.4 2024-01-05 17:11:27 +02:00
11b1b63b14 fix : cuda order of synchronization when setting a buffer (ggml/679)
* fix : cuda order of synchronization when setting a buffer

* also sync before memcpy

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-05 17:01:59 +02:00
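
Editor's note on the CUDA synchronization-order fix above ("also sync before memcpy"): the pattern is to wait for in-flight work on the stream before overwriting the device buffer, then copy. A host-side sketch under that assumption (illustrative, not the actual ggml-cuda backend code):

    #include <cuda_runtime_api.h>
    #include <stddef.h>

    // Ensure previously queued kernels that may touch the buffer have finished,
    // then copy the new contents in.
    static void set_device_buffer(void * dst_dev, const void * src_host, size_t n, cudaStream_t stream) {
        cudaStreamSynchronize(stream);
        cudaMemcpy(dst_dev, src_host, n, cudaMemcpyHostToDevice);
    }
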
0e26a6c92e metal : switch back to default.metallib (ggml/681)
ggml-ci
2024-01-05 16:31:30 +02:00
66d8f0b7f1 ggml : fix q2_k bpw in comments (ggml/680) 2024-01-05 16:31:20 +02:00
ba5bcde874 coreml : fix ANE optimized encoder (#1716) 2024-01-04 16:28:30 +02:00
ab0a8593c5 whisper.swiftui : add .gitignore 2024-01-04 15:00:27 +02:00
668ffc9b23 whisper : reset the "batched" timings (#1721) 2024-01-04 13:38:39 +02:00
9962371f71 release : v1.5.3 v1.5.3 2024-01-03 19:36:33 +02:00
993acb5d41 swift : update Package.swift to use ggml as package dependency (#1701)
* updates Package.swift to use ggml as dependency

* cleans up the Package.swift file by removing redundant source files

* updates ggml url src to ggerganov
2024-01-03 19:30:26 +02:00
a3d0aa73d1 ggml : add error handling to graph_compute (#1714) 2024-01-03 15:39:43 +02:00
14c57952f7 cuda : simplify expression
Co-authored-by: slaren <slarengh@gmail.com>
2024-01-03 14:43:51 +02:00
6c369d6788 cuda : mark I16 and I32 ops as unsupported
ggml-ci
2024-01-03 14:43:51 +02:00
4cdd9aad9b metal : add kernel_get_rows_i32
ggml-ci
2024-01-03 14:43:51 +02:00
f38c057503 metal : optimize ggml_mul_mat_id (faster Mixtral PP) (llama/4725)
* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (llama/4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

* metal : optimizing ggml_mul_mat_id (wip)

* metal : minor fix

* metal : opt mul_mm_id
2024-01-03 14:43:51 +02:00
1e5544b39b metal : enable shader debugging (cmake option) (llama/4705)
* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (llama/4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

ggml-ci
2024-01-03 14:43:51 +02:00
d5673af79f ggml : add ggml_vdotq_s32 alias (llama/4715)
ggml-ci
2024-01-03 14:43:51 +02:00
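Editor's note on the ggml_vdotq_s32 alias above: the wrapper lets the quant kernels use one name on ARM CPUs with and without the dotprod extension. A sketch of such a wrapper (my reconstruction of the idea, not the verbatim ggml helper):

    #include <arm_neon.h>

    static inline int32x4_t vdotq_s32_compat(int32x4_t acc, int8x16_t a, int8x16_t b) {
    #if defined(__ARM_FEATURE_DOTPROD)
        return vdotq_s32(acc, a, b);            // single sdot instruction
    #else
        // Fallback: widen to 16-bit products, then pairwise-add into 32-bit lanes.
        const int16x8_t p0 = vmull_s8(vget_low_s8 (a), vget_low_s8 (b));
        const int16x8_t p1 = vmull_s8(vget_high_s8(a), vget_high_s8(b));
        return vaddq_s32(acc, vaddq_s32(vpaddlq_s16(p0), vpaddlq_s16(p1)));
    #endif
    }
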
a28dacec65 CUDA: fixed tensor cores not being used on RDNA3 (llama/4697) 2024-01-03 14:43:51 +02:00
dbe29d4e33 ggml : add ggml_cpu_has_avx_vnni() (llama/4589)
* feat: add avx_vnni based on intel documents

* ggml: add avx vnni based on intel document

* llama: add avx vnni information display

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* Update ggml.c

Fix indentation update

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03 14:43:51 +02:00
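Editor's note on the ggml_cpu_has_avx_vnni() addition above: the ggml_cpu_has_* helpers are normally a compile-time check on the corresponding feature macro. A sketch following that pattern (assumed to mirror the real helper, not quoted verbatim):

    // GCC/Clang define __AVXVNNI__ when compiling with AVX-VNNI enabled.
    int ggml_cpu_has_avx_vnni(void) {
    #if defined(__AVXVNNI__)
        return 1;
    #else
        return 0;
    #endif
    }
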
fe3a67c546 CUDA: fix tensor core logic for Pascal and HIP (llama/4682) 2024-01-03 14:43:51 +02:00
b138ff2be3 cuda: fix vmm oom issue on NVIDIA AGX Orin (llama/4687)
Signed-off-by: hydai <hydai@secondstate.io>
2024-01-03 14:43:51 +02:00
cf6f1e4181 ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (ggml/639)
* add more int ops

* ggml_compute_forward_dup_bytes

* add tests

* PR comments

* tests : minor indentations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03 14:43:51 +02:00
620a223814 scripts : fix sync order + metal sed 2024-01-03 14:43:51 +02:00
f39f9690ec examples : fix WASM Stack Overflow (#1713)
Fix for problem:

"""
RuntimeError: Aborted(Stack overflow! Stack cookie has been overwritten at 0x12be2b10, expected hex dwords 0x89BACDFE and 0x2135467, but received 0x00000000 0x00000000)
"""

That appears when executing the WASM example with the newer versions.
2024-01-02 16:50:04 +00:00
f9ca90256b docker : fix the publishing of the CUDA Docker image (#1704) 2023-12-30 23:12:31 +02:00
2623640cd6 scripts : do not sync commits from this repo 2023-12-29 15:03:08 +02:00
d87de61ae6 ci : build with CLBlast + ggml-opencl use GGML_API (#1576)
* Build with CLBlast

* Declare GGML_API

After rebasing, examples/talk-llama failed:

"D:\a\whisper.cpp\whisper.cpp\build\ALL_BUILD.vcxproj" (build target) (1) ->
"D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj" (default target) (14) ->
(Link target) ->
  llama.obj : error LNK2019: unresolved external symbol ggml_cl_free_data referenced in function "public: __cdecl llama_model::~llama_model(void)" (??1llama_model@@QEAA@XZ) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
  llama.obj : error LNK2019: unresolved external symbol ggml_cl_transform_tensor referenced in function "public: void __cdecl llama_model_loader::load_all_data(struct ggml_context *,void (__cdecl*)(float,void *),void *,struct llama_mlock *)" (?load_all_data@llama_model_loader@@QEAAXPEAUggml_context@@P6AXMPEAX@Z1PEAUllama_mlock@@@Z) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
  D:\a\whisper.cpp\whisper.cpp\build\bin\Release\talk-llama.exe : fatal error LNK1120: 2 unresolved externals [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
2023-12-29 12:23:27 +02:00
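Editor's note on the "Declare GGML_API" fix above: the unresolved ggml_cl_* symbols suggest the OpenCL entry points needed the export/import macro so the linker can resolve them from another target. A sketch of what declaring them with GGML_API looks like (the macro definition follows the usual ggml pattern; treat the exact declarations as illustrative):

    #ifdef GGML_SHARED
    #    if defined(_WIN32) && !defined(__MINGW32__)
    #        ifdef GGML_BUILD
    #            define GGML_API __declspec(dllexport)
    #        else
    #            define GGML_API __declspec(dllimport)
    #        endif
    #    else
    #        define GGML_API __attribute__ ((visibility ("default")))
    #    endif
    #else
    #    define GGML_API
    #endif

    struct ggml_tensor;

    GGML_API void ggml_cl_transform_tensor(void * data, struct ggml_tensor * tensor);
    GGML_API void ggml_cl_free_data(const struct ggml_tensor * tensor);
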
f5f485f899 whisper : replace tensor->n_dims with ggml_n_dims(tensor) (#1694) 2023-12-29 11:38:35 +02:00
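Editor's note on the n_dims replacement above: the field was removed from ggml_tensor, so the dimensionality is derived from the ne[] array instead. A sketch of what a ggml_n_dims-style helper computes (my reconstruction, not a verbatim copy):

    #include "ggml.h"   // struct ggml_tensor, GGML_MAX_DIMS

    // Highest dimension whose extent is > 1; a plain vector reports 1.
    static int n_dims_of(const struct ggml_tensor * t) {
        for (int i = GGML_MAX_DIMS - 1; i >= 1; --i) {
            if (t->ne[i] > 1) {
                return i + 1;
            }
        }
        return 1;
    }
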
e77b27c331 sync : ggml (VMM, sync-ggml-am, dotprod ARM fixes, CUDA fixes) (#1691)
* scripts : add sync-ggml-am.sh

* sync : ggml (VMM, ARM dot prod fix, etc.)

* build : fix CUDA build

* ggml : fix some mul mat cases + add tests for src1 F16

dbd02958fa
2023-12-29 11:30:47 +02:00
a5cc3dc8a2 download : fix large q5 model name (#1695)
fixed typo in large-v3-q5-0 model name to match HF link
2023-12-29 11:14:32 +02:00
37a709f655 whisper : Replace WHISPER_PRINT_DEBUG with WHISPER_LOG_DEBUG (#1681) 2023-12-23 12:02:58 +00:00
3a5302108d sync : ggml (ggml_scale, ggml_row_size, etc.) (#1677)
* sync : ggml

* sync : llama.cpp

* talk-llama : fix obsolete param

* ggml-alloc : fix ggml_tallocr_is_own

* talk.wasm : update to new ggml

* ggml : fix type punning in ggml_scale

* ggml : cuda jetson + arm quants warnings
2023-12-22 17:53:39 +02:00
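
Editor's note on one item in the sync above, "ggml : fix type punning in ggml_scale": the usual fix for that class of problem is reading the float out of the op params with memcpy rather than through a cast pointer. A sketch of the pattern (illustrative, assuming the scale is stored in an int32 op_params array as in ggml):

    #include <stdint.h>
    #include <string.h>

    // Read a float stored in an int32 op-params array without violating
    // strict aliasing: memcpy instead of a *(const float *) cast.
    static float read_scale(const int32_t * op_params) {
        float v;
        memcpy(&v, op_params, sizeof(float));
        return v;
    }
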
d2ee117a0a docker : Dockerize whisper.cpp (#1674)
* build: add dockerfile for ci

* ci: add action to build/push docker image

* fix: lowercase repository to fix ci

* ci: update cuBLAS flag

* build: install curl and ffmpeg in the image

* docs: add docker section

* fix: improve args check when downloading the model
2023-12-22 11:16:02 +00:00
db8ccdb850 CI : Add coverage for talk-llama when WHISPER_CUBLAS=1 (#1672) 2023-12-21 22:39:46 +00:00
d2419030b0 examples : Revert CMakeLists.txt for talk-llama (#1669) 2023-12-21 22:48:52 +02:00