Commit Graph

935 Commits

Author SHA1 Message Date
Georgi Gerganov
8fb5c6a409
backend : add eval callback (llama/4935)
* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when user don't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* simple : no need for ggml_is_contiguous + fix bool parse

* llama : fix callback placement in llama_context_params

* backend : avoid double-ask callback calls

* simple : restore examples, imatrix will serve as a demo
2024-01-17 21:21:10 +02:00
Georgi Gerganov
2fe5fbfcc2
metal : create autorelease pool during library build (llama/4970)
* metal : create autorelease pool during library build

ggml-ci

* test : simplify

ggml-ci
2024-01-17 21:21:10 +02:00
Kawrakow
01637e1a4c
ggml : importance matrix support for legacy quants (llama/4969)
* imatrix: adding support for legacy quants

* imatrix: guard Q4_0/Q5_0 against ffn_down craziness

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-17 21:21:10 +02:00
Alex Azarov
1b349eb1f9
metal : log recommendedMaxWorkingSetSize on iOS 16+ (llama/4936)
* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+

* Only log on iOS and macOS, ignoring tvOS and other platforms

* Check for Xcode version before using recommendedMaxWorkingSetSize

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-17 21:21:10 +02:00
Justine Tunney
138eaebead
ggml : introduce GGML_CALL function annotation (llama/4850)
This change makes it possible to build ggml-cuda.cu and ggml-metal.m as
independent dynamic shared objects, that may be conditionally linked at
runtime in a multiplatform binary. It introduces a GGML_CALL annotation
that documents which functions have a cyclic call relationship, between
the application code and GPU modules.

This change does nothing, unless the build defines -DGGML_MULTIPLATFORM
which causes back-references and function pointers to conform to MS ABI
which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms
2024-01-17 21:21:09 +02:00
Georgi Gerganov
61b9192f27
cuda : fix dequantize kernel names (llama/4938) 2024-01-17 21:21:09 +02:00
Kawrakow
161b51d91a
CUDA: faster dequantize kernels for Q4_0 and Q4_1 (llama/4938)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-17 21:21:09 +02:00
Kawrakow
f904b31a7d
Add ability to use importance matrix for all k-quants (llama/4930)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-17 21:21:09 +02:00
Benjamin Heiniger
f6614155e4
talk-llama : optional wake-up command and audio confirmation (#1765)
* talk-llama: add optional wake-word detection from command

* talk-llama: add optional audio confirmation before generating answer

* talk-llama: fix small formatting issue in output

* talk-llama.cpp: fix Windows build
2024-01-16 15:52:01 +02:00
Przemysław Pawełczyk
f5f159c320
server : fix building and simplify lib deps on Windows (#1772)
* make : fix server example building on MSYS2 environments (Windows)

It was not working since commit eff3570f78
when server was introduced.

* cmake : simplify server example lib deps on Windows

server uses httplib::Server, not httplib::SSLServer, so there is no need
to mention cryptographic libraries in target_link_libraries.
Winsock (ws2_32) suffices here.

Also use plain library names like we use in other places.
2024-01-15 15:48:13 +02:00
Georgi Gerganov
6ebba525f1
talk-llama : sync llama.cpp 2024-01-14 18:08:20 +02:00
Georgi Gerganov
2a5874441d
talk-llama : llama.cpp 2024-01-14 11:06:28 +02:00
Georgi Gerganov
d08445c9ad
sync : ggml 2024-01-14 10:55:18 +02:00
Alex Azarov
4a945696cb
metal : correctly set SIMD support flags on iOS (llama/4923)
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad

* log a little bit more info on iOS
2024-01-14 10:54:09 +02:00
Kawrakow
dabc964d83
2-bit quantizations (llama/4897)
* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 10:54:09 +02:00
Georgi Gerganov
654baf693d
scripts : sync-ggml-am.sh add option to skip commits 2024-01-14 10:53:19 +02:00
Georgi Gerganov
f001a3b7b6
talk-llama : sync llama.cpp 2024-01-14 00:13:17 +02:00
Georgi Gerganov
c615f2c335
sync : ggml 2024-01-14 00:12:17 +02:00
Georgi Gerganov
d839dd0242
examples : adapt to metal API 2024-01-14 00:11:45 +02:00
Johannes Gäßler
435847891c
ggml: cache sin/cos for RoPE (llama/4908) 2024-01-14 00:11:45 +02:00
Georgi Gerganov
182f290808
metal : remove old API (llama/4919)
ggml-ci
2024-01-14 00:11:45 +02:00
Georgi Gerganov
447dfc11fc
metal : disable log for loaded kernels (llama/4794) 2024-01-14 00:11:45 +02:00
texmex76
9aa9f3b84e
gguf : fix potential infinite for-loop (llama/4600)
Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>
2024-01-14 00:11:44 +02:00
Georgi Gerganov
396ebd1e80
metal : refactor kernel loading code (llama/4794)
* metal : detect more GPU families

* metal : refactor kernel loading

* metal : set kernel family requirements

* metal : fix kernel init + fix compile options

* metal : take into account simdgroup reduction support

* metal : print only skipped kernels

* metal : fix check for simdgroup reduction support

* metal : check for Metal 3

* metal : free allocations

* metal : normalize encoder:setComputePipelineStatus calls

ggml-ci

* metal : fix Metal3 family check

ggml-ci

* metal : check for simdgroup matrix mul. feature

ggml-ci
2024-01-14 00:11:44 +02:00
Johannes Gäßler
12490f4398
CUDA: faster q8_0 -> f16 dequantization (llama/4895) 2024-01-14 00:11:44 +02:00
RhinoDevel
db078a9ba8
talk-llama : add optional CLI arg to set the bot name (#1764) 2024-01-13 20:51:35 +02:00
james wolf
a13a7da5ad
examples : add python example for transcription (#1744)
* rebase and add simple python interface

* moved python files to examples/python
2024-01-13 19:37:18 +02:00
Georgi Gerganov
519f8e8684
whisper : load the model into multiple buffers of max size 1GB (#1763) 2024-01-13 17:47:40 +02:00
Georgi Gerganov
40ae0962f4
talk-llama : sync llama.cpp 2024-01-12 22:04:51 +02:00
Georgi Gerganov
1560288048
sync : ggml 2024-01-12 21:56:50 +02:00
slaren
1ad6fafd91
backend_sched : fix assignments
ggml-ci
2024-01-12 21:55:42 +02:00
slaren
70840aed5f
llama : ggml-backend integration (llama/4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (llama/4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (llama/4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 21:55:42 +02:00
Johannes Gäßler
b24d18feb9
CUDA: fix softmax compile for old CUDA versions (llama/4862) 2024-01-12 21:55:41 +02:00
Kawrakow
3fa98f4395
Importance Matrix calculation (llama/4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 21:55:41 +02:00
Sơn Phan Trung
d05b7ee90e
models : make all scripts to be POSIX Compliant (#1725)
* download-coreml-model: make it POSIX-compliant

* download-ggml-model: posix compliant (2nd)

* minor edit

* forgot to add newline

* generate-coreml-interface: far more straightforward

* generate-coreml-model: done with the posix thingy

* typo

* Update download-ggml-model.sh

* fix

* fix typo

* another fix

* Update download-coreml-model.sh

* Update download-ggml-model.sh

* Update download-coreml-model.sh
2024-01-12 14:11:04 +02:00
Georgi Gerganov
6dcee35129
ggml : fix 32-bit ARM compat for IQ2_XS (#1758)
* ggml : fix 32-bit ARM compat

* ggml : fix fix

* ggml : fix fix fix
2024-01-12 14:02:30 +02:00
Boris Bliznioukov
5cb345f5e9
go : add SetInitialPrompt method to bindings (#1753) 2024-01-12 13:44:50 +02:00
George Hindle
fbcb52d3cd
server : add more parameters to server api (#1754)
* feat(server): add more parameters to server api

* fix(server): reset params to original parsed values for each request
2024-01-12 13:42:52 +02:00
Georgi Gerganov
6b01e3fedd
whisper : fix segment length with params.no_timestamps == true 2024-01-12 13:37:38 +02:00
George Hindle
f7908f9bb8
params : don't compute timestamps when not printing them (#1755) 2024-01-12 13:24:38 +02:00
Georgi Gerganov
00b7a4be02
talk-llama : sync llama.cpp 2024-01-11 22:10:10 +02:00
Georgi Gerganov
04b0a768b8
swift : remove local ggml.h reference 2024-01-11 22:00:12 +02:00
Georgi Gerganov
87670425f2
swift : track ggml release branch 2024-01-11 21:57:40 +02:00
Georgi Gerganov
32e71a1861
sync : ggml 2024-01-11 21:54:17 +02:00
Georgi Gerganov
9c857cf280
sync : llama.cpp 2024-01-11 21:50:01 +02:00
Kawrakow
97b12212dd
ggml : SOTA 2-bit quants (add IQ2_XS) (llama/4856)
* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dit product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:50:01 +02:00
Paul Tsochantaris
9fa34d79ec
metal : put encoder debug group behind a define (llama/4873) 2024-01-11 21:50:01 +02:00
Georgi Gerganov
a0a64a19dd
metal : improve dequantize precision to match CPU (llama/4836)
ggml-ci
2024-01-11 21:50:01 +02:00
Georgi Gerganov
bbc23611fa
ggml : fix vld1q_s8_x4 32-bit compat (llama/4828)
* ggml : fix vld1q_s8_x4 32-bit compat

ggml-ci

* ggml : fix 32-bit ARM compat (cont)

ggml-ci
2024-01-11 21:50:01 +02:00
Johannes Gäßler
e9783a1fb4
CUDA: faster softmax via shared memory + fp16 math (llama/4742) 2024-01-11 21:50:01 +02:00