Commit Graph

1290 Commits

Author SHA1 Message Date
c10db6ea28 release : v1.6.1 v1.6.1 2024-05-21 18:44:37 +03:00
1b51fdf170 examples : add support for decoding input with ffmpeg (Linux) (#2133)
- search for ffmpeg libs/headers at cmake time
- added ffmpeg-transcode.cpp into libcommon if ffmpeg on
- hooked ffmpeg trancoding in common read_wav(...)
- passed test:
./main -m ggml-base.en.bin -f samples/jfk.mp3
2024-05-21 18:31:41 +03:00
adee3f9c1f node : add flash_attn param (#2170) 2024-05-20 09:08:48 +03:00
4798be1f9a ci: Update build.yml to suppress warnings about node.js versions (#2166)
* Update actions to suppress warnings about old node.js

https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/

* Update actions/upload-artifact, specify android cmdline-tools-version

* Use java 20

gradle 8.1 complains against 21
https://docs.gradle.org/current/userguide/compatibility.html
2024-05-19 11:49:26 +03:00
08981d1bac release : v1.6.0 v1.6.0 2024-05-15 09:59:48 +03:00
7094ea5e75 whisper : use flash attention (#2152)
* whisper : use flash attention in the encoder

* whisper : add kv_pad

* whisper : remove extra backend instance (huh?)

* whisper : use FA for cross-attention

* whisper : use FA for self-attention

* whisper : simplify encoder FA

* whisper : add flash_attn runtime parameter

* scripts : add bench log

* scripts : add M1 Pro bench log
2024-05-15 09:38:19 +03:00
9d5771ae43 talk-llama : reject runs without required arguments (#2153)
* Extended talk-llama example to reject runs without required arguments.

Print warning and exit if models are not specified on the command line.

* Update examples/talk-llama/talk-llama.cpp

* Update examples/talk-llama/talk-llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-14 21:32:41 +03:00
f56b8305c4 sync : ggml 2024-05-14 19:16:32 +03:00
1056ad762c metal : support FA without mask + add asserts (llama/7278)
* ggml : fa without mask + add asserts

ggml-ci

* metal : support non-contiguous KV

ggml-ci
2024-05-14 19:16:29 +03:00
c451080c8b ggml : add RPC backend (llama/6829)
* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
2024-05-14 19:16:29 +03:00
8e7c22fbdb rm wait() (llama/7233) 2024-05-14 19:16:29 +03:00
e57e95eb0d CUDA: add FP32 FlashAttention vector kernel (llama/7188)
* CUDA: add FP32 FlashAttention vector kernel

* fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel
2024-05-14 19:16:29 +03:00
130f43e4b8 scripts : sync ggml-rpc 2024-05-14 19:15:35 +03:00
d8356a1cc2 whisper : fix model path encoding in windows (#2086)
* fix: model path encoding in windows

* fix: convert model path to wide string only for MSVC compiler
2024-05-14 09:43:41 +03:00
4ef8d9f44e server : return utf-8 (#2138) 2024-05-13 15:33:46 +03:00
3928dbd206 node : add audio_ctx and audio buffer params (#2123)
* node : add audio_ctx param

* node : support passing audio buffer directly

* node : parse audio_ctx in index.js

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-13 15:22:23 +03:00
2ced6f0742 cmake : fix HIP/ROCm build (#2102) 2024-05-13 15:18:43 +03:00
30f73109b8 node : add additional params (#2000)
* Add additional params to addon.node

* Add comma_in_time as parameter

* Fix tests
2024-05-13 15:15:43 +03:00
17fa62d3d3 js : remove un-needed request header from fetchRemote (#2119) 2024-05-13 15:13:19 +03:00
1da5edcde0 cmake : fix metal embed sources path (#2110) 2024-05-13 15:09:59 +03:00
0bb05b113d main : dont print timings with --no-prints (#2108)
Signed-off-by: Daniel Ziegenberg <daniel@ziegenberg.at>
2024-05-13 15:00:19 +03:00
f141b2b938 main : add options for temperature control (#2088)
Add two options:

```
-tp,       --temperature N     [0.00   ] The sampling temperature, between 0 and 1
-tpi,      --temperature-inc N [0.20   ] The increment of temperature, between 0 and 1
```

The sampling temperature, between 0 and 1. Higher values like 0.8 will
make the output more random, while lower values like 0.2 will make it
more focused and deterministic. If set to 0, the model will use log
probability to automatically increase the temperature until certain
thresholds are hit.

Signed-off-by: Daniel Ziegenberg <daniel@ziegenberg.at>
2024-05-13 14:59:44 +03:00
2b434c449e whisper : switch back to F32 mask (#0) 2024-05-13 14:43:43 +03:00
e93081f83f whisper.android : update example, add field to print timestamp (#2072) 2024-05-13 14:30:03 +03:00
b6bbce4ae9 cmake : fix json INTERFACE library (#2069) 2024-05-13 14:29:39 +03:00
7705dc52da main : fix double quote escaping in csv output (#2090) 2024-05-13 11:55:32 +03:00
e6acaf9d91 metal : tune soft_max number of threads (#0) 2024-05-13 11:02:26 +03:00
2c81e6fd51 whisper : remove old flash attn code (#0) 2024-05-13 11:02:26 +03:00
9506267ce5 ggml : try fix ppc64 (#0) 2024-05-13 11:02:26 +03:00
fbeb80b5f0 ggml : remove oboslete alibi code (skipme) (#0) 2024-05-13 11:02:26 +03:00
3fa7d29876 talk-llama : sync llama.cpp 2024-05-13 11:02:26 +03:00
fe179ae0cc sync : ggml 2024-05-13 11:02:26 +03:00
40aeeeecc4 ggml : optimize for ppc64le using VSX intrinsics (ggml/784)
* optimize for ppc64le using VSX intrinsics

* 1. code clean up by removing comments about overflow concern.

2. fix typo in suffix of scaling.

* Continue to fix typo in suffix of scaling for QK_K <> 256

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-13 11:02:26 +03:00
5a863fbe18 metal : fix indent (ggml/0) 2024-05-13 11:02:26 +03:00
91c646c61d ggml : restore sigmoid decl order (ggml/0) 2024-05-13 11:02:26 +03:00
accada542a ggml : resolve merge (ggml/0)
ggml-ci
2024-05-13 11:02:26 +03:00
e54329da7b ggml : full ALiBi support (llama/7192)
* ggml : full ALiBi support

* ggml : update ggml_soft_max_ext() CUDA, SYCL

* ggml : ggml_flash_attn_ext() support ALiBi (CPU)

* ggml : ggml_flash_attn_ext() support ALiBi (Metal)

* ggml : fix warning

* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)

ggml-ci

* ggml : fix assert message

* vulkan : add dev notes

* ggml : require mask when using ALiBi

ggml-ci

* convert : fix convert for refact models
2024-05-13 11:02:26 +03:00
284fac39fb metal : fix flash attention kernel requirements (llama/7169)
* metal : fix flash attention kernel requirements

ggml-ci

* metal : fix ggml_metal_supports_op

ggml-ci
2024-05-13 11:02:26 +03:00
fe454b8d9e Minor arithmetic improvement to mmvq wrapper kernel (llama/7172) 2024-05-13 11:02:26 +03:00
c114b75aee Vulkan Bugfixes and Improvements (llama/7084)
* Modify mat mat mul shader for mul_mat_id, modify mat vec mul shaders for single call batch operation

* Further work towards MoE, disabled for now

* Disable MoE code (not ready yet), fix a number of bugs in shaders and Vulkan code

* Add softmax with f16 mask and pos buffer support

* Disable mul_mat_id shaders for now

* Fix flake8

* Fix validation errors caused by empty buffers on larger batch sizes
2024-05-13 11:02:26 +03:00
4be936b88b CUDA: generalize FP16 fattn vec kernel (llama/7061)
* CUDA: generalize FP16 fattn vec kernel

* disable unsupported head sizes for AMD in test

* try AMD fix

* fix batch size 2-8

* partially revert changes
2024-05-13 11:02:26 +03:00
26c550f772 opencl : alignment size converted from bits to bytes (llama/7090)
* opencl alignment size should be converted from bits to bytes

Reference: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#CL_DEVICE_MEM_BASE_ADDR_ALIGN

> Alignment requirement (in bits) for sub-buffer offsets.

* Update ggml-opencl.cpp for readability using division instead of shift

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-05-13 11:02:26 +03:00
24f0aa460b Introduction of CUDA Graphs to LLama.cpp (llama/6766)
* DRAFT: Introduction of CUDA Graphs to LLama.cpp

* FIx issues raised in comments

* Tidied to now only use CUDA runtime (not mixed with driver calls)

* disable for multi-gpu and batch size > 1

* Disable CUDA graphs for old GPU arch and with env var

* added missing CUDA_CHECKs

* Addressed comments

* further addressed comments

* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake

* Added more comprehensive graph node checking

* With mechanism to fall back if graph capture fails

* Revert "With mechanism to fall back if graph capture fails"

This reverts commit eb9f15fb6fcb81384f732c4601a5b25c016a5143.

* Fall back if graph capture fails and address other comments

* - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS

- rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS

- updated Makefile build to enable CUDA graphs

- removed graph capture failure checking in ggml_cuda_error
  using a global variable to track this is not thread safe, but I am also not safistied with checking an error by string
  if this is necessary to workaround some issues with graph capture with eg. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context

- fixed several resource leaks

- fixed issue with zero node graphs

- changed fixed size arrays to vectors

- removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed

- removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row

- changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX

- code style fixes

- things to look into
  - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
  - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes

* fix build without cuda graphs

* remove outdated comment

* replace minimum cc value with a constant

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-13 11:02:26 +03:00
69efc39d5c metal : use vm_allocate instead of posix_memalign on macOS (llama/7078)
* fix: use `malloc` instead of `posix_memalign` in `ggml-metal.m` to make it not crash Electron proccesses

* fix: typo

* fix: use `vm_allocate` instead of `posix_memalign`

* fix: don't call `newBufferWithBytesNoCopy` with `NULL` when `ggml_metal_host_malloc` returns `NULL`

* fix: use `vm_allocate` only on macOS
2024-05-13 11:02:26 +03:00
a2ad810118 ggml : introduce bfloat16 support (llama/6412)
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16
which in practice ends up being 99.71% of Mistral 7b v0.2's weights
however there is currently no way other than fp32 to get the others

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that, by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves somewhere around -0.0024 to -0.0046 compared to using fp16

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function
2024-05-13 11:02:26 +03:00
1ae1a9cd56 metal : fix unused warning 2024-05-13 11:02:26 +03:00
b5521fea19 Add an option to build without CUDA VMM (llama/7067)
Add an option to build ggml cuda without CUDA VMM
resolves
https://github.com/ggerganov/llama.cpp/issues/6889
https://forums.developer.nvidia.com/t/potential-nvshmem-allocated-memory-performance-issue/275416/4
2024-05-13 11:02:26 +03:00
9b84195225 gguf-split: add --no-tensor-first-split (llama/7072) 2024-05-13 11:02:26 +03:00
11c1df0436 CUDA: CUDART < 11.7 workaround for __hmax, __hmax2 (llama/7019) 2024-05-13 11:02:26 +03:00
c754494fdd switch to using localizedDescription (llama/7010) 2024-05-13 11:02:26 +03:00