Commit Graph

2328 Commits

Author SHA1 Message Date
ec166499d8 whisper : improve beam search candidate diversity (#1947)
As of #1486, whisper.cpp uses a unified KV cache with KQ masking.
As a result, depending on their location in the batch,
identical sequences in a batch can have slightly different outputs
due to floating point rounding errors during reduction.
See the discussion in #1941 for more details.

The beam search code used "has identical sum of log probabilities"
as a shorthand for "is an identical token sequence". However, per above,
identical tokens do not necessarily result in identical probabilities.

Instead, explicitly compare the sequences.
This is linear in cost when they are identical,
but the lengths are always small and the comparisons are cheap.

This increases diversity during beam search.

This improves output quality for some short samples I've been working
with, at no detectable performance cost.
I haven't checked against larger corpuses.

Fixes #1941
2024-03-10 16:54:43 +02:00
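A minimal sketch of the sequence comparison described above, assuming beam candidates carry their decoded token ids in a std::vector (the actual whisper.cpp structs differ):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical candidate type; whisper.cpp's real beam/decoder structs differ.
struct beam_candidate {
    std::vector<int32_t> tokens;        // decoded token ids so far
    double               sum_logprobs;  // cumulative log probability
};

// Compare the actual token sequences instead of their log-probability sums,
// which can differ slightly for identical sequences due to FP rounding.
static bool same_sequence(const beam_candidate & a, const beam_candidate & b) {
    // linear in the sequence length, but beams are short and the compare is cheap
    return a.tokens == b.tokens;
}
```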
ccf022f970 bindings/go : add linker flags to make metal work (#1944)
The first two are required to build.
The last one is to make it actually detect the GPU.

Fixes #1899, at least for me
2024-03-09 18:50:44 +02:00
2852e1af55 whisper : make beam candidate sort more stable (#1943)
All else being equal, this encourages the beam candidate
selection to re-use the same decoder, which slightly
reduces the cache size.

I wouldn't expect it to make much of a performance difference,
but it helps when debug printing the cache and beam.

Added as part of understanding #1941.
2024-03-09 18:50:03 +02:00
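A sketch of the tie-breaking idea, with assumed field names rather than whisper.cpp's actual beam structs: when scores are equal, prefer candidates from the same decoder so its cache can be re-used.

```cpp
#include <algorithm>
#include <vector>

struct beam_candidate {
    int    decoder_idx;   // which decoder produced this candidate (assumed field)
    double sum_logprobs;  // cumulative log probability
};

static void sort_candidates(std::vector<beam_candidate> & cands) {
    std::sort(cands.begin(), cands.end(),
        [](const beam_candidate & a, const beam_candidate & b) {
            // best score first; on ties, keep candidates of the same decoder together
            if (a.sum_logprobs != b.sum_logprobs) {
                return a.sum_logprobs > b.sum_logprobs;
            }
            return a.decoder_idx < b.decoder_idx;
        });
}
```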
ce945b50c3 ggml : try fix 32-bit arm compat (#1938)
* ggml : try fix 32-bit arm compat

* ggml : fix cont
2024-03-08 23:45:07 +02:00
2f5a5a66dd talk-llama : use llama_decode instead of llama_eval 2024-03-08 12:04:43 +02:00
8e409d1113 talk-llama : sync llama.cpp 2024-03-08 11:55:50 +02:00
05d1b61af4 talk-llama : sync llama.cpp 2024-03-08 11:52:47 +02:00
647cae178a sync : ggml 2024-03-08 11:39:34 +02:00
bae7c23fbf Revert "[SYCL] fix error when set main gpu to non-zero (llama/5901)" (llama/5918)
This reverts commit ceca1aef0738b57951cd12c603c3477e75312dec.
2024-03-08 11:38:33 +02:00
18ea187d42 fix error when set main gpu to non-zero (llama/5901)
* fix error when set main gpu to non-zero

* fix delete condition
2024-03-08 11:38:33 +02:00
1daeffca54 ggml : use SYS_get_cpu if SYS_getcpu is not defined (llama/5906)
Fixes #5694
Fixes ggerganov/whisper.cpp#1894
2024-03-08 11:38:33 +02:00
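The fix amounts to a preprocessor fallback along these lines (a sketch; the syscall is Linux-specific and the surrounding CPU/NUMA detection code in ggml is omitted):

```cpp
#include <unistd.h>
#include <sys/syscall.h>

// Some older glibc headers only define SYS_get_cpu; alias it so the code
// still compiles where SYS_getcpu is missing.
#if !defined(SYS_getcpu) && defined(SYS_get_cpu)
#define SYS_getcpu SYS_get_cpu
#endif

static int current_cpu(void) {
    unsigned int cpu  = 0;
    unsigned int node = 0;
    if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0) {
        return -1;
    }
    return (int) cpu;
}
```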
2f6f1d4465 ggml : use uint8x16_t return type for ggml_vqtbl1q_u8 (llama/5894)
* use uint8x16_t

* Update ggml-quants.c
2024-03-08 11:38:33 +02:00
7ff1894c34 add wait() to make code stable (llama/5895) 2024-03-08 11:38:33 +02:00
8edfc54c2b quants : use MM256_SET_M128I consistently to fix gcc 7 build (llama/5889) 2024-03-08 11:38:33 +02:00
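gcc 7 lacks the _mm256_set_m128i intrinsic, so ggml carries a portable macro; the definition below is the usual formulation and is given as a sketch rather than a verbatim quote of ggml-quants.c.

```cpp
#include <immintrin.h>

// Build a 256-bit vector from two 128-bit halves without _mm256_set_m128i,
// which is missing from gcc 7: put `b` in the low lane, insert `a` as the high lane.
#define MM256_SET_M128I(a, b) _mm256_insertf128_si256(_mm256_castsi128_si256(b), (a), 1)
```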
9c399689ec Vulkan Improvements (llama/5835)
* Improve dequant shaders, add fast q4_0 dequant

* Optimize dmmv non-kquants for GCN

Remove unnecessary SPIR-V shader duplication

* Fix q4_0 dequant dispatch sizes

Fix backend free bug

* Optimize dequant shaders for q4_1, q5_0, q5_1 and q8_0

* Add unary and binary op shader templates

* Fix Vulkan check results

* Enable non-contiguous support for simple ops

* Add argsort

Basic q4_0 mmq shader and unit test

* Speed up q4_0 dequant code, enable mmq for q4_0

* Rework matmul pipeline selection

* Add soft_max alibi support

* Add q4_1, q5_0, q5_1 and q8_0 dequant mat mat mul shaders

* Add environment variable GGML_VK_FORCE_MAX_ALLOCATION_SIZE to limit max buffer size

Rename GGML_VULKAN_DISABLE_F16 to GGML_VK_DISABLE_F16 for consistency
2024-03-08 11:38:33 +02:00
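The environment variable mentioned above is presumably consumed along these lines; this is an illustrative sketch of the pattern, not the Vulkan backend's actual parsing code.

```cpp
#include <cstdint>
#include <cstdlib>

// Clamp the backend's maximum buffer allocation to an optional byte limit
// taken from the environment (sketch only; semantics assumed from the name).
static uint64_t max_allocation_size(uint64_t device_limit) {
    if (const char * env = std::getenv("GGML_VK_FORCE_MAX_ALLOCATION_SIZE")) {
        const uint64_t forced = std::strtoull(env, nullptr, 10);
        if (forced > 0 && forced < device_limit) {
            return forced;
        }
    }
    return device_limit;
}
```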
9d9a405cfd fix mul_mat fault in CI/unit-test (llama/5862)
* fix mul_mat fault in cpy_f32_f16

* rm unused function

* add wait() for memcpy

* restore ci/run.sh, rename struct definition, fix bug in ggml_sycl_op_mul_mat_sycl

* fix format issue

* llama : fix segfault from unknown model arch name (llama/5820)

* llama : fix segfault from unknown model arch name

* llama : make all LLM maps const

This also requires using `std::map::at` instead of its `operator[]`,
which does not exist for const maps (see the sketch after this commit entry).

* llama : name LLM_ARCH_UNKNOWN to "(unknown)"

This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284

* llama : remove redundant inner const for LLM_TENSOR_NAMES

The extra const won't do anything here as const maps
return const references to values.

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : remove redundant nullptr check in llm_arch_from_string

Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : refactor internal quantization functions (llama/5830)

* scripts : add pod-llama.sh

* ggml : IQ3_S improvements (llama/5829)

* iq3_s: somewhat faster AVX2 dot product

On a Ryzen 7950X, TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.

* iq3_s: somewhat faster ARM_NEON dot product

Still dog slow - 10.7 t/s up from 9.9 t/s.

* iq3_s: another small ARM_NEON improvement

10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.

* iq3_s: minor improvement on Metal

49.4 t/s -> 50.3 t/s

* iq3_s: PPL improvement

E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.

* iq3_s: use new grid everywhere

* Fix ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* convert-hf : make model class definitions self-contained (llama/5825)

* convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (llama/5821)

* ggml : fix IQ3_S AVX implementation (llama/5834)

ggml-ci

* llama : add abort_callback to interrupt computation (llama/5409)

* using abort_callback from ggml to stop llama computation

* format fix

* a brief explaining comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: tests: passkey challenge /  self-extend with context shift demo (llama/5832)

* server: tests: add models endpoint scenario

* server: /v1/models add some metadata

* server: tests: add debug field in context before scenario

* server: tests: download model from HF, add batch size

* server: tests: add passkey test

* server: tests: add group attention params

* server: do not truncate prompt tokens if self-extend through group attention is enabled

* server: logs: do not truncate log values

* server: tests - passkey - first good working value of nga

* server: tests: fix server timeout

* server: tests: fix passkey, add doc, fix regex content matching, fix timeout

* server: tests: fix regex content matching

* server: tests: schedule slow tests on master

* server: metrics: fix when no prompt processed

* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1

* server: tests: increase timeout for completion

* server: tests: keep only the PHI-2 test

* server: tests: passkey add a negative test

* flake.lock: Update (llama/5842)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
  → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* server : init http requests thread pool with --parallel if set (llama/5836)

* ci : schedule slow server tests only on Release or on demand (llama/5839)

* llama : fix llama_copy_state_data with fragmented KV cache (llama/5840)

The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.

Existing session files should still work.

* llama : fix llama_kv_cache_cell_max inability to return 1

I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.

* llama : fix state size calculation

Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.

* gguf-dump : support i-quants (llama/5841)

Co-authored-by: Black_Fox <radekliska@gmail.com>

* llama : allow for user specified embedding pooling type (llama/5849)

* allow for user specified pooling type

* llama : use enum types over int

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* readme : add API changes section

* cuda : fix data race in soft max (llama/5853)

* main : support special tokens as reverse/anti prompt (llama/5847)

* Support special tokens as reverse/anti prompt.

* Tokenize antiprompts only once.

* main : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* common : use LLAMA_DEFAULT_SEED (llama/5855)

* add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)

* cuda: fix group_norm

* cuda: add batch inference support for ggml_pad/ggml_upscale

* add ggml_arange

* add ggml_timestep_embedding

* update ggml_arange/ggml_timestep_embedding tests

* cuda: fix im2col

* add ggml_arange/ggml_timestep_embedding support for metal backend

* fix some bugs

* fix some bugs

* Update ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* modify according to the review comments

* ggml : fix compile warnings + code style

* ggml : normalize compute_forward calls + fix seg fault in debug

* minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>

* sync : ggml

* add alias for chat template (llama/5858)

* speculative : implement stochastic speculative sampling (llama/5625)

* (WIP) Implement stochastic speculative decoding

* sample from residual distribution on draft accept failure

* fix #5657: force greedy sampling with probs when temp is 0

* remove p_accept parameter

* fix style

* remove unused variables

* add srand() in speculative.cpp

* replace use of rand() with mt19937 sampling

* fixes based on review (@JohannesGaessler)

* fix r random generation

* randomly select next sequence to verify + fix bug in memory freeing

* fix bug in active_seqs sync

* fix uniform int distribution initialization

* remove warnings from comparison between int and size_t

* check grammar in `llama_sample_probability_distribution_impl`

* remove malloc code by utilizing vectors

* add PR link to README

* cmake : handle cases where git index is not found in .git (llama/5844)

* Update CMakeLists.txt

* Update CMakeLists.txt

* ggml : introduce ggml_status (ggml/750)

* using enum as an exit code instead of macros

* update return type from enum to unsigned int

* indentation fix

* compound update
ggml_compute_exit_code -> ggml_status
changed ggml_status from a bit-field type to simple codes
ggml_status to string cast

* ggml_status to string cast

* GGML_CALL was removed

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* sync : ggml

ggml-ci

* ggml : fix unknown status (llama/0)

* flake : fix

* llama : fix embeddings (llama/5796)

* llama : fix embeddings

ggml-ci

* llama : do not use KV cache for non-causal models

ggml-ci

* embeddings : fix llama_batch_init arg

* llama : add pooling switch

* llama : distinguish token vs sequence embeddings

ggml-ci

* llama : assert pooling tensor

* llama : simplify causal mask condition

ggml-ci

* llama : assert input batch with pooling enabled

* readme : update API changes list

* nix: static build (llama/5814)

* fix speculative decoding build on windows (llama/5874)

* rebase and rm trailing space

---------

Co-authored-by: LiangtaoJin <liang-tao.jin@intel.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>
Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Nindaleth <Nindaleth@users.noreply.github.com>
Co-authored-by: Black_Fox <radekliska@gmail.com>
Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
Co-authored-by: Minsoo Cheong <54794500+mscheong01@users.noreply.github.com>
Co-authored-by: Dane Madsen <dane_madsen@hotmail.com>
Co-authored-by: hutli <6594598+hutli@users.noreply.github.com>
Co-authored-by: Jeffrey Quesnelle <emozilla@nousresearch.com>
2024-03-08 11:38:32 +02:00
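The "make all LLM maps const" item in the sync above rests on a C++ detail worth spelling out: `operator[]` may insert and therefore has no const overload, while `std::map::at` works on const maps and throws for missing keys, which is why the unknown architecture gets an explicit "(unknown)" name. A sketch with illustrative map contents:

```cpp
#include <map>
#include <string>

static const std::map<std::string, std::string> ARCH_NAMES = {
    { "llama",     "llama"     },
    { "(unknown)", "(unknown)" },  // explicit entry so at() never throws for it
};

static std::string arch_name(const std::string & arch) {
    // ARCH_NAMES[arch];        // would not compile: operator[] is non-const
    return ARCH_NAMES.at(arch); // fine on a const map; throws if arch is missing
}
```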
edd8b38a75 ggml : fix unknown status (llama/0) 2024-03-08 11:38:32 +02:00
ed76818700 whisper : fix compute helper return (ggml/750) 2024-03-08 11:38:32 +02:00
9a0b59d990 ggml : introduce ggml_status (ggml/750)
* using enum as an exit code instead of macros

* update return type from enum to unsigned int

* indentation fix

* compound update
ggml_compute_exit_code -> ggml_status
changed ggml_status from a bit-field type to simple codes
ggml_status to string cast

* ggml_status to string cast

* GGML_CALL was removed

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-08 11:38:32 +02:00
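A sketch of the status-code pattern this commit introduces: a plain enum in place of macro exit codes, plus a string cast for diagnostics. The names and values below are assumptions based on the commit message, not a verbatim copy of ggml.h.

```cpp
// Illustrative status enum and string cast (assumed values).
enum ggml_status_sketch {
    STATUS_ALLOC_FAILED = -2,
    STATUS_FAILED       = -1,
    STATUS_SUCCESS      =  0,
    STATUS_ABORTED      =  1,  // e.g. computation stopped via an abort callback
};

static const char * status_to_string(enum ggml_status_sketch status) {
    switch (status) {
        case STATUS_ALLOC_FAILED: return "alloc failed";
        case STATUS_FAILED:       return "failed";
        case STATUS_SUCCESS:      return "success";
        case STATUS_ABORTED:      return "aborted";
    }
    return "unknown status";
}
```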
93a84a143b cuda : fix data race in soft max (llama/5853) 2024-03-08 11:38:32 +02:00
bd26876267 ggml : fix IQ3_S AVX implementation (llama/5834)
ggml-ci
2024-03-08 11:38:32 +02:00
21d295180d ggml : IQ3_S improvements (llama/5829)
* iq3_s: somewhat faster AVX2 dot product

On a Ryzen 7950X, TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.

* iq3_s: somewhat faster ARM_NEON dot product

Still dog slow - 10.7 t/s up from 9.9 t/s.

* iq3_s: another small ARM_NEON improvement

10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.

* iq3_s: minor improvement on Metal

49.4 t/s -> 50.3 t/s

* iq3_s: PPL improvement

E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.

* iq3_s: use new grid everywhere

* Fix ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-08 11:38:32 +02:00
c3bfc9bfda Support multiple GPUs (split mode) on SYCL backend (llama/5806)
* support multiple cards: split-mode - layer|row

* rm warning

* rebase with master, support two new OPs, close feature for -sm=row, fix for unit test

* update news

* fix merge error

* update according to review comments
2024-03-08 11:38:32 +02:00
422a6b16fc ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (llama/5813) 2024-03-08 11:38:32 +02:00
11dd0d4482 Use batched mul_mat pathway (llama/5591)
* Use batched mul_mat pathway

* rm extra line

* Explicitly state scaled data type

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-03-08 11:38:31 +02:00
Eve
26dd2f06ac make portability_enumeration_ext apple only (llama/5757) 2024-03-08 11:38:31 +02:00
8cee7c08b6 add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)
* cuda: fix group_norm

* cuda: add batch inference support for ggml_pad/ggml_upscale

* add ggml_arange

* add ggml_timestep_embedding

* update ggml_arange/ggml_timestep_embedding tests

* cuda: fix im2col

* add ggml_arange/ggml_timestep_embedding support for metal backend

* fix some bugs

* fix some bugs

* Update ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* modify according to the review comments

* ggml : fix compile warnings + code style

* ggml : normalize compute_forward calls + fix seg fault in debug

* minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-03-08 11:38:31 +02:00
2e2626b167 examples : Auto lowercase language parameter in main.cpp (#1928)
* Auto lowercase language parameter

* Update examples/main/main.cpp

Co-authored-by: bobqianic <129547291+bobqianic@users.noreply.github.com>

---------

Co-authored-by: bobqianic <129547291+bobqianic@users.noreply.github.com>
2024-03-06 22:25:10 +00:00
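A minimal sketch of the lowercasing step, assuming the language argument arrives as a std::string; main.cpp's actual parameter handling may differ.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Normalize a user-supplied language code such as "EN" or "En" to "en".
static std::string lowercase_lang(std::string lang) {
    std::transform(lang.begin(), lang.end(), lang.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return lang;
}
```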
c0c0ae2dea examples : fix typo in bench.cpp (#1933) 2024-03-06 22:21:44 +00:00
897412b5b6 whisper : fix typo (#1925) 2024-03-05 17:06:31 +02:00
f22d27a385 whisper.android.java : fix returns in JNI (#1929) 2024-03-05 15:59:26 +02:00
ccd7c1d2da cmake : add library versioning (#1352)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-04 21:17:48 +02:00
c713eb5e2a readme : recommend MacOS Sonoma for Core ML (#1917) 2024-03-04 21:16:13 +02:00
25d313b38b talk-llama : sync llama.cpp 2024-02-28 13:04:05 +02:00
3168dbf23b sync : ggml 2024-02-28 13:01:33 +02:00
1711bb3881 sync : llama.cpp (ggml/0) 2024-02-28 13:00:30 +02:00
2533305596 ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (llama/5760)
* WIP: make i-quants work for QK_K = 64

* iq2_xs: attempt to fix AVX dot product for QK_K = 64

Tests pass, but I get gibberish.

* QK_K = 64 tests pass on ARM_NEON and Metal

Sadly, that does not mean it actually works.

* Make CUDA compile with QK_K = 64

Tests don't pass, plus we get misaligned access

* Q2_K: fixed bug in imatrix quantization for QK_K = 64

* iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-28 13:00:30 +02:00
0eca512ac8 Attempt to fix android build (llama/5752)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-28 13:00:30 +02:00
013e394a4b IQ4_XS: a 4.25 bpw quantization (llama/5747)
* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-28 13:00:29 +02:00
d83f371b5f cuda : replace remaining shfl_xor with calls to warp_reduce functions (llama/5744) 2024-02-28 13:00:29 +02:00
1c71816eab ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (llama/5742) 2024-02-28 13:00:29 +02:00
7b1d8ea7e0 Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (llama/5721)
* Adding IQ2_S and IQ2_M as a single cumulative commit

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-28 13:00:29 +02:00
b1f7223a0a CUDA: fix DEBUG_CUDA_MALLOC (llama/5729) 2024-02-28 13:00:29 +02:00
8408a4be8e Add support for soft_max ALiBi (llama/5639)
* Add support for bias

* Update pre-processor

* rm commented code

* fix format

* fix CI

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-28 13:00:29 +02:00
72849c24ba ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (llama/5711)
* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility

vqtbl1q_u8 is not part of the ARMv7 NEON library

* [android-example] Remove abi filter after arm v7a fix

* [github-workflows] Do not skip Android armeabi-v7a build
2024-02-28 13:00:28 +02:00
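A plausible 32-bit ARM fallback for the missing vqtbl1q_u8: split the 16-byte table into the two halves vtbl2_u8 expects, which also yields zero for out-of-range indices, matching the AArch64 intrinsic. This is a sketch of the compatibility helper, not necessarily ggml-quants.c verbatim.

```cpp
#include <arm_neon.h>

static inline uint8x16_t vqtbl1q_u8_compat(uint8x16_t table, uint8x16_t idx) {
    // vtbl2_u8 looks up a 16-byte table given as two 8-byte halves and
    // returns 0 for indices >= 16, just like vqtbl1q_u8 on AArch64.
    uint8x8x2_t halves = { { vget_low_u8(table), vget_high_u8(table) } };
    return vcombine_u8(vtbl2_u8(halves, vget_low_u8(idx)),
                       vtbl2_u8(halves, vget_high_u8(idx)));
}
```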
c19c28be71 add google magika inference example (ggml/748)
* add magika inference example

* ggml : fix unaligned accesses in custom ops

* ggml : fix FP32 GELU for values that exceed the FP16 range

* use ggml_pool_1d

* add README

* Update README.md

* pad inputs if the files are too small

* cleanup

ggml-ci
2024-02-28 13:00:28 +02:00
0d8fd8483a stream.wasm : fix invalid memory access when no segments (#1902)
No segments may be returned when a smaller sample buffer (e.g. 2048 samples) is sent to the worker.
2024-02-26 10:12:35 +02:00
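The guard the fix implies looks roughly like this; whisper_full_n_segments and whisper_full_get_segment_text are the public whisper.h calls, while the surrounding worker logic is assumed.

```cpp
#include <cstdio>

#include "whisper.h"

// Only read segment text when the run actually produced segments;
// n_segments can legitimately be 0 for short sample buffers.
static void print_segments(struct whisper_context * ctx) {
    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i) {
        printf("%s", whisper_full_get_segment_text(ctx, i));
    }
}
```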
3170841ed9 talk-llama : sync llama.cpp 2024-02-25 20:00:10 +02:00
7a6e385c1b sync : ggml 2024-02-25 19:59:34 +02:00
578e47e70c sync : llama.cpp (ggml/0) 2024-02-25 19:58:46 +02:00