Commit Graph

1615 Commits

Author SHA1 Message Date
7bd69349bf rpc : add command line arg for specifying backend memory
ref: #7293
2024-06-16 18:19:48 +03:00
488ad99c13 Add support for properly optimized Windows ARM64 builds with LLVM and MSVC (llama/7191)
* logging: add proper checks for clang to avoid errors and warnings with VA_ARGS

* build: add CMake Presets and toolchian files for Windows ARM64

* matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings

* ci: add support for optimized Windows ARM64 builds with MSVC and LLVM

* matmul-int8: fixed typos in q8_0_q8_0 matmuls

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* matmul-int8: remove unnecessary casts in q8_0_q8_0

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
7178cceeaa ggml : use dynamic thread scheduling for matrix multiplication (llama/6915)
* Just reordering some structs.

* Adding in the calls to mm_pause

* Passing around the state

* Renaming and moving a bunch of variables around.

* Extracting the logic to it's own function.

* Moving some variable definitions into the chunk function.

* Moving some variables around

* moving src1_cont inside

* Moving row_size

* adding the current_chunk

* Reorg the code.

* Formatting to match the orig patch

* starting to setup the chunking variables

* Starting the buildup of the loop

* The yield shouldn't be necessary.

* adding the looping structure based on the chunk configuration.

* Add in the re-chunking code.

* Making it much more likely to rechunk.

* disable resizing if numa is enabled.

* Updating comments with what we've learned.

* Fix formatting

* Couple more formatting fixes.

* More style fixes.

* Fix Warnings

* Going with unused because there's conditional logic that needs it.

* Update ggml.c

* Update ggml.c

---------
2024-06-16 18:19:48 +03:00
8d55ccdb8c Avoid unnecessarily disabling CUDA graphs (llama/7302)
As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts.
This fixes the issue by avoiding the consective update counter from incrementing unnecessarily
for tokens in which cuda graphs are disabled due to batch size > 1.
2024-06-16 18:19:48 +03:00
37a72cb170 ggml : tag ggml_tensor::backend as deprecated (llama/7290) 2024-06-16 18:19:48 +03:00
bf9b69284f Add missing " (llama/7303) 2024-06-16 18:19:48 +03:00
c4de1e19df ggml : add ggml_upscale_ext (ggml/814)
* initial commit with CPU implementation of upscale to shape and test, cuda implementation next

* experimental commit to see if dst shape is correct

* test version

* test

* removed unnecessary params

* refactor

* fixed tests

* ggml : metal impl + cleanup + sycl dev warnings

* patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior

* metal : fix upsacle op to support nb00 + style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
5b7073cae1 scripts : update sync 2024-06-16 12:41:42 +03:00
b29b3b2924 whisper : use ggml-cuda in mel calc, set appropriate device (#2236)
* whisper : use ggml-cuda in mel calc, set appropriate device

* whisper : forbid cuda mel calc on devices with compute < 600, workaround for #2230
2024-06-13 13:16:07 +03:00
420b6abc54 cuda : fix HIPBLAS build (#2234) 2024-06-11 19:14:38 +03:00
99804b0f3e cuda : fix bounds check for src0 rows in MMVQ kernel (#2231)
* cuda : fix bounds check for src0 rows in MMVQ kernel

* Update ggml-cuda/mmvq.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-06-11 17:39:01 +03:00
c55964c956 ci : fix CUDA builds (#2232) 2024-06-11 17:21:30 +03:00
20c542c713 whisper : auto-grow working areas for mel_calc_cuda (#2227)
* whisper : auto-grow working areas for mel_calc_cuda, fixes #2226

* whisper : only calculate mel spectrogram on GPU if audio is <= 5 min
2024-06-10 21:51:32 +03:00
c2bdb960cd whisper : free whisper_mel instances (#2220) 2024-06-10 11:00:15 +03:00
87acd6d629 whisper : whisper_state/backend fixes (#2217)
* whisper : fixes

* ci : WHISPER_CUBLAS -> WHISPER_CUDA
2024-06-06 18:51:36 +03:00
f842d31171 whisper : calculate mel spectrogram directly into a ggml_tensor (#2208)
* whisper : calculate mel spectrogram directly into a ggml_tensor

* whisper : remove unused temp buffer from state

* whisper : fix not initializing wstate.embd_enc
2024-06-06 16:20:46 +03:00
ffef323c4c whisper : add CUDA-specific computation mel spectrograms (#2206)
* whisper : use polymorphic class to calculate mel spectrogram

* whisper : add cuda-specific mel spectrogram calculation

* whisper : conditionally compile cufftGetErrorString to avoid warnings

* build : add new files to makefile

* ruby : add new files to conf script

* build : fix typo in makefile

* whisper : suppress cub warning for deprecated C++ std in whisper-mel-cuda
2024-06-04 09:32:23 +03:00
af5833e298 whisper : remove speed_up and phase_vocoder* functions (#2198)
* whisper : fix cast warning

* whisper : remove phase_vocoder functions, ref #2195

* whisper : remove speed_up from whisper_full_params, closes #2195
2024-05-31 11:37:29 +03:00
b87494bb8f readme : add conan badge (#2196)
* Add conan badge

* Fix markdown formating
2024-05-30 15:43:28 +03:00
ad130431aa readme : add install instructions for Conan (#2189) 2024-05-30 15:06:15 +03:00
e130b66642 whisper: use global cache for sin/cos vals and Hann window (#2194)
- also rename Hanning to Hann as it's named after Julius von Hann
 as per Wikipedia
2024-05-29 19:09:21 +03:00
c7b6988678 release : v1.6.2 v1.6.2 2024-05-27 10:35:09 +03:00
05042a782d Revert "whisper : remove extra backend instance (huh?)" (#2182)
This reverts commit 4caa64b73e.
2024-05-27 10:20:25 +03:00
a7dc2aab16 server : fix typo (#2181)
A simple comment typo, PR can be dismissed
2024-05-25 10:46:22 +03:00
22d46b7ba4 ruby : update bindings (#2154)
* update library files

* update whispercpp

* not needed for gem
2024-05-22 23:02:52 +03:00
c10db6ea28 release : v1.6.1 v1.6.1 2024-05-21 18:44:37 +03:00
1b51fdf170 examples : add support for decoding input with ffmpeg (Linux) (#2133)
- search for ffmpeg libs/headers at cmake time
- added ffmpeg-transcode.cpp into libcommon if ffmpeg on
- hooked ffmpeg trancoding in common read_wav(...)
- passed test:
./main -m ggml-base.en.bin -f samples/jfk.mp3
2024-05-21 18:31:41 +03:00
adee3f9c1f node : add flash_attn param (#2170) 2024-05-20 09:08:48 +03:00
4798be1f9a ci: Update build.yml to suppress warnings about node.js versions (#2166)
* Update actions to suppress warnings about old node.js

https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/

* Update actions/upload-artifact, specify android cmdline-tools-version

* Use java 20

gradle 8.1 complains against 21
https://docs.gradle.org/current/userguide/compatibility.html
2024-05-19 11:49:26 +03:00
08981d1bac release : v1.6.0 v1.6.0 2024-05-15 09:59:48 +03:00
7094ea5e75 whisper : use flash attention (#2152)
* whisper : use flash attention in the encoder

* whisper : add kv_pad

* whisper : remove extra backend instance (huh?)

* whisper : use FA for cross-attention

* whisper : use FA for self-attention

* whisper : simplify encoder FA

* whisper : add flash_attn runtime parameter

* scripts : add bench log

* scripts : add M1 Pro bench log
2024-05-15 09:38:19 +03:00
9d5771ae43 talk-llama : reject runs without required arguments (#2153)
* Extended talk-llama example to reject runs without required arguments.

Print warning and exit if models are not specified on the command line.

* Update examples/talk-llama/talk-llama.cpp

* Update examples/talk-llama/talk-llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-14 21:32:41 +03:00
f56b8305c4 sync : ggml 2024-05-14 19:16:32 +03:00
1056ad762c metal : support FA without mask + add asserts (llama/7278)
* ggml : fa without mask + add asserts

ggml-ci

* metal : support non-contiguous KV

ggml-ci
2024-05-14 19:16:29 +03:00
c451080c8b ggml : add RPC backend (llama/6829)
* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
2024-05-14 19:16:29 +03:00
8e7c22fbdb rm wait() (llama/7233) 2024-05-14 19:16:29 +03:00
e57e95eb0d CUDA: add FP32 FlashAttention vector kernel (llama/7188)
* CUDA: add FP32 FlashAttention vector kernel

* fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel
2024-05-14 19:16:29 +03:00
130f43e4b8 scripts : sync ggml-rpc 2024-05-14 19:15:35 +03:00
d8356a1cc2 whisper : fix model path encoding in windows (#2086)
* fix: model path encoding in windows

* fix: convert model path to wide string only for MSVC compiler
2024-05-14 09:43:41 +03:00
4ef8d9f44e server : return utf-8 (#2138) 2024-05-13 15:33:46 +03:00
3928dbd206 node : add audio_ctx and audio buffer params (#2123)
* node : add audio_ctx param

* node : support passing audio buffer directly

* node : parse audio_ctx in index.js

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-13 15:22:23 +03:00
2ced6f0742 cmake : fix HIP/ROCm build (#2102) 2024-05-13 15:18:43 +03:00
30f73109b8 node : add additional params (#2000)
* Add additional params to addon.node

* Add comma_in_time as parameter

* Fix tests
2024-05-13 15:15:43 +03:00
17fa62d3d3 js : remove un-needed request header from fetchRemote (#2119) 2024-05-13 15:13:19 +03:00
1da5edcde0 cmake : fix metal embed sources path (#2110) 2024-05-13 15:09:59 +03:00
0bb05b113d main : dont print timings with --no-prints (#2108)
Signed-off-by: Daniel Ziegenberg <daniel@ziegenberg.at>
2024-05-13 15:00:19 +03:00
f141b2b938 main : add options for temperature control (#2088)
Add two options:

```
-tp,       --temperature N     [0.00   ] The sampling temperature, between 0 and 1
-tpi,      --temperature-inc N [0.20   ] The increment of temperature, between 0 and 1
```

The sampling temperature, between 0 and 1. Higher values like 0.8 will
make the output more random, while lower values like 0.2 will make it
more focused and deterministic. If set to 0, the model will use log
probability to automatically increase the temperature until certain
thresholds are hit.

Signed-off-by: Daniel Ziegenberg <daniel@ziegenberg.at>
2024-05-13 14:59:44 +03:00
2b434c449e whisper : switch back to F32 mask (#0) 2024-05-13 14:43:43 +03:00
e93081f83f whisper.android : update example, add field to print timestamp (#2072) 2024-05-13 14:30:03 +03:00
b6bbce4ae9 cmake : fix json INTERFACE library (#2069) 2024-05-13 14:29:39 +03:00