Commit Graph

1615 Commits

SHA1 Message Date
00a0947c65 metal : unify mul_mv_id kernels (llama/6556) 2024-05-13 11:02:26 +03:00
60f3713026 llama : add gguf_remove_key + remove split meta during quantize (llama/6591)
* Remove split metadata when quantize model shards

* Find metadata key by enum

* Correct loop range for gguf_remove_key and code format

* Free kv memory

---------

Co-authored-by: z5269887 <z5269887@unsw.edu.au>
2024-05-13 11:02:26 +03:00
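For context, a hedged sketch of how the new API might be called during quantization. It assumes the `gguf_remove_key(ctx, key)` signature added by llama/6591; the split key names shown are illustrative, not verified against the gguf constants.

```cpp
#include "ggml.h" // the gguf API lived in ggml.h at the time of this commit

// Sketch: drop stale shard-layout metadata before writing a quantized file.
static void remove_split_metadata(struct gguf_context * ctx) {
    gguf_remove_key(ctx, "split.no");            // illustrative key name
    gguf_remove_key(ctx, "split.count");         // illustrative key name
    gguf_remove_key(ctx, "split.tensors.count"); // illustrative key name
}
```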
37e6757453 feat: implemented sigmoid function (ggml/806)
* added sigmoid function

* implemented metal kernel for sigmoid

* implemented cuda kernel for sigmoid

* added sigmoid unary op and incremented count
2024-05-13 11:02:26 +03:00
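The op itself is simple: sigmoid(x) = 1 / (1 + exp(-x)), applied elementwise. A plain CPU reference sketch for orientation; the commit's actual ggml, Metal, and CUDA kernels are not reproduced here.

```cpp
#include <cmath>

// Elementwise sigmoid over a float buffer: dst[i] = 1 / (1 + e^{-src[i]}).
static void sigmoid_f32(const float * src, float * dst, int n) {
    for (int i = 0; i < n; ++i) {
        dst[i] = 1.0f / (1.0f + std::exp(-src[i]));
    }
}
```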
8dcefdf4a9 build: fix and ignore msvc warnings (ggml/805) 2024-05-13 11:02:26 +03:00
73d13ad19a ggml : expose SSE3 and SSSE3 for MSVC when AVX is available (#2128) 2024-05-08 18:33:43 +03:00
b6680fab50 build : improve disabling AVX-512 (#2129)
* cmake : make WHISPER_NO_AVX512=ON disable all subsets of AVX-512

Previously it happened only for MSVC, but it makes sense to have the
same behavior for other compilers too.

* make : reorder x86 ISA extensions in chronological order

And update compiler flags at the end to ease modifying conditions.

* make : support WHISPER_NO_AVX512=1 for disabling all AVX-512 subsets.

That way you do not have to override each AVX-512 subset setting
individually if it has been turned on during autodetection.
2024-05-08 18:32:43 +03:00
f760756078 minor: add CMakeSettings.json to gitignore (#2094) 2024-05-08 11:03:21 +03:00
58210d6a76 examples : fix node compilation (#2115)
* node : fix compilation and update examples

* node : fix readme

* Update addon.node test
2024-05-02 22:52:55 +01:00
8fac6455ff make : change GNU make default CXX from g++ to c++ (#2100) 2024-04-28 22:54:21 +01:00
22b6598cc9 Remove unnecessary memory reallocation in fft (#2080)
fft_out needs to be twice the frame_size, not the frame_step.  It is resized in fft() anyway, but this change prevents an unnecessary reallocation.

n_fft must match the mel filter size, so it is best not to calculate it from the frame size.

We only need to get the magnitudes for half the spectrum since the other half is a mirror and not used in the mel filter loop later.
2024-04-28 18:36:12 +01:00
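The half-spectrum point follows from the conjugate symmetry of a real-input FFT: bins above n_fft/2 mirror the lower bins, so only n_fft/2 + 1 magnitudes are needed for the mel filters. A sketch under the assumption that fft_out stores interleaved (re, im) pairs; the exact layout in whisper.cpp is not restated here.

```cpp
#include <vector>

// Squared magnitudes for the non-redundant half of a real-input FFT.
static void half_spectrum_magnitudes(const std::vector<float> & fft_out,
                                     std::vector<float> & mags, int n_fft) {
    mags.resize(n_fft/2 + 1);
    for (int k = 0; k <= n_fft/2; ++k) {
        const float re = fft_out[2*k + 0];
        const float im = fft_out[2*k + 1];
        mags[k] = re*re + im*im; // upper bins are conjugates, so skipped
    }
}
```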
858452d58d models : disable old script (#2079) 2024-04-24 14:56:30 +03:00
7f85e1d7fd whisper : more prominent log message for sub-1s audio (#2065) 2024-04-24 14:46:06 +03:00
b0c3cbf2e8 main : pass nullptr when regex is empty (#2070) 2024-04-17 12:23:47 +03:00
a750868428 readme : add up-to-date repository for Python bindings (#2063)
README
2024-04-16 14:15:52 +03:00
7395c70a74 release : v1.5.5 v1.5.5 2024-04-16 14:08:31 +03:00
9fab28135c server : add dtw (#2044)
* server.cpp: add dtw

* Update examples/server/server.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-15 22:16:58 +03:00
08d3eef97d build : fix embedded Metal library generation (#2045) 2024-04-15 20:23:05 +03:00
1b5439a6c2 node : support no timestamps (#2048)
* fix: node: do not compute timestamps if you do not need them

* feat: add no_timestamps parameter to node addon
2024-04-15 20:03:34 +03:00
c7f95b7ca2 build : detect AVX512 in Makefile, add AVX512 option in CMake (#2043)
* make : add AVX512 detection to Makefile and CMakeLists.txt

* make : autodetect more AVX512 instruction subsets

* cmake : do not default to AVX512, must be enabled explicitly

* cmake : enable a set of AVX512 subsets, when AVX512 is turned on

* make : consolidate AVX512 subsets, add AVX512 VBMI

* cmake : revert to NO AVX512 setting, add settings for AVX512 VNNI and VBMI

* make : re-introduce AVX512VNNI

* cmake : remove superfluous comment line
2024-04-15 20:02:09 +03:00
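The detection in these commits lives in the Makefile and CMakeLists.txt, but the same subsets are visible from C++ through the compilers' predefined macros (GCC/Clang define them per -m flag; MSVC under /arch:AVX512). A minimal compile-time probe:

```cpp
#include <cstdio>

int main() {
#if defined(__AVX512F__)
    std::puts("AVX-512 F enabled");
#endif
#if defined(__AVX512VNNI__)
    std::puts("AVX-512 VNNI enabled");
#endif
#if defined(__AVX512VBMI__)
    std::puts("AVX-512 VBMI enabled");
#endif
    return 0;
}
```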
5c554c04ff whisper.nvim : fix missing reference to "model" variable (#2049) 2024-04-15 19:41:28 +03:00
c383f091a1 whisper : update grammar-parser.cpp (#2058)
preceeding -> preceding
2024-04-15 19:40:27 +03:00
8f253ef3af sync : ggml 2024-04-09 20:27:55 +03:00
c7dc37f97c license : update copyright notice + add AUTHORS 2024-04-09 20:27:44 +03:00
526332873b llama : add Command R Plus support (llama/6491)
* Add Command R Plus GGUF

* Add Command R Plus GGUF

* Loading works up to LayerNorm2D

* Export new tensors in 1D so they are not quantized.

* Fix embedding layer based on Noeda's example

* Whitespace

* Add line

* Fix unexpected tokens on MPS. Re-add F16 fix. (Noeda)

* dranger003: Fix block index overflow in CUDA dequantizing.

* Reverted blocked multiplication code as it still has issues and could affect other Llama arches

* export norms as f32

* fix overflow issues during quant and other cleanup

* Type convention

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* dranger003: Fix more int overflow during quant.

---------

Co-authored-by: S <seast@Ss-Mac-Studio.local>
Co-authored-by: S <s@example.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 20:26:18 +03:00
1d2721ca72 remove row=1 cond (llama/6532) 2024-04-09 20:26:18 +03:00
219e601dab support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M (llama/6521) 2024-04-09 20:26:18 +03:00
3b8aade3c2 scripts : update sync 2024-04-09 20:25:50 +03:00
52ccd4a3a8 files : rename ./extra to ./scripts 2024-04-09 20:13:41 +03:00
5275074d37 whisper : fix DTW memory access (#2012)
* Fix DTW memory access

* Memory fix - Apply changes from denersc
2024-04-09 18:38:19 +03:00
c15b4cda7d common : fix file-handle leak in read_wav() (#2026)
Now it cleans up in case of error.
2024-04-09 18:34:34 +03:00
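The fix is the classic close-on-every-exit pattern. A hedged sketch with a hypothetical loader; the actual read_wav() differs in detail, but the leak shape is the same: an early return after fopen() that skipped fclose().

```cpp
#include <cstdio>

// Hypothetical loader illustrating the fix: every error path now closes f.
static bool read_wav_sketch(const char * path) {
    FILE * f = std::fopen(path, "rb");
    if (!f) {
        return false;
    }
    char header[12];
    if (std::fread(header, 1, sizeof(header), f) != sizeof(header)) {
        std::fclose(f); // this is the kind of path that previously leaked
        return false;
    }
    // ... decode samples ...
    std::fclose(f);
    return true;
}
```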
d3cfb6ca2b main : set stdin to binary mode on Windows (#2025) 2024-04-09 18:33:32 +03:00
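On Windows, stdin opens in text mode, which translates line endings and treats Ctrl-Z as EOF, corrupting raw audio piped into the process. The standard remedy, and presumably what the commit applies (not verified line-for-line):

```cpp
#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#endif
#include <cstdio>

// No-op outside Windows; switches stdin to untranslated byte I/O on Windows.
static void set_stdin_binary() {
#ifdef _WIN32
    _setmode(_fileno(stdin), _O_BINARY);
#endif
}
```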
956ef860bc cmake : support for CPU BLAS build via Intel MKL (#2024) 2024-04-09 18:32:46 +03:00
671b4bde6c main : allow a response-file as the sole parameter (#2019)
* The "main" example now allows a response-file as the sole parameter.

A response-file is a text file with command-line parameters, one per line.
Prefix the name of the response-file with "@" to identify it as such.
It's used under MS Windows to work around command-line length limits.
It may be useful under other platforms to simplify character-escaping.

* minor : style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 18:31:16 +03:00
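A minimal sketch of the convention described in the commit above: a sole argument starting with "@" names a text file whose lines become the parameter list. The helper name and structure are illustrative, not the example's actual code.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Expand "@params.txt" into one argument per non-empty line.
static std::vector<std::string> expand_response_file(int argc, char ** argv) {
    std::vector<std::string> args;
    if (argc == 2 && argv[1][0] == '@') {
        std::ifstream file(argv[1] + 1); // skip the leading '@'
        std::string line;
        while (std::getline(file, line)) {
            if (!line.empty()) {
                args.push_back(line);
            }
        }
    } else {
        args.assign(argv + 1, argv + argc);
    }
    return args;
}
```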
c8eeb93a6a whisper : suppress tokens with a regex (#1997)
* Allow a regular expression to describe tokens to suppress.

Example: --suppress-tokens-re "[,\.]|[ ]?[0-9]+" will suppress commas, periods, and numeric tokens.

Technique inspired by https://github.com/openai/whisper/discussions/1041

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Blind change to fix Java test.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 18:27:28 +03:00
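The suppression technique in sketch form: any vocabulary token whose text matches the pattern has its logit forced to -infinity before sampling, so it can never be emitted. An illustration of the idea, not the exact whisper.cpp implementation.

```cpp
#include <limits>
#include <regex>
#include <string>
#include <vector>

// Mask out tokens whose full text matches the user-supplied regex.
static void suppress_tokens_re(std::vector<float> & logits,
                               const std::vector<std::string> & token_text,
                               const std::string & pattern) {
    const std::regex re(pattern);
    for (size_t i = 0; i < logits.size(); ++i) {
        if (std::regex_match(token_text[i], re)) {
            logits[i] = -std::numeric_limits<float>::infinity();
        }
    }
}
```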
319fe5146e cmake : create solution folders (#2004)
* Create solution folders in the CMake build.

* Fixed non-SDL2 build.

* Fixed emscripten build.
2024-04-09 18:23:33 +03:00
13c22321d1 sync : ggml 2024-04-07 17:04:56 +03:00
ccbe9d5676 extra : sync grammar-parser 2024-04-07 17:04:22 +03:00
81a3c41aa0 talk-llama : sync llama.cpp 2024-04-07 16:21:08 +03:00
a50207c65d sync : ggml 2024-04-07 16:18:11 +03:00
97878e53fd sync : llama.cpp (skip)
ggml-ci
2024-04-07 16:15:57 +03:00
61b05815e0 Fixed minor bug when enabling FP16 for non-Intel targets (llama/6464)
* moved INTEL_MKL guard from gemm_impl to gemm (wrapper)

* Update ggml-sycl.cpp

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>

---------

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
2024-04-07 16:15:57 +03:00
1dce94cf26 ggml : mul_mat_id use the same tensor for all the experts (llama/6387)
* ggml : update mul_mat_id to use the same tensor for all the experts

* update cuda

* minor

* update metal

* update test-backend-ops

* fix cuda

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update convert.py

* update convert-hf-to-gguf.py

* update convert.py for mixtral hf models

* Update convert-hf-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* cuda : support non-pow-2 number of experts

* allow quantize to work for split and merged experts models in the same way

* cleanup + disable mmap automatically with split tensors models

* update imatrix

* test-backend-ops : test qwen argsort

* update grok model loading

* llama : add merged experts tensors to the grok tensor map

* minor

* gguf : bump version

* fix quantizing of merged experts

* convert-hf-to-gguf.py : update grok (untested)

* make linter happy

* cuda/argsort : use shared memory instead of pool memory

* convert : fix grok tensor names

* metal : add support for non-pow-2 argsort

* llama : more loader cleanup, better error checking

* cuda : fix warning

* llama : still use mmap for loading old models, but copy the data to a host buffer

* add review note

* llama : remove ffn tensor counting + add sanity check

ggml-ci

* convert : fix handling of n_experts == None

ggml-ci

* imatrix : fix ncall counters

* llama : produce error if imatrix size does not match

* quantize : terminate on errors + trace logs

ggml-ci

* metal : pad shared memory to 16 bytes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-07 16:15:57 +03:00
f12e982c0b Disable iqx on Windows as a workaround (llama/6435)
* disable iqx on windows as WA

* array instead of global_memory
2024-04-07 16:15:57 +03:00
fa966b9b40 Vulkan k-quant mmq and ggml-backend offload functionality (llama/6155)
* Fix Vulkan no kv offload incoherence

* Add k-quant mul mat mat shaders

* Rework working buffer allocation, reduces vram use noticeably

Clean up cpu assist code, replaced with ggml-backend offload function

* Default to all dedicated GPUs

* Add fallback for integrated GPUs if no dedicated GPUs are found

* Add debug info which device is allocating memory

* Fix Intel dequant issue

Fix validation issue

* Fix Vulkan GGML_OP_GET_ROWS implementation

* Clean up merge artifacts

* Remove Vulkan warning
2024-04-07 16:15:57 +03:00
b83a9fc9d3 fix set main gpu crash (llama/6339) 2024-04-07 16:15:56 +03:00
3adbf2fb03 ggml : fix bounds checking of zero size views (llama/6347) 2024-04-07 16:15:56 +03:00
700d146127 backend : fix typo in scheduler documentation (ggml/781)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-07 16:15:56 +03:00
a74fde9b4c extra : sync ggml-cuda folder 2024-04-07 16:10:44 +03:00
1d7657f409 ggml: bypass code incompatible with CUDA < 11.1 (#2020)
The `cudaHostRegisterReadOnly` parameter was only introduced in CUDA 11.1.

See this issue for more details:
https://github.com/ggerganov/whisper.cpp/issues/2007
2024-04-04 14:49:24 +02:00
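The guard pattern for such toolkit-version gaps: CUDART_VERSION encodes the toolkit as major*1000 + minor*10, so CUDA 11.1 is 11010. A hedged sketch of the bypass; the commit's actual code may be structured differently.

```cpp
#include <cuda_runtime.h>

// Pin host memory, requesting read-only registration only where available.
static cudaError_t register_host_buffer(void * ptr, size_t size, bool read_only) {
    unsigned int flags = cudaHostRegisterDefault;
#if defined(CUDART_VERSION) && CUDART_VERSION >= 11010
    if (read_only) {
        flags |= cudaHostRegisterReadOnly; // introduced in CUDA 11.1
    }
#else
    (void) read_only; // flag not available before CUDA 11.1
#endif
    return cudaHostRegister(ptr, size, flags);
}
```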
ac283dbce7 ci : add building in MSYS2 environments (Windows) (#1994) 2024-03-30 09:20:20 +02:00