Compare commits

..

1071 Commits

Author SHA1 Message Date
6266a9f9e5 release : v1.7.2 2024-11-19 18:54:22 +02:00
d24f981fb2 sycl: fix example build (#2570) 2024-11-18 14:57:23 +02:00
01d3bd7d5c ci : use local ggml in Android build (#2567) 2024-11-16 20:45:41 +02:00
bb12cd9b77 ggml : tmp workaround for whisper.cpp (skip) (#2565) 2024-11-16 20:21:24 +02:00
f02b40bcb4 update : readme 2024-11-15 16:00:10 +02:00
83ac2842bd scripts : fix sync path 2024-11-15 15:24:09 +02:00
c4e95fb74d whisper.swiftui : switch Mac dest to Mac (Designed for iPad) (#2562) 2024-11-15 15:21:53 +02:00
e23721f3fb cmake : fix ppc64 check (#0) 2024-11-15 15:21:04 +02:00
c0a9f8ef85 whisper : include ggml-cpu.h (#0) 2024-11-15 15:21:04 +02:00
6477b84eb6 build : fixes 2024-11-15 15:21:04 +02:00
24d706774d talk-llama : sync llama.cpp 2024-11-15 15:21:04 +02:00
5089ab2d6a whisper : fix build (#0) 2024-11-15 15:21:04 +02:00
bdbb906817 sync : ggml 2024-11-15 15:21:04 +02:00
fa2ebd336e sycl : Fixes to broken builds and test-backend-ops (llama/10257)
* Fixes broken build for the SYCL CUDA backend caused by non-explicit gemm call in outprod (merged in with RWKV6 in
Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration #10133)

* Marks permuted MUL_MAT as unsupported to be able to run test-backend-ops

* Fixes asserts in norm to fix debug builds.
2024-11-15 15:21:04 +02:00
21b01a21b6 vulkan: Optimize contiguous copies (llama/10254)
* tests: Fix memory bandwidth calculation for perf tests

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.

* vulkan: Optimize contiguous copies

Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.
2024-11-15 15:21:04 +02:00
b54ce5edc5 vulkan: Throttle the number of shader compiles during the build step. (llama/10222)
Fixes #9582

Spawning too many concurrent copies of glslc leads to "Failed to create pipes"
errors on Linux. This change applies the same throttling we use for
multithreaded pipeline creation.
2024-11-15 15:21:04 +02:00
26a31b78e9 metal : more precise Q*K in FA vec kernel (llama/10247) 2024-11-15 15:21:04 +02:00
14d13c5f9f vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (llama/10226) 2024-11-15 15:21:04 +02:00
5e110c2eb5 metal : reorder write loop in mul mat kernel + style (llama/10231)
* metal : reorder write loop

* metal : int -> short, style

ggml-ci
2024-11-15 15:21:04 +02:00
4a9926d521 metal : fix build and some more comments (llama/10229) 2024-11-15 15:21:04 +02:00
ae3c5642d0 metal : fix F32 accumulation in FA vec kernel (llama/10232) 2024-11-15 15:21:04 +02:00
e287a3b627 metal : hide debug messages from normal log 2024-11-15 15:21:04 +02:00
SXX
b890243690 ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213) 2024-11-15 15:21:04 +02:00
b7b38f7d68 ggml : optimize llamafile cpu matrix multiplication for ppc64le (llama/10156)
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le using MMA
builtins for FP32 datatype.

This change results in a consistent 90%
improvement in input processing time, and 20%
to 80% improvement in output processing time,
across various batch sizes.

The patch is tested with Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on a
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2024-11-15 15:21:04 +02:00
9f67aab211 metal : opt-in compile flag for BF16 (llama/10218)
* metal : opt-in compile flag for BF16

ggml-ci

* ci : use BF16

ggml-ci

* swift : switch back to v12

* metal : has_float -> use_float

ggml-ci

* metal : fix BF16 check in MSL

ggml-ci
2024-11-15 15:21:04 +02:00
8f0f785d88 metal : improve clarity (minor) (llama/10171) 2024-11-15 15:21:04 +02:00
d0b8335789 metal : optimize FA kernels (llama/10171)
* ggml : add ggml_flash_attn_ext_get_prec

* metal : use F16 precision in FA kernels

ggml-ci

* metal : minor clean-up

* metal : compile-guard bf16 FA kernels

ggml-ci

* build : remove obsolete compile flag [no ci]

* metal : prevent int overflows [no ci]

* cuda : disable BF16 FA

ggml-ci

* metal : fix BF16 requirement for FA kernels

ggml-ci

* make : clean-up [no ci]
2024-11-15 15:21:04 +02:00
1550be79f1 ggml : add ggml-cpu.h to the public headers (llama/10204) 2024-11-15 15:21:04 +02:00
807f848c2f fix q4_0_8_8 format for corrupted tokens issue (llama/10198)
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-62-167.us-west-2.compute.internal>
2024-11-15 15:21:04 +02:00
42398f13b0 Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration (llama/10133)
* rwkv6: rename to wkv6

* rwkv6: support avx2 avx512 armv8 armv9

* rwkv6: update cuda file name

* rwkv6: rename params

* wkv on sycl

* sycl: add some ops

* sycl: Enhance OP support judgment

* wkv6: drop armv9 and tranfer to GGML style

ggml-ci

* sync : ggml

* update the function to use appropriate types

* fix define error

* Update ggml/src/ggml-cpu.c

* add appropriate asserts

* move element-wise functions outside

* put the declaration outside the loop

* rewrite to be more inline with the common pattern for distributing threads

* use recommended way GGML_TENSOR_LOCALS

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Plamen Minev <pacominev@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>
2024-11-15 15:21:04 +02:00
31c3482a4e metal : add BF16 support (llama/8439)
* ggml : add initial BF16 support

ggml-ci

* metal : add mul_mat_id BF16 support

ggml-ci

* metal : check for bfloat support on the Metal device

ggml-ci

* metal : better var names [no ci]

* metal : do not build bfloat kernels when not supported

ggml-ci

* metal : try to fix BF16 support check

ggml-ci

* metal : this should correctly check bfloat support
2024-11-15 15:21:04 +02:00
50257af686 metal : fix from ptr buffer name (llama/10189) 2024-11-15 15:21:04 +02:00
d111a0987e ggml : adjust is_first_call init value (llama/10193)
ggml-ci
2024-11-15 15:21:04 +02:00
915bcd2c63 metal : add quantized FA support (llama/10149)
* metal : add quantized FA (vec) support

ggml-ci

* metal : add quantized FA (non-vec) support

* metal : fix support check

ggml-ci

* metal : clean-up

* metal : clean-up (cont)

* metal : fix shared memory calc + reduce smem + comments

* metal : float-correctness

* metal : minor [no ci]
2024-11-15 15:21:04 +02:00
f69c8b6f1b ggml : fix arch check in bf16_to_fp32 (llama/10164) 2024-11-15 15:21:04 +02:00
Eve
8c9044bef0 Q6_K AVX improvements (llama/10118)
* q6_k instruction reordering attempt

* better subtract method

* should be theoretically faster

small improvement with shuffle lut, likely because all loads are already done at that stage

* optimize bit fiddling

* handle -32 offset separately. bsums exists for a reason!

* use shift

* Update ggml-quants.c

* have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86
2024-11-15 15:21:04 +02:00
5f8e928194 ggml : fix gelu tables initialization (llama/10172) 2024-11-15 15:21:04 +02:00
25da30bd60 ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (llama/10167) 2024-11-15 15:21:04 +02:00
542734100e fix build break on arm64 linux (llama/10166)
This fixes the build break from the recent changes
to move the CPU backend to separate files
https://github.com/ggerganov/llama.cpp/pull/10144
2024-11-15 15:21:04 +02:00
b06b4c0c08 cuda : clear error after changing peer access (llama/10153) 2024-11-15 15:21:04 +02:00
939d36fb4c metal : simplify f16 and f32 dequant kernels (llama/0) 2024-11-15 15:21:04 +02:00
1471e41180 metal : move dequantize templates to beginning of MSL source (llama/0) 2024-11-15 15:21:04 +02:00
35949192e9 CANN: adjust backend registry refactor. (llama/10158)
remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR.
2024-11-15 15:21:04 +02:00
9c817edb48 ggml : move CPU backend to a separate file (llama/10144) 2024-11-15 15:21:04 +02:00
24a0feb5d9 metal : minor fixup in FA kernel (llama/10143)
* metal : minor fixup in FA kernel

ggml-ci

* metal : use the unrolled loop variable

* metal : remove unused var
2024-11-15 15:21:04 +02:00
2ab8cce7e3 llama : add simple-chat example (llama/10124)
* llama : add simple-chat example

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-11-15 15:21:04 +02:00
b40c255e98 llama : use smart pointers for ggml resources (llama/10117) 2024-11-15 15:21:04 +02:00
ec3e16445e vulkan : improve ggml_vk_create_buffer error handling (llama/9898) 2024-11-15 15:21:04 +02:00
0665168ef3 ggml : remove ggml_scratch (llama/10121)
ggml-ci
2024-11-15 15:21:04 +02:00
5f6b992eea build: fix build error in Windows env with OneAPI setup (llama/10107) 2024-11-15 15:21:04 +02:00
3e231ab9cc llama : fix buffer checks for mamba and rwk (llama/10111)
* llama : fix buffer checks for mamba and rwk

* llama : fix missing worst case flag during reserve

* cuda : fix supports_op for norm

* disable sched SET_CAUSE
2024-11-15 15:21:04 +02:00
371bfaca8c ggml : check tensor name lengths in gguf files (llama/10100) 2024-11-15 15:21:04 +02:00
91e30a3a23 kompute: add mul_mat_q4_k shader (llama/10097)
This is a more or less direct translation from the Metal implementation
to GLSL.

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-11-15 15:21:04 +02:00
1e122d66f9 kompute: add backend registry / device interfaces (llama/10045)
Get in line with the other backends by supporting the newer
backend/device registry interfaces.

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-11-15 15:21:04 +02:00
63a4e09a0f ggml : fix memory leaks when loading invalid gguf files (llama/10094)
* ggml : fix gguf string leak when reading kv pairs fails

* ggml : avoid crashing with GGML_ABORT when the KV has an invalid type

* ggml : avoid crashing on failed memory allocations when loading a gguf file
2024-11-15 15:21:04 +02:00
75dd198870 ggml : add Q4_0_8_8 RISC-V GEMV and GEMM kernels (llama/10029)
* ggml : RISC-V vector gemv for q4_0_8x8

* ggml : Added WIP rvv q4_0_8x8 gemm

* ggml : Added initial implementation of rvv gemm

* ggml : optimize gemm to avoid register spillover

* ggml : Fix GCC rvv load alignment issue

* ggml : Format gemm rvv code

* ggml : Fix a typo in RVV q4_0_8_8 GEMM
2024-11-15 15:21:04 +02:00
1d48457aa6 llama : refactor model loader with backend registry (llama/10026) 2024-11-15 15:21:04 +02:00
307712a903 ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the MobileVLM model. (llama/9763)
* ggml: Add POOL2D OP for GPU ACC to the Vulkan.

- The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend.
- A GGML_OP_POOL_2D shader has been added. (Pooling)
- The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Correct the incorrect order of the parameters.

fix casting to int.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

---------

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
2024-11-15 15:21:04 +02:00
fbc9a05ddf musa: workaround for Guilty Lockup in cleaning src0 (llama/10042)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-11-15 15:21:04 +02:00
28496ac55e cmake : make it possible linking ggml as external lib (ggml/1003) 2024-11-15 15:21:04 +02:00
b1c06c09b0 metal : fix minor string leaks (ggml/1004) 2024-11-15 15:21:04 +02:00
498ac0dc27 scripts : update sync 2024-11-15 15:21:04 +02:00
03af461de8 ci : fix building workflow for linux/arm64 container (#2555) 2024-11-15 11:07:17 +02:00
f19463ece2 ruby : extend API (#2551)
* Handle objs in Ruby code

* Add task to make Makefile

* Share commont constance in test suites

* Add model-related APIs

* Add Whisper::Model class

* Add tests for Whisper::Model

* Add missing LDFLAG -lstdc++

* Add tests for Whisper.log_set

* Add Whisper.set_log

* Define log level

* Add document on logging

* Add license section to README

* Add document on Whisper::Model

* Fix examples in README

* Add test for Model with GC

* Make dependency on Makefile more accurate

* Fix bug about Whisper::Model and GC
2024-11-13 21:52:56 +02:00
5f8a086e22 whisper.swiftui : add model download list & bench methods (#2546)
* swift : fix resources & exclude build

* whisper : impl whisper_timings struct & api

* whisper.swiftui : model list & bench methods

* whisper : return ptr for whisper_get_timings

* revert unnecessary change

* whisper : avoid designated initializer

* whisper.swiftui: code style changes

* whisper.swiftui : get device name / os from UIDevice

* whisper.swiftui : fix UIDevice usage

* whisper.swiftui : add memcpy and ggml_mul_mat (commented)
2024-11-13 21:51:34 +02:00
a28d82e373 ruby : fix the instructions (#2548)
#prompt doesn't exist but #initial_prompt does
2024-11-13 21:47:42 +02:00
5ccca19f0c ggml : vulkan logs (#2547) 2024-11-13 21:47:15 +02:00
300c07b94d examples : fix ffmpeg v5 build (#2543)
remove call to 'av_register_all()' which does not exist in ffmpeg v5
anymore.
2024-11-13 21:41:52 +02:00
31aea563a8 whisper : fix extra memory usage (#2534)
* passing samples_padded by ref to the threads.

* passing samples_padded by ref to the threads.

---------

Co-authored-by: Vinith Misra <physicsdemon@gmail.com>
2024-11-06 23:02:11 +02:00
0377596b77 whisper : backend registry init before model load 2024-11-01 10:19:05 +02:00
c65d0fd3c8 talk-llama : sync llama.cpp 2024-11-01 10:19:05 +02:00
d9efb664ac sync : ggml 2024-11-01 10:19:05 +02:00
b5b4b0f5de ggml : add AMX backend (llama/8998) 2024-11-01 10:19:05 +02:00
ab36d02560 metal : support permuted matrix multiplicaions (llama/10033)
* metal : support permuted matrix multiplicaions

ggml-ci

* cont : use nb01 directly for row steps

ggml-ci

* cont : add comments [no ci]

* metal : minor refactor

* metal : minor
2024-11-01 10:19:05 +02:00
6e67749c00 CUDA: fix insufficient buffer clearing for MMQ (llama/10032) 2024-11-01 10:19:05 +02:00
ab0385f43b CUDA: fix MMQ for non-contiguous src0, add tests (llama/10021)
* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code
2024-11-01 10:19:05 +02:00
10eb603a3c increase cuda_cpy block size (ggml/996)
Co-authored-by: bssrdf <bssrdf@gmail.com>
2024-11-01 10:19:05 +02:00
a3231b2f2e metal : add POOL2D and fix IM2COL (llama/9943)
* add pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix im2col and add unittest for N>=1024

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* add tests for N % 1024 != 0

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* remove trailing whitespaces

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply suggestions

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply more optimization

- original IM2COL kernel + _ext with MIN()

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review: change kernel name of pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix more formatting and enhance readability

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

---------

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>
2024-11-01 10:19:05 +02:00
13db492f83 Adapt to dynamically loadable backends mechanism (llama/9970)
* [CANN] Adapt to dynamically loadable backends mechanism

* Fix the Bug: inference running result is garbled in debug running model for LM models who's type is Q4_0 class

* Handle the review comments of this pull request
2024-11-01 10:19:05 +02:00
741c138aa1 ggml : add asserts for type conversion in fattn kernels (llama/9971)
ggml-ci
2024-11-01 10:19:05 +02:00
25f9fee6fb rpc : pack only RPC structs (llama/9959) 2024-11-01 10:19:05 +02:00
7c1570bee6 fix mul_mat_vec_q and *_vec_q error (llama/9939)
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-11-01 10:19:05 +02:00
4078e4c388 rpc : backend refactoring (llama/9912)
* rpc : refactor backend

Use structs for RPC request/response messages

* rpc : refactor server
2024-11-01 10:19:05 +02:00
a4a22daa8f Add SYCL Backend registry, device and Event Interfaces (llama/9705)
* implemented missing SYCL event APIs

* sycl : Added device and backend reg interfaces

* Restructured ggml-sycl.cpp
2024-11-01 10:19:05 +02:00
e1936eb2a5 add amx kernel for gemm (llama/8998)
add intel amx isa detection

add vnni kernel for gemv cases

add vnni and amx kernel support for block_q8_0

code cleanup

fix packing B issue

enable openmp

fine tune amx kernel

switch to aten parallel pattern

add error message for nested parallelism

code cleanup

add f16 support in ggml-amx

add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS

update CMakeList

update README

fix some compilation warning

fix compiler warning when amx is not enabled

minor change

ggml-ci

move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp

ggml-ci

update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16

ggml-ci

add amx as an ggml-backend

update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h

minor change

update CMakeLists.txt

minor change

apply weight prepacking in set_tensor method in ggml-backend

fix compile error

ggml-ci

minor change

ggml-ci

update CMakeLists.txt

ggml-ci

add march dependency

minor change

ggml-ci

change ggml_backend_buffer_is_host to return false for amx backend

ggml-ci

fix supports_op

use device reg for AMX backend

ggml-ci

minor change

ggml-ci

minor change

fix rebase

set .buffer_from_host_ptr to be false for AMX backend
2024-11-01 10:19:05 +02:00
28b044dad9 vulkan : add backend registry / device interfaces (llama/9721)
* vulkan : add backend registry / device interfaces

* llama : print devices used on model load
2024-11-01 10:19:05 +02:00
b8f11a0a17 fix: allocating CPU buffer with size 0 (llama/9917) 2024-11-01 10:19:05 +02:00
ff5a838099 fix: use vm_allocate to allocate CPU backend buffer on macOS (llama/9875)
* fix: use `vm_allocate` to allocate CPU backend buffer on macOS

* fix: switch to `posix_memalign` to keep existing `free()` usages work

* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS

* style: formatting

* fix: move const outside of `#ifndef`

* style: formatting

* fix: unused var

* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`

* fix: unused var

* fix: page align to `GGUF_DEFAULT_ALIGNMENT`

* fix: page align to `TENSOR_ALIGNMENT`

* fix: convert `TENSOR_ALIGNMENT` to a macro

* fix: increase page size to `32` on iOS

* fix: iOS page size

* fix: `hbw_posix_memalign` alignment
2024-11-01 10:19:05 +02:00
84713613be CUDA: fix 1D im2col, add tests (ggml/993) 2024-11-01 10:19:05 +02:00
ded89c9d08 Fix cann compilation error (llama/9891)
Fix cann compilation error after merging llama.cpp supports dynamically loadable backends.
2024-11-01 10:19:05 +02:00
042e95d92f Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816)
* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-11-01 10:19:05 +02:00
81110c0174 ggml : move more prints to the ggml log system (llama/9839)
* ggml : move more prints to the ggml log system

* show BLAS OpenMP warnings in all builds using debug print
2024-11-01 10:19:05 +02:00
c313723860 rpc : add backend registry / device interfaces (llama/9812)
* rpc : add backend registry / device interfaces

* llama : add llama_supports_rpc API

* ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server
2024-11-01 10:19:05 +02:00
e69b2371e2 musa: add docker image support (llama/9685)
* mtgpu: add docker image support

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: enable docker workflow

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-11-01 10:19:05 +02:00
1531259b2c ggml : fix BLAS with unsupported types (llama/9775)
* ggml : do not use BLAS with types without to_float

* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies

* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits

it's not really internal if everybody uses it
2024-11-01 10:19:05 +02:00
44bc2767fd ggml : add backend registry / device interfaces to BLAS backend (llama/9752)
* ggml : add backend registry / device interfaces to BLAS backend

* fix mmap usage when using host buffers
2024-11-01 10:19:05 +02:00
bd7ace7adc Update building for Android (llama/9672)
* docs : clarify building Android on Termux

* docs : update building Android on Termux

* docs : add cross-compiling for Android

* cmake : link dl explicitly for Android
2024-11-01 10:19:05 +02:00
315364d7de ggml : add metal backend registry / device (llama/9713)
* ggml : add metal backend registry / device

ggml-ci

* metal : fix names [no ci]

* metal : global registry and device instances

ggml-ci

* cont : alternative initialization of global objects

ggml-ci

* llama : adapt to backend changes

ggml-ci

* fixes

* metal : fix indent

* metal : fix build when MTLGPUFamilyApple3 is not available

ggml-ci

* fix merge

* metal : avoid unnecessary singleton accesses

ggml-ci

* metal : minor fix [no ci]

* metal : g_state -> g_ggml_ctx_dev_main [no ci]

* metal : avoid reference of device context in the backend context

ggml-ci

* metal : minor [no ci]

* metal : fix maxTransferRate check

* metal : remove transfer rate stuff

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-11-01 10:19:05 +02:00
80753d4da8 metal : single allocation of encode_async block (llama/9747)
* Single allocation of encode_async block with non-ARC capture in ggml-metal.m

* Moving Block_release to the deallocation code

* Release encode block when re-setting encoding buffer count if needed

* Update ggml/src/ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-01 10:19:05 +02:00
8f9bdca4c4 ggml-alloc : remove buffer_id from leaf_alloc (ggml/987)
This commit removes the buffer_id field from the leaf_alloc struct.

The motivation for is that this field is only written to and never
read/used as far as I can tell. Each tensor_alloc has a buffer_id field
and this is what caused me to look into this more closely, to
understand what the buffer_id in leaf_alloc was used for.
2024-11-01 10:19:05 +02:00
4e10afb5a9 scripts : sync amx 2024-10-31 22:13:24 +02:00
aa037a60f3 ggml : alloc ggml_contexts on the heap (#2525)
* whisper : reduce ggml_context usage

* ggml : allocate contexts on the heap (v2)

* ggml : aligned malloc -> malloc
2024-10-31 22:00:09 +02:00
19dca2bb14 ci : fix openblas build (#2511)
* ci : fix openblas build

* cont : would this work?

* ci : I'm sorry, windows

* cont : disabled wrong build

* ci : fix openblas build with pkgconfiglite (#2517)

- choco install pkgconfiglite (vcpkg-pkgconf doesn't contain pkg-config executable?)
- vcpkg install openblas (otherwise it is not detected now)

---------

Co-authored-by: Tamotsu Takahashi <ttakah+github@gmail.com>
2024-10-30 12:58:26 +02:00
55e422109b scripts : add turbo-q8_0 to the benchmark 2024-10-29 19:37:24 +02:00
3f020fac9d whisper : minor compile warning 2024-10-29 19:30:26 +02:00
1626b73b03 whisper : move new-segment callback after DTW step (#2515) 2024-10-29 08:47:21 +02:00
850f7b19d3 ruby : fix installation test (#2519) 2024-10-29 08:45:37 +02:00
d4bc413505 ruby : add more APIs (#2518)
* Add test for built package existence

* Add more tests for Whisper::Params

* Add more Whisper::Params attributes

* Add tests for callbacks

* Add progress and abort callback features

* [skip ci] Add prompt usage in README

* Change prompt text in example
2024-10-28 19:23:23 +02:00
fc49ee4479 ruby : support new-segment callback (#2506)
* Add Params#new_segment_callback= method

* Add tests for Params#new_segment_callback=

* Group tests for #transcribe

* Don't use static for thread-safety

* Set new_segment_callback only when necessary

* Remove redundant check

* [skip ci] Add Ruby version README

* Revert "Group tests for #transcribe"

This reverts commit 71b65b00cc.

* Revert "Add tests for Params#new_segment_callback="

This reverts commit 81e6df3bab.

* Add test for Context#full_n_segments

* Add Context#full_n_segments

* Add tests for lang API

* Add lang API

* Add tests for Context#full_lang_id API

* Add Context#full_lang_id

* Add abnormal test cases for lang

* Raise appropriate errors from lang APIs

* Add tests for Context#full_get_segment_t{0,1} API

* Add Context#full_get_segment_t{0,1}

* Add tests for Context#full_get_segment_speaker_turn_next API

* Add Context#full_get_segment_speaker_turn_next

* Add tests for Context#full_get_segment_text

* Add Context#full_get_setgment_text

* Add tests for Params#new_segment_callback=

* Run new segment callback

* Split tests to multiple files

* Use container struct for new segment callback

* Add tests for Params#new_segment_callback_user_data=

* Add Whisper::Params#new_user_callback_user_data=

* Add GC-related test for new segment callback

* Protect new segment callback related structs from GC

* Add meaningful test for build

* Rename: new_segment_callback_user_data -> new_segment_callback_container

* Add tests for Whisper::Segment

* Add Whisper::Segment and Whisper::Context#each_segment

* Extract c_ruby_whisper_callback_container_allocate()

* Add test for Whisper::Params#on_new_segment

* Add Whisper::Params#on_new_egment

* Assign symbol IDs to variables

* Make extsources.yaml simpler

* Update README

* Add document comments

* Add test for calling Whisper::Params#on_new_segment multiple times

* Add file dependencies to GitHub actions config and .gitignore

* Add more files to ext/.gitignore
2024-10-28 15:43:27 +02:00
c0ea41f6b2 ruby : add Metal support (#2516) 2024-10-28 13:08:09 +02:00
0fbaac9c89 whisper : fix index overflow in token-level timestamp logic (#2505) 2024-10-23 15:14:03 +03:00
a5abfe6a90 readme : update links and make commands (#2489)
* Update links to headers in README.md

* Add link to Vulkan section in README.md

* Add "-j" for parallelism for "make" in README.md

* Update README.md
2024-10-17 13:25:18 +03:00
d3f7137cc9 ruby : fix bindings (#2484)
* Improve Rakefile

* Remove intermediate files

* Remove unnecessary manipulations from extconf.rb

* Add README and LINCENSE to source files

* Manage ext source files using YAML file

* Use extsources.yaml to include files into gem package file

* Add git-managed source files to build dependency

* Add test task

* Download model for test if not exists

* Add test for build

* Ignore gem package directory

* Enable GitHub action for Ruby binding

* Fix model name

* Build lib file for test

* Use extension for each platform

* Use extension for each platform on testing

* Move built lib file rather than copy

* Add intermediate files to clean targets
2024-10-16 18:44:04 +03:00
f7c99e49b3 readme : add Vulkan notice (#2488)
* Add Vulkan notice in README.md

* Fix formatting for Vulkan section in README.md

* Fix formatting in README.md
2024-10-16 18:43:26 +03:00
1d5752fa42 make : fix GGML_VULKAN=1 build (#2485) 2024-10-16 18:42:47 +03:00
b6049060dd whisper : add dtw preset for large-v3-turbo (#2481) 2024-10-15 21:00:21 +03:00
06a1da9daf convert : handle max_target_positions (#2477)
as needed eg for
https://huggingface.co/primeline/whisper-large-v3-turbo-german/blob/main/config.json
2024-10-14 10:46:33 +03:00
746d173592 readme : update the Quick Start section (#2475)
navigating into the directory
2024-10-14 10:44:57 +03:00
fdbfb460ed whisper : add OpenVINO init with state (#2464)
* Fixed OpenVino init on state

* Removed an empty line

* Fixed typo

* Replaced tabs with spaces

---------

Co-authored-by: Sandro Hanea <sandrohanea@users.noreply.github.com>
2024-10-08 20:08:00 +03:00
ebca09a3d1 release : v1.7.1 2024-10-07 13:06:48 +03:00
9f346d0084 vulkan : retry allocation with fallback flags (#2451)
Co-authored-by: Samuel Morris <samuel.morris@artlist.io>
2024-10-06 10:34:20 +03:00
6a94163b91 release : v1.7.0 2024-10-05 16:43:26 +03:00
8a35b58c4f scripts : bench v3-turbo 2024-10-05 16:22:53 +03:00
1789abca84 whisper : remove mel leftover constants (396089f) 2024-10-05 16:13:03 +03:00
847f94fdeb whisper : zero-out the KV cache upon clear (#2445) 2024-10-05 15:23:51 +03:00
6e40108a59 objc : fix build 2024-10-05 15:23:51 +03:00
1ba185f4af metal : zero-init buffer contexts (#0) 2024-10-05 15:23:51 +03:00
396089f3cf whisper : revert mel-related changes (#0)
too much extra logic and complexity for small benefit
2024-10-05 15:23:51 +03:00
941912467d whisper : adapt to latest ggml (skip) (#0) 2024-10-05 15:23:51 +03:00
0b1b094a67 ggml : fix typo in example usage ggml_gallocr_new (ggml/984) 2024-10-05 15:23:51 +03:00
40e52a76b9 ggml : fixes after sync (ggml/983)
ggml : remove test-backend-buffer

ggml : fix CUDA build warnings
2024-10-05 15:23:51 +03:00
cf977670e6 ggml-backend : add device and backend reg interfaces (llama/9707)
Also:

- metal : fix compute pass descriptor autorelease crash
- ggml-backend : add device description to CPU backend
- ggml: unify backend logging mechanism
2024-10-05 15:23:51 +03:00
df2c364de7 Fixed dequant precision issues in Q4_1 and Q5_1 (llama/9711) 2024-10-05 15:23:51 +03:00
1acfadb721 ggml-backend : add device and backend reg interfaces (llama/9707)
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-10-05 15:23:51 +03:00
ea642144d2 Initial cmake support of SYCL for AMD GPUs (llama/9658)
sycl: initial cmake support of SYCL for AMD GPUs
2024-10-05 15:23:51 +03:00
282a8654c4 vulkan : do not use tensor->extra (llama/9407)
* vulkan : do not use tensor->extra

This patch allows using the Vulkan backend with the RPC backend as
tensor->extra is no longer used.

Ref: #8536

* Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (llama/2)

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-10-05 15:23:51 +03:00
936cf3beb7 ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-05 15:23:51 +03:00
bc92c2f8f0 ggml: refactor cross entropy loss CPU impl. (ggml/976) 2024-10-05 15:23:51 +03:00
f7d55e0614 scripts : sync ggml-backend.cpp 2024-10-05 15:23:51 +03:00
f62a546e03 whisper : fix excessive memory usage (#2443)
* whisper : fix KV cache allocation

* whisper : reduce memory overhead from unused input tensors
2024-10-05 12:36:40 +03:00
2944cb72d9 examples : update dr_wav.h to newer version (#2449) 2024-10-04 11:04:51 +03:00
ccc2547210 talk-llama : sync llama.cpp 2024-10-03 12:22:17 +03:00
162a455402 metal : reduce command encoding overhead (llama/9698) 2024-10-03 12:22:17 +03:00
ff2cb0811f sync : ggml 2024-10-03 12:22:17 +03:00
5e9d6baa48 test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974) 2024-10-03 12:22:17 +03:00
845f8d663e vulkan : mul_mat: fix UB with small warps (ggml/952)
When the device's warp size is less than 16,
it is possible for loadstride_a (mul_mm.comp:114)
and loadstride_b (mul_mm.comp:115) to be set to 0.
Because they are calculated as: the workgroup size,
multiplied by LOAD_VEC_* (which can be 1) and divided by 16.
And the workgroup size is set to be the same as the
warp/subgroup size.

The loadstride_* variables are used as increments in the
loops that populate the buffers used for the multiplication.

When they are 0 they cause an infinite loop.
But infinite loops without side-effects are UB and the
values of loadstride_* are known at compile time.
So, the compiler quietly optimizes all the loops away.
As a consequence, the buffers are not populated and
the multiplication result is just a matrix with all elements
set to 0.

We prevent the UB by making sure that the workgroup size
will never be less than 16, even if our device has a
smaller warp size (e.g. 8).

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-10-03 12:22:17 +03:00
31fdf05fda ggml : fix ggml_cast (ggml/973) 2024-10-03 12:22:17 +03:00
0ac6666cd2 ggml: fix gradient allocation logic (ggml/966)
* ggml: fix gradient allocation logic

* gradient allocation in ggml_build_backward_expand

* fixup

* fix test-backend-ops grad

* suggestions by slaren

* fix test1.c

* fix legacy opt API

* fix test-grad0

* remove keep arg
2024-10-03 12:22:17 +03:00
6c91da80b8 ggml : define missing HWCAP flags (llama/9684)
ggml-ci

Co-authored-by: Willy Tarreau <w@1wt.eu>
2024-10-03 12:22:17 +03:00
c245168ba3 ggml : add run-time detection of neon, i8mm and sve (llama/9331)
* ggml: Added run-time detection of neon, i8mm and sve

Adds run-time detection of the Arm instructions set features
neon, i8mm and sve for Linux and Apple build targets.

* ggml: Extend feature detection to include non aarch64 Arm arch

* ggml: Move definition of ggml_arm_arch_features to the global data section
2024-10-03 12:22:17 +03:00
280fee8fa0 Enable use to the rebar feature to upload buffers to the device. (llama/9251) 2024-10-03 12:22:17 +03:00
78b4c1c25f mtgpu: enable VMM (llama/9597)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-10-03 12:22:17 +03:00
1edea2eb4b ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (llama/9217)
* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels

* added fallback mechanism when the offline re-quantized model is not
optimized for the underlying target.

* fix for build errors

* remove prints from the low-level code

* Rebase to the latest upstream
2024-10-03 12:22:17 +03:00
96808786b7 cann: fix crash when llama-bench is running on multiple cann devices (llama/9627) 2024-10-03 12:22:17 +03:00
bb57ecb85e CUDA: remove bad assert (ggml/972) 2024-10-03 12:22:17 +03:00
abdb73c7cc vulkan : multithread pipeline creation (ggml/963) 2024-10-03 12:22:17 +03:00
391e548a43 vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961) 2024-10-03 12:22:17 +03:00
2a29afd4c6 vulkan : argsort barriers must be under uniform control flow (ggml/951)
a return before a barrier (that happens only in some threads in
a workgroup) leads to UB.
While the old code actually works on some devices,
it fails on some others (i.e. "smaller" GPUs).

BTW, I think it would be better to set specialization constants
when the graph is built, in that way the local workgroup
could be sized appropriately.
But it would take a lot of work.

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-10-03 12:22:17 +03:00
5963004ff9 ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969) 2024-10-03 12:22:17 +03:00
ede1718f6d server : ffmpeg overwrite leftover temp file (#2431)
* Remove possible leftover ffmpeg temp file from a previous failed conversion

* Revert "Remove possible leftover ffmpeg temp file from a previous failed conversion"

This reverts commit 00797403bd.

* Flag to force ffmpeg to overwrite output file if it exists
2024-10-02 15:06:40 +03:00
2ef717b293 whisper : add large-v3-turbo (#2440) 2024-10-01 15:57:06 +03:00
8feb375fbd tests : remove test-backend-ops (#2434) 2024-09-27 11:49:01 +03:00
69339af2d1 ci : disable failing CUDA and Java builds 2024-09-25 10:05:04 +03:00
0d2e2aed80 readme : fix references to download-ggml-model.sh (#2427)
The script itself has a hashbang indicating that it is a shell script,
but the README indicates that it must be executed with `bash`.

I checked the script itself, and it seems to be valid POSIX shell. I can
confirm that it works with busybox sh.

Clarify the reference on the README, so it is clear that bash is not
actually a dependency for this script.
2024-09-24 21:07:51 +03:00
451e9ee92c make : remove "talk" target until updated 2024-09-24 19:45:08 +03:00
1133ac98a8 ggml : add ggml-cpu-impl.h (skip) (#0) 2024-09-24 19:45:08 +03:00
76d27eec9a sync : ggml 2024-09-24 19:45:08 +03:00
fe18c29ab8 talk-llama : sync llama.cpp 2024-09-24 19:45:08 +03:00
234f9bd320 ggml : add AVX512DQ requirement for AVX512 builds (llama/9622) 2024-09-24 19:45:08 +03:00
3b183cfae7 log : add CONT level for continuing previous log entry (llama/9610) 2024-09-24 19:45:08 +03:00
02285dff81 threads: fix msvc build without openmp (llama/9615)
We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.
2024-09-24 19:45:08 +03:00
2fc1d20f9e cuda: add q8_0->f32 cpy operation (llama/9571)
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.
2024-09-24 19:45:08 +03:00
08e8414f27 threads: improve ggml_barrier scaling with large number of threads (llama/9598)
Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing.
This optimization shows performance improvements even for n_threads <= 8 cases.

Resurect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write
in the normal case and just use thread-fence as originally intended.
2024-09-24 19:45:08 +03:00
05c6139625 ggml : AVX512 gemm for Q4_0_8_8 (llama/9532)
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit commments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-24 19:45:08 +03:00
896c41ef30 metal : use F32 prec for K*Q in vec FA (llama/9595)
ggml-ci
2024-09-24 19:45:08 +03:00
c36ddc43c6 Revert "[SYCL] fallback mmvq (ggml/9088)" (llama/9579)
This reverts commit 50addec9a532a6518146ab837a85504850627316.
2024-09-24 19:45:08 +03:00
13f41af43e musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (llama/9526)
* mtgpu: add mp_21 support

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: enable unified memory

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-24 19:45:08 +03:00
3fc5306b82 Fix merge error in #9454 (llama/9589)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-24 19:45:08 +03:00
adf2474b10 CUDA: enable Gemma FA for HIP/Pascal (llama/9581) 2024-09-24 19:45:08 +03:00
008816a257 RWKV v6: RWKV_WKV op CUDA implementation (llama/9454)
* ggml: CUDA unary op EXP

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: rwkv_wkv op CUDA impl

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-24 19:45:08 +03:00
33e5a6612e ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (llama/9573) 2024-09-24 19:45:08 +03:00
f0a7d65b3d Update CUDA graph on scale change plus clear nodes/params (llama/9550)
* Avoid using saved CUDA graph if scale changes and reset nodes/params on update

Fixes https://github.com/ggerganov/llama.cpp/issues/9451

* clear before resize
2024-09-24 19:45:08 +03:00
54e5095765 examples : adapt to ggml.h changes (ggml/0)
ggml-ci
2024-09-24 19:45:08 +03:00
34291099fb ggml : refactoring (llama/#0)
- d6a04f87
- 23e0d70b
2024-09-24 19:45:08 +03:00
d245d7aec7 ggml : fix builds (llama/0)
ggml-ci
2024-09-24 19:45:08 +03:00
d661283e68 ggml : fix trailing whitespace (llama/0)
ggml-ci
2024-09-24 19:45:08 +03:00
c0761c95f5 CUDA: fix sum.cu compilation for CUDA < 11.7 (llama/9562) 2024-09-24 19:45:08 +03:00
138e20b697 ggml : fix n_threads_cur initialization with one thread (llama/9538)
* ggml : fix n_threads_cur initialization with one thread

* Update ggml/src/ggml.c

---------

Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
2024-09-24 19:45:08 +03:00
a8d9abfa22 threadpool : skip polling for unused threads (llama/9461)
* threadpool: skip polling for unused threads

Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1).
This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur).

n_threads_cur is now an atomic_int to explicitly tell thread sanitizer that it is written
from one thread and read from other threads (not a race conditions).

* threadpool: further simplify and improve ggml_barrier

Avoid using strict memory order while polling, yet make sure that all threads go through
full memory barrier (memory fence) on ggml_barrier entrace and exit.

* threads: add simple barrier test

This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.

* threadpool: improve thread sync for new-graphs

Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order
to keep it efficient, once the new graph is detected we do full fence using read-modify-write
with strict memory order.

* threadpool: improve abort handling

Do not use threadpool->ec (exit code) to decide whether to exit the compute loop.
threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it.

Instead introduce atomic threadpool->abort flag used for this. This is consistent with
how we handle threadpool->stop or pause.

While at it add an explicit atomic_load for n_threads_cur for consistency.

* test-barrier: release threadpool before releasing the context

fixes use-after-free detected by gcc thread-sanitizer on x86-64
for some reason llvm sanitizer is not detecting this issue.
2024-09-24 19:45:08 +03:00
195afd6dc1 ggml : link MATH_LIBRARY not by its full path (llama/9339) 2024-09-24 19:45:08 +03:00
1fd78999e8 cmake : do not hide GGML options + rename option (llama/9465)
* cmake : do not hide GGML options

ggml-ci

* build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS

for consistency

ggml-ci
2024-09-24 19:45:08 +03:00
Eve
374e9e0c5e ggml : IQ4_NL sgemm + Q4_0 AVX optimization (llama/9422)
* squashed

readd my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049

have ggml_vec_dot_q4_0 do two blocks per loop for avx

try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue

* shuffle

* remove f16c iq4_nl as i cant make it faster than before
2024-09-24 19:45:08 +03:00
a2cb5b4183 metal : handle zero-sized allocs (llama/9466) 2024-09-24 19:45:08 +03:00
288ae5176e common : reimplement logging (llama/9418)
https://github.com/ggerganov/llama.cpp/pull/9418
2024-09-24 19:45:08 +03:00
d868122a5a cmake : correct order of sycl flags (llama/9497) 2024-09-24 19:45:08 +03:00
2ba25fb122 cmake : try to fix sycl+intel build (llama/9487) 2024-09-24 19:45:08 +03:00
4f4687cb74 ggml : ggml_type_name return "NONE" for invalid values (llama/9458)
When running on Windows, the quantization utility attempts to print the types that are not set which leads to a crash.
2024-09-24 19:45:08 +03:00
66b00fad0d cmake : use list(APPEND ...) instead of set() + dedup linker (llama/9463)
* cmake : use list(APPEND ...) instead of set() + dedup linker

ggml-ci

* cmake : try fix sycl

* cmake : try to fix sycl 2

* cmake : fix sycl build (llama/9469)

* try fix sycl build

* use CMAKE_CXX_FLAGS as a string variable

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* one more CMAKE_CXX_FLAGS fix (llama/9471)

---------

Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>
2024-09-24 19:45:08 +03:00
c6cc8d16c3 cann: Add host buffer type for Ascend NPU (llama/9406)
* feat: Add host buffer type for Ascend NPU(CANN backend)

* fix some checking errors

* Add a few comments
2024-09-24 19:45:08 +03:00
3f8f8a78a2 riscv : modify Makefile and add a RISCV_VECT to print log info (llama/9442)
- Added ggml_cpu_has_riscv_v() in GGML to print system info in log
- Modified Makefile to only use flag when cross compiling for RISC-V
2024-09-24 19:45:08 +03:00
3e47686919 cann: Fix error when running a non-exist op (llama/9424) 2024-09-24 19:45:08 +03:00
a53b69a003 CUDA: fix --split-mode row race condition (llama/9413) 2024-09-24 19:45:08 +03:00
d1c9b47360 musa: remove Clang builtins mapping (llama/9421)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-24 19:45:08 +03:00
32f659861a sycl : update support conditions (llama/9394)
* sycl : update support condition to im2col

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* Added TODO to remind supporting FP32 im2col

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
2024-09-24 19:45:08 +03:00
a785232bf9 metal : fix compile warning with GGML_METAL_NDEBUG (llama/0) 2024-09-24 19:45:08 +03:00
0677293503 rpc : fix segfault with nkvo (llama/9389)
* rpc : fix nkvo

* rpc : buf_size must not be static

ref: #9337

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-09-24 19:45:08 +03:00
1fbdb813c0 ggml : vector length agnostic SVE support (llama/9290)
* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths

* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths

* Removed WhiteSpaces

* ggml : style changes + fix 512-bit nb loop check

- fix local scope in switch cases
- consistent predicate names
- empty lines when necessary
- opening braces, spaces
- const-correctness
- add asserts

* Update ggml/src/ggml-quants.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-24 19:45:08 +03:00
67725ac8f3 CUDA: fix variable name conflict for Windows build (llama/9382) 2024-09-24 19:45:08 +03:00
dac89af357 Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (llama/9118)
* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.

* fix compile issues

* Fix issues where the last submit wasn't executed or handled properly.

* remove trailing whitespace

* Repair GGML_VULKAN_CHECK_RESULTS

* Increase submit counter only if actual work has been submitted and increase submit count to 100.

* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.
2024-09-24 19:45:08 +03:00
26225f1fb0 cuda : fix FA Q src index (1 -> 0) (llama/9374) 2024-09-24 19:45:08 +03:00
3468983315 add check malloc result on device (llama/9346)
* add check malloc result on device

* update for review comments, check all malloc_device() result

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-09-24 19:45:08 +03:00
c7515b0995 ggml/examples: add backend support for numerical optimization (ggml/949)
* CUDA eval works

* stochastic gradient descent op

* Adam except decay

* CUDA CROSS_ENTROPY_LOSS_BACK

* CUDA mnist-fc training works

* backend CLI arg

* refactor gguf load

* remove sched from opt_step_adam

* implement l1 regularization (weight decay)

* extra call to add optimizer

* initialize gradients with ggml_graph_reset

* gradient accumulation

* increment iter per eval instead of epoch

* adjust backend interfaces

* fix ggml_graph_reset without backend

* fix ggml graph export/import

* fixup

* rename

* revert ggml_opt changes

* more general CUDA repeat_back

* update documentation, fix CNN

* validation split

* add clarifying comment

* optimize PyTorch training

* adjust buffer size, thread count

* fix 0.0f validation split

* Update examples/mnist/mnist-common.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix gradient accumulation

* tensor flag for accumulators -> tensor hash set

* Update include/ggml.h

Co-authored-by: slaren <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* fix test prints

* Update src/ggml-backend.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* better CUDA support for noncontiguous out_prod

* add comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-09-24 19:45:08 +03:00
253ce30004 examples : add null threadpool args where needed (ggml/0)
ggml-ci
2024-09-24 19:45:08 +03:00
03a6fae484 metal : update support condition for im2col + fix warning (llama/0) 2024-09-24 19:45:08 +03:00
d37fd275fd ggml : always check bounds on get_rows operations (llama/9354) 2024-09-24 19:45:08 +03:00
195877fd72 ggml : fix missing cpu_set_t on emscripten (llama/9336)
* ggml : fix missing cpu_set_t on emscripten

* better version

* bring back android part
2024-09-24 19:45:08 +03:00
9e715e1b96 Improve Vulkan shader build system (llama/9239)
* Improve Vulkan shader builds system

- Add dependency to vulkan-shaders-gen to rebuild shaders when changing the shader compilation utility.
- Add option to generate debug info for Vulkan shaders to provide shader source to Vulkan shader profiling tools

* remove not required self dependency
2024-09-24 19:45:08 +03:00
6f5514b6e2 ggml-quants : ternary packing for TriLMs and BitNet b1.58 (llama/8151)
* ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b

* ggml-quants : faster 1.625 bpw AVX2 vec_dot

Not using a lookup table anymore makes it match q4_0 speed.

* gguf-py : fix formatting

* llama : remove spaces on empty line

* ggml-quants : subtract 1 when back in epi8

This makes the 1.625 bpw type go faster than q4_0. Still not the fastest.

* ggml-quants : Q2_2 now faster than Q4_K on with AVX2

* ggml-quants : cleanup Q1_3 code formatting

* ggml-quants : ARM NEON vec_dot for q2_2 and q1_3

* ggml-quants : use ceiling division when quantizing q1_3

* convert-hf : simplify BitNet pre-quantization

This still results in the exact same tensor weights and scales,
but it reveals some weirdness in the current algorithm.

* convert-hf : allow converting the weird BitNet 1.3B

Its FFN size is 5460 which is not convenient.
The offending tensors are kept in F16,
which makes the final model 5.01 bpw.

* bitnet : replace 1.58b with b1.58, as in the paper

* ggml-quants : fix build failure on Windows

* ggml-quants : attempt to fix Arm 32-bit support

* ggml : add some informative comments in q1_3 vec_dot

* ggml : add TQ1_0 and TQ2_0 ternary quantization types

* ggml : even faster TQ2_0

* ggml : also faster TQ1_0

Same optimization as for TQ2_0 by offsetting the sum instead of the weights.
This makes TQ1_0 almost as fast as Q8_0 on AVX2.

* ggml : fix build issues in certain environments

* ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0

* ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat

The compiler seems smart enough to use the same instruction
even when using vget_high_s8 instead.

* ggml : remove q1_3 and q2_2

No more 1.625 bpw and 2.000 bpw,
now instead using 1.6875 bpw and 2.0625 bpw
with TQ1_0 and TQ2_0, respectively.

* llama : remove the separate scale tensors of BitNet b1.58

They won't be needed, since the remaining ternary quant types have
built-in scales.

* ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency

* ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot

Not yet tested on hardware which supports it,
might not work or might not even compile. But also it might.
It should make the performance better on recent ARM CPUs.

* ggml-quants : remove comment about possible format change of TQ2_0

Making it slightly more convenient for AVX512
but less convenient for everything else is not worth the trouble.

* gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0

* ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0

This does not change anything for ternary models,
since their values should never end up being in halfway cases anyway.

* convert : allow direct conversion to TQ1_0 and TQ2_0

The token embeddings and output tensors are kept in F16
to allow quantizing them to Q4_K and Q6_K with llama-quantize.

* llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0

Q4_0 is not completely symmetric (so not lossless for ternary models),
but it should be good enough.

* ggml-quants : allow using ARM dot product instructions for TQ1_0

* ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support

* ggml : remove unused ggml_mul special case

It would otherwise conflict with the more general
optimization coming with Mamba-2.

* ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators

* test-backend-ops : add TQ1_0 and TQ2_0 comments for later

Not yet adding uncommented, because some backends like SYCL and Metal
do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT.
(and Metal also doesn't handle it with GGML_OP_GET_ROWS)
Support for TQ1_0 and TQ2_0 for other backends than CPU
will be added in follow-up pull requests.
2024-09-24 19:45:08 +03:00
709a22b92d cuda : fix defrag with quantized KV (llama/9319) 2024-09-24 19:45:08 +03:00
01e214a1d7 ggml : AVX2 support for Q4_0_8_8 (llama/8713)
* Add AVX2 based implementations for quantize_q8_0_4x8, ggml_gemv_q4_0_8x8_q8_0 and ggml_gemm_q4_0_8x8_q8_0 functions

* Update code to fix issues occuring due to non alignment of elements to be processed as multiple of 16 in MSVC

* Update comments and indentation

* Make updates to reduce number of load instructions
2024-09-24 19:45:08 +03:00
1cecfe6a02 Fix DMMV dequantization (llama/9279)
Fixed dmmv dequant for ncols== GGML_SYCL_DMMV_X
2024-09-24 19:45:08 +03:00
3764bc974c ggml : add pthread includes on FreeBSD (llama/9258) 2024-09-24 19:45:08 +03:00
fcffc912a9 llama : support RWKV v6 models (llama/8980)
* convert_hf_to_gguf: Add support for RWKV v6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add RWKV tokenization

* Fix build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Do not use special tokens when matching in RWKV tokenizer

* Fix model loading

* Add (broken) placeholder graph builder for RWKV

* Add workaround for kv cache

* Add logits conversion to rwkv5

* Add rwkv5 layer norms

* Add time mix KVRG & correct merge mistake

* Add remaining time mix parameters

* Add time mix output loading

* Add placeholder llm_build_time_mix

* Fix build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Load more tensors for rwkv v6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix rwkv tokenizer

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: Add unary operator Exp

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV v6 graph building

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add ``rescale_every_n_layers`` parameter

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add ``wkv.head_size`` key for RWKV

so it doesn't reuse Mamba ssm parameters

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix offloading layers to CUDA

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix parallel inferencing for RWKV

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Remove trailing whitespaces

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* build_rwkv: Avoid using inplace operations

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* convert_hf_to_gguf: rwkv: Avoid using ``eval``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* convert_hf_to_gguf: rwkv tokenizer: Don't escape sequences manually

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* ggml: Add backward computation for unary op ``exp``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* Use MODEL_ARCH.RWKV6 instead of MODEL_ARCH.RWKV

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* build_rwkv6: Simplify graph

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Detect model.type

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Fix tensor loading for 7B/14B models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Fix group_norm assertion failure with Metal

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Clean up

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Add quantization tensor exclusion

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Use the new advanced batch splits

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* llama: rwkv6: Use ``ggml_norm`` instead of ``ggml_group_norm``

Co-authored-by: compilade <git@compilade.net>

* llama: rwkv6: Apply code style and misc changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* converter: Use class name ``Rwkv6Model``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Make use of key ``feed_forward_length``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Add kv ``time_mix_extra_dim`` and ``time_decay_extra_dim``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* converter: Match ``new_name`` instead of ``name`` for float32 explicit tensors

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Keep ``time_mix_w1/w2`` as F32

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Remove unused nodes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Apply code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Add lora for some supported tensors

Currently att.key/receptance/value/gate/output, ffn.receptance/key/value, as well as head.weight

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* rwkv : speed-up tokenization using trie

* minor : style + indentation

* llama: rwkv6: Avoid division by zero

Co-authored-by: compilade <git@compilade.net>

* ggml: rwkv_wkv: Avoid copying the state

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Layl Bongers <3094382+LaylBongers@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-24 19:45:08 +03:00
38d40b9972 Threadpool: take 2 (llama/8672)
* Introduce ggml_compute_threadpool

- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems

* Minor fixes

* fixed use after release bug

* fixed a harmless race condition

* Fix Android bulid issue

* fix more race conditions

* fix deadlock for cases where cgraph.n_nodes == 1

and fix --poll case

* threadpool: use cpu_get_num_math to set the default number of threadpool threads

This way we avoid using E-Cores and Hyperthreaded siblings.

* bench: create fresh threadpool for each test

For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).

* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier

This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.

* threadpool: make polling the default to match openmp behavior

All command line args now allow for setting poll to 0 (false).

* threadpool: do not wakeup threads in already paused threadpool

* fix potential race condition in check_for_work

* threadpool: do not create two threadpools if their params are identical

* threadpool: reduce pause/resume/wakeup overhead in common cases

We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.

* threadpool: add support for hybrid polling

poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...

The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.

* threadpool: reduce the number of barrier required

New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.

* threadpool: remove special-casing for disposable threadpools

With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.

Include n_threads in debug print for disposable threadpool.

Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.

* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)

This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.

* threadpool: use relaxed order for chunk sync

Full memory barrier is an overkill for this since each thread works on different chunk

* threadpool: remove abort_callback from threadpool state

* threadpool: better naming for thread/cpumask releated functions

* threadpool: consistent use of int type for n_threads params

* threadpool: add support for ggml_threadpool_params_default/init

Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.

* threadpool: move typedef into ggml.h

* threadpool: fix apply_priority() function name

* threadpool: fix swift wrapper errors due to n_threads int type cleanup

* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled

* threadpool: replace checks for compute_thread ret code with proper status check

* threadpool: simplify threadpool init logic and fix main thread affinity application

Most of the init code is now exactly the same between threadpool and openmp.

* threadpool: update threadpool resume/pause function names

* threadpool: enable openmp by default for now

* threadpool: don't forget to free workers state when omp is enabled

* threadpool: avoid updating process priority on the platforms that do not require it

On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.

* threadpool: update calling thread prio and affinity only at start/resume

This avoids extra syscalls for each graph_compute()

* llama-bench: turn threadpool params into vectors, add output headers, etc

* llama-bench: add support for cool off between tests --delay

This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.

* threadpool: move process priority setting into the apps (bench and cli)

This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.

* threadpool: move all pause/resume logic into ggml

* threadpool: futher api cleanup and prep for future refactoring

All threadpool related functions and structs use ggml_threadpool prefix.

* threadpool: minor indent fixes

* threadpool: improve setprioty error message

* Update examples/llama-bench/llama-bench.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* threadpool: fix indent in set_threadpool call

* use int32_t for n_thread type in public llama.cpp API

* threadpool: use _new and _free instead of _create and _release

* fix two more public APIs to use int32_t for n_threads

* build: set _GNU_SOURCE for Adroid

---------

Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-09-24 19:45:08 +03:00
09149ee0ae vulkan: fix compilation with GGML_VULKAN_DEBUG=ON (ggml/948)
the old code was trying to print a non-existent field (size)
and the struct as a whole (which doesn't have a operator<<
override defined).
Probably a typo happened during refactoring.

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-09-24 19:45:08 +03:00
6b7f37dd5c vulkan: add dryrun support to sin and cos ops (ggml/947)
sin and cos failed test-backend-ops because they
tried to dereference a context pointer that is null
on dry runs.

This commit prevents that segfault.

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-09-24 19:45:08 +03:00
791812fb54 vulkan: correctly report support for OP_CONT (ggml/946)
test-backend-ops fails because ggml_cont aborts
when invoked passing an unsupported type.

This commit makes ggml_cont tests pass

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-09-24 19:45:08 +03:00
5d6dc19f04 tests: add gradient tests for all backends (ggml/932)
* tests: add gradient checking to test-backend-ops

* remove old comment

* reorder includes

* adjust SIN/COS parameters

* add documentation, use supports_op if possible
2024-09-24 19:45:08 +03:00
34972dbe22 go : add temperature options (#2417)
* Fixed go cuda bindings building

* Added note to go bindings Readme to build using cuda support

* Added temperature bindings for Go

---------

Co-authored-by: Binozo <entwickler@binozoworks.de>
2024-09-20 15:45:36 +03:00
bea43e0c64 docker : add libsdl2-dev for container builds (#2424)
* Added libsdl2-dev for SDL samples

Building talk-llama seems to fail here as there is no sdl.h.

* Adding libsdl2-dev for sdl.h

* Adding libsdl2-dev for sdl.h
2024-09-20 15:36:43 +03:00
3853d83d73 go : add tests and update bindings (#2425)
Update Go version to ^1.23, the actions/setup-go
to v5, actions/checkout to v4 and
github.com/stretchr/testify to v1.9.0.

Add test cases for the following model
struct methods:
 - New
 - Close
 - NewContext
 - IsMultilingual
 - Languages

Add test cases for the following context
struct methods:
 - SetLanguage
 - IsMultilingual
 - Language
 - Process
2024-09-20 15:36:12 +03:00
5b1ce40fa8 server : use OS-generated temp file name for converted files (#2419) 2024-09-17 15:56:32 +03:00
049b3a0e53 go : fix CUDA build (#2416)
* Fixed go cuda bindings building

* Added note to go bindings Readme to build using cuda support

---------

Co-authored-by: Binozo <entwickler@binozoworks.de>
2024-09-15 12:23:56 +03:00
a551933542 cann : add Ascend NPU instructions (#2410) 2024-09-11 15:59:24 +03:00
5caa19240d cmake: Fix libdir value in pkgconfig file (#2407)
Depending on the OS the lib dir can vary, on Fedora for instance it is
"${prefix}/lib64". Instead of hard-coding the directory name, let CMake fill
this variable for us.
2024-09-07 11:18:17 +03:00
5236f02784 revert : cmake : set MSVC to use UTF-8 on source files (#2346)
This reverts commit c96906d84d.
2024-09-02 15:24:50 +03:00
2abaf19e0d sync : ggml 2024-09-02 15:24:50 +03:00
6eb7a0ffbd ggml: fix ggml_graph_cpy undefined behavior (ggml/943) 2024-09-02 15:24:50 +03:00
e8f0f9b5f0 cann : fix doxy (ggml/0) 2024-09-02 15:24:50 +03:00
d8e24b877d vulkan : fix build (llama/0)
ggml-ci
2024-09-02 15:24:50 +03:00
cc68f31577 cuda : mark BF16 CONT as unsupported 2024-09-02 15:24:50 +03:00
4a4a52bf98 ggml : fix cont with transposed tensors when one dimension is 1 (ggml/934)
* ggml_cont: fix issue with transposed tensors when one dimension is 1

when using multiple threads, it is not enough
to check for the tensors to be contiguous for
ggml_compute_forward_dup_same_cont to work correctly.
The tensors strides also need to match.

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

* Add ggml_cont tests

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

* Remove dead code

it isn't possible to reach this code because
all these functions are invoked by ggml_compute_forward_dup
if and only if src0->type != dst->type

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

* Make ggml_compute_forward_dup_same_cont work with contiguous tensors

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

---------

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-02 15:24:50 +03:00
c96906d84d cmake : set MSVC to use UTF-8 on source files (#2346) 2024-08-30 14:04:04 +03:00
9600fc3eb1 readme : remove invalid flag from Python example (#2396)
* Update README.md

Fix broken C-style API link

* Update whisper_processor.py

Update examples/python/whisper_processor.py to remove nonexistent flag "-np" from subprocess.Popen call.

* Add pywhispercpp to the Pybind11 Python wrapper list

abdeladim-s/pywhispercpp wasn't added to the list / was removed at some point (?)

It was referenced in issue #9, so I feel like it's worthy of being added as it's the first if not one of the first Python wrappers for whisper.cpp
2024-08-30 14:00:38 +03:00
e2e55a6fed readme : fix link (#2394) 2024-08-30 13:58:22 +03:00
c4e1861d2c go : add beamsize/entropythold/maxcontext to context interface (#2350)
* feat(go binding): add beamsize/entropythold/maxcontext to context interface

fixes: #2349

* fix go building build

* fix dynamic link .so and header.h

* remove LD_LIBRARY_PATH

* remove ggml obj from whisper dynamic lib

* drop LIB_GGML
2024-08-28 17:09:01 +03:00
da9809f243 talk-llama : sync llama.cpp 2024-08-28 13:22:20 +03:00
9d754a56cf whisper : update FA call 2024-08-28 13:22:20 +03:00
8cc90a0e80 sync : ggml 2024-08-28 13:22:20 +03:00
82b5c56f63 sync : vulkan (skip) (llama/0) 2024-08-28 13:22:20 +03:00
b2ad484c89 ggml : do not crash when quantizing q4_x_x with an imatrix (llama/9192) 2024-08-28 13:22:20 +03:00
d96a17848f metal : separate scale and mask from QKT in FA kernel (llama/9189)
* metal : separate scale and mask from QKT in FA kernel

* metal : ne01 check no longer necessary

* metal : keep data in local memory
2024-08-28 13:22:20 +03:00
0e7798677a ggml : add SSM Metal kernels (llama/8546)
* ggml : add ggml_ssm_conv metal impl

* ggml : add ssm_scan metal impl

ggml-ci
2024-08-28 13:22:20 +03:00
58a36d2e3b metal : gemma2 flash attention support (llama/9159) 2024-08-28 13:22:20 +03:00
24d8534bd8 CPU/CUDA: Gemma 2 FlashAttention support (llama/8542)
* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check
2024-08-28 13:22:20 +03:00
9b16ddd3a5 Add a space to supress a cmake warning (llama/9133) 2024-08-28 13:22:20 +03:00
32f88af17b Add oneDNN primitive support (llama/9091)
* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc
2024-08-28 13:22:20 +03:00
9bf7250bf9 llama : simplify Mamba with advanced batch splits (llama/8526)
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-28 13:22:20 +03:00
17e49d3ab2 fallback mmvq (llama/9088)
* fallback mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>

---------

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
2024-08-28 13:22:20 +03:00
58b725282a Fix SYCL im2col and convert Overflow with Large Dims (llama/9052)
* sycl: fix im2col overflow and sync with cuda

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert overflow

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert and dequantize

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix ib in dmmv

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl:refine convert

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: move downsample global_range into common

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: add im2col and convert test cases

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: make new cases only in sycl

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: comment new test_cases for only local testing

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-08-28 13:22:20 +03:00
7e59afa1e0 rpc : print error message when failed to connect endpoint (llama/9042) 2024-08-28 13:22:20 +03:00
5ac022140e rpc : prevent crashes on invalid input (llama/9040)
Add more checks which prevent RPC server from crashing if invalid input
is received from client
2024-08-28 13:22:20 +03:00
0eaa67280c ggml : dynamic ggml_sched_max_splits based on graph_size (llama/9047)
* ggml : Dynamic ggml_sched_max_splits based on graph_size

* Fixed and readded debug code for causes
2024-08-28 13:22:20 +03:00
5a62fdb735 cmake : remove unused option GGML_CURL (llama/9011) 2024-08-28 13:22:20 +03:00
60098d6204 ggml : move rope type enum to ggml.h (llama/8949)
* ggml : move rope type enum to ggml.h

This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.

The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.

Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.

* squash! ggml : move rope type enum to ggml.h

This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h, and back the llama_rope_type enum.

I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.

* squash! ggml : move rope type enum to ggml.h

This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change.

* squash! ggml : move rope type enum to ggml.h

This commit contains a suggestion enable the GGML_ROPE_TYPE_NEOX
macro/define to be passed to the shader compiler.

* squash! ggml : move rope type enum to ggml.h

This commit fixes the editorconfig-checker warnings.

* squash! ggml : move rope type enum to ggml.h

Update comment for ggml_rope function.

* Revert "squash! ggml : move rope type enum to ggml.h"

This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6.

* squash! ggml : move rope type enum to ggml.h

Add GGML_ROPE_TYPE_NEOX to rope_common.comp.

* remove extra line

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-28 13:22:20 +03:00
317293e6a7 ggml: fix div-by-zero (llama/9003)
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724

In order to access the above bug you need to login using one of the
emails in
https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5

Signed-off-by: David Korczynski <david@adalogics.com>
2024-08-28 13:22:20 +03:00
488a966c07 Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (llama/8943)
* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduce a full pipeline sync which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader read/writes and transfers seems to be sufficient looking at the code which either launches compute kernels or copies tensors.

* Fix small typo

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-08-28 13:22:20 +03:00
8954769aa2 feat: ref. cross entropy, add CUDA, fix grad test (ggml/929) 2024-08-28 13:22:20 +03:00
df06468d9e ggml: remove bad assert (ggml/928) 2024-08-28 13:22:20 +03:00
1fbd828a5d examples: add MNIST training + missing ops 2024-08-28 13:22:20 +03:00
d2986f8b07 models : add support for wget2 for fedora (#2387) 2024-08-28 11:46:01 +03:00
8bfa8574e2 readme : update the path to bench.py (#2386) 2024-08-28 11:45:05 +03:00
376567bf4f readme : fix typo (#2383) 2024-08-28 11:42:18 +03:00
c0fd64a9c0 readme : fix broken links in implementation details section (#2382) 2024-08-28 11:41:51 +03:00
6e9596f6de whisper : fix compile warning for unused params 2024-08-28 11:40:11 +03:00
9e3c5345cd sync : ggml vulkan (ggml/0)
ggml-ci
2024-08-21 11:07:13 +03:00
b6c05ce82f yolo : add backend support (ggml/924)
* yolo : add backend support

* metal : add sub and sqrt kernels

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-21 11:07:13 +03:00
52c80cac00 ggml : fix typo in ggml-quants.c comment (ggml/922) 2024-08-21 11:07:13 +03:00
3643120690 feat: add new sin and cos operators (ggml/919)
* ggml : add sin/cos operators

* ggml-cuda : add sin/cos operators

* ggml : add corresponding tests for sin/cos

* ggml : add backward computation for sin/cos operators

* ggml-vulkan : add sin/cos operators

* ggml-vulkan : add sin/cos shader source

* metal : add sin, cos

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-21 11:07:13 +03:00
d65786ea54 readme : fix broken links (#2358)
For whisper.cpp and whisper.h files
2024-08-20 10:57:45 +03:00
7f78675008 examples : use colorblind friendly TTY color scheme (#2360)
This change updates the -pc flag, so that a new xterm256 color scheme is
used. This color scheme is believed to be better for three reasons:

1. It should be friendlier to the colorblind. The scheme was designed by
   Paul Tol (see: https://personal.sron.nl/~pault/). TensorBoard uses it
   since 2017, so it's already popular in the machine learning community

2. It should appear to be the same colors as before to people who aren't
   i.e. it's still a red-green spectrum like before but lightly modified

3. It is readable in both white and black background terminals. The neon
   colors before were probably a bit too intense for white backgrounds.
2024-08-20 10:49:10 +03:00
22fcd5fd11 sync : ggml 2024-08-12 11:59:15 +03:00
993f0df419 ggml : support forward pass broadcasting in ggml_sub (ggml/914)
* ggml: support forward pass broadcasting in ggml_sub

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

* Use assert instead of GGML_ASSERT in ggml_compute_forward_sub_f32

The check is already performed in ggml_sub_impl

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

---------

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-08-12 11:58:49 +03:00
9b1788483c metal : fix uninitialized abort_callback (llama/8968) 2024-08-12 11:58:49 +03:00
ad37d26983 rpc : sanitize tensor data + warnings (llama/0)
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-12 11:58:46 +03:00
81c999fe0a cann : add Ascend NPU support (#2336)
* enable Ascend NPU in src/whisper.cpp
  * sync test-backend-ops with llama.cpp
2024-08-09 15:21:56 +03:00
4b7de08bfd whisper : fix compile warning (#0) 2024-08-09 09:58:16 +03:00
4b9c4de1ad sync : ggml 2024-08-09 09:58:16 +03:00
be88ee1d75 ggml : add CANN backend (llama/0)
ggml-ci
2024-08-09 09:58:16 +03:00
3ab19c744e scripts : sync cann 2024-08-09 09:58:16 +03:00
6eac06759b ci : disable ruby workflow (#0) 2024-08-08 22:48:46 +03:00
2e9a5bd2c4 ci : try to fix FreeBSD (#0) 2024-08-08 22:48:46 +03:00
58323bf8ed build : fix aarch64 (#0) 2024-08-08 22:48:46 +03:00
22058f2dbc talk-llama : sync llama.cpp 2024-08-08 22:48:46 +03:00
5b7979a1e6 sync : ggml 2024-08-08 22:48:46 +03:00
ee14c02365 ggml-backend : fix async copy from CPU (llama/8897)
* ggml-backend : fix async copy from CPU

* cuda : more reliable async copy, fix stream used when the devices are the same
2024-08-08 22:48:46 +03:00
ab39dd34e1 Updated SYCL device filtering (llama/8901)
* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme
2024-08-08 22:48:46 +03:00
b1348d3530 CUDA/HIP: fix tests/test-backend-ops (llama/8896) 2024-08-08 22:48:46 +03:00
90641b5cf4 CUDA: fix padding logic for FP16/FP32 (llama/8884) 2024-08-08 22:48:46 +03:00
4160b930f1 ggml : add epsilon as a parameter for group_norm (llama/8818)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-08-08 22:48:46 +03:00
7a96e661e4 ggml : fix overflows in elu function (llama/8866)
It's helpful to use expm1f(x), because expf(x)-1 will result in overflow
for 25% of single-precision floating point numbers.
2024-08-08 22:48:46 +03:00
a902fb4ab2 ggml : reading the runtime sve config of the cpu (llama/8709)
* ggml : reading the runtime sve config of the cpu

* change to one time init to prevent performance drop

* prefix variable to avoid possible conflicts

* revert xxhash fix and add brackets

---------

Co-authored-by: domke <673751-domke@users.noreply.gitlab.com>
2024-08-08 22:48:46 +03:00
6cb38c3673 Fix conversion of unnormalized BF16->BF16 weights (llama/7843)
* add truncate_bf16

* truncate intermediate fp32 if converting bf16 to bf16

* fix masking in __compute_fp32_to_bf16

* np.int16 no longer used

* missing cast and additional numpy 2.x fix

* ggml-impl : do not flush bf16 subnormals to zero

* ggml : add reference fp32 to bf16 conversion

The fast version is no longer equivalent for all platforms
because of the handling of subnormal values.

* gguf-py : remove flush to zero for bf16 subnormals

* gguf-py : remove float32 truncation to bf16

Rounding achieves the same thing in the cases where this was used.

* missed prototype update in merge

* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-08-08 22:48:46 +03:00
9cf14ebcbc Fixing wrong VDR iq4nl value (llama/8812) 2024-08-08 22:48:46 +03:00
8e39ee171f ggml-cuda: Adding support for unified memory (llama/8035)
* Adding support for unified memory

* adding again the documentation about unified memory

* refactoring: Moved the unified memory code in the correct location.

* Fixed compilation error when using hipblas

* cleaning up the documentation

* Updating the documentation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* adding one more case where the PR should not be enabled

---------

Co-authored-by: matteo serva <matteo.serva@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-08-08 22:48:46 +03:00
d26250f78c Build: Only include execinfo.h on linux systems that support it (llama/8783)
* Only enable backtrace on GLIBC linux systems

* fix missing file from copy

* use glibc macro instead of defining a custom one
2024-08-08 22:48:46 +03:00
5218ea21b8 cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (llama/8800)
* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X

* update asserts

* only use dmmv for supported types

* add test
2024-08-08 22:48:46 +03:00
e60be821ce added android implementation of ggml_print_backtrace_symbols (llama/8751)
* added android implementation of ggml_print_backtrace_symbols

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-08 22:48:46 +03:00
19708df884 cann: update cmake (llama/8765) 2024-08-08 22:48:46 +03:00
3f190addda Add TIMESTEP_EMBEDDING OP (llama/8707)
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-08-08 22:48:46 +03:00
b355ee7cfa ggml: bugfix: fix the inactive elements is agnostic for risc-v vector (llama/8748)
In these codes, we want to retain the value that they previously held
when mask[i] is false. So we should use undisturbed. With the default
agnostic policy of rvv intrinsic, these values can be held or be
written with 1s.

Co-authored-by: carter.li <carter.li@starfivetech.com>
2024-08-08 22:48:46 +03:00
49ac8872b4 cuda : organize vendor-specific headers into vendors directory (llama/8746)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-08-08 22:48:46 +03:00
8ef98ae7e3 add conv support (llama/8688) 2024-08-08 22:48:46 +03:00
e471adcfa5 feat: Support Moore Threads GPU (llama/8383)
* Update doc for MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in Makefile

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in CMake

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* CUDA => MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* MUSA adds support for __vsubss4

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Fix CI build failure

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-08-08 22:48:46 +03:00
aa816c922c ggml : ignore more msvc warnings (ggml/906) 2024-08-08 22:48:46 +03:00
b3264eb266 metal : fix struct name (ggml/912)
ggml-ci
2024-08-08 22:48:46 +03:00
eb2eb87a58 metal : add abort callback (ggml/905) 2024-08-08 22:48:46 +03:00
83fcb0e486 vulkan : implement Stable Diffusion operators (ggml/904)
* Fix Vulkan repeat op

* Implement Vulkan concat op

* Delete old Vulkan shader generator

* Implement Vulkan im2col op

* Implement Vulkan unary gelu_quick op

* Implement Vulkan group_norm op

* Implement Vulkan timestep_embedding op

* Implement Vulkan upscale op

* Fix Vulkan vk_context tensor extra index issue

* Fix Vulkan matmul shader parameter bug

* Properly fix Vulkan matmul shader parameter bug

* Add Vulkan ADD f16 + f32 -> f16 operator support

* Implement Vulkan tanh op

* Fix Vulkan group count too large Validation error on non-Nvidia GPUs

* Throw error when too much memory is requested

* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs

* Fix matmul MMQ condition

* Implement Vulkan pad op

* Fix Vulkan crash when tensor is used multiple times in a compute graph

* Add Vulkan CONCAT f16 + f16 -> f16 op

* Add Vulkan LEAKY_RELU op
2024-08-08 22:48:46 +03:00
f7bb412878 ggml : move c parameter comment to ggml_rope_ext (ggml/901)
This commit moves the comment for the c parameter from ggml_rope to
ggml_rope_ext. The comment is currently incorrect as ggml_rope does not
have a c parameter (freq_factors tensor).

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-08-08 22:48:46 +03:00
ef6dcf0d0c ggml : resolve sync conflicst (ggml/0)
ggml-ci
2024-08-08 22:48:46 +03:00
c7ea4fd235 common : handle new quant types (ggml/0) 2024-08-08 22:48:46 +03:00
525f190917 ggml : add ggml-aarch64 (ggml/0) 2024-08-08 22:48:46 +03:00
dd916a2852 ggml : reduce hash table reset cost (llama/8698)
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string
2024-08-08 22:48:46 +03:00
0620fe00ec ggml: handle ggml_init failure to fix NULL pointer deref (llama/8692)
`ggml_init` can fail if no unused context is found. In that case, a NULL-pointer deref will happen later in the code during a call to `ggml_set_on_alloc`.

This fixes it by bailing out if no context is found.
2024-08-08 22:48:46 +03:00
31d0a9a14f fix multi-gpu issue on sycl (llama/8554)
---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
2024-08-08 22:48:46 +03:00
c06970dd72 ggml : add and use ggml_cpu_has_llamafile() (llama/8664) 2024-08-08 22:48:46 +03:00
7598acf525 Re-add erroneously removed -fsycl from GGML_EXTRA_LIBS (llama/8667) 2024-08-08 22:48:46 +03:00
43ddfce969 sycl : Add support for non-release DPC++ & oneMKL (llama/8644)
* Update cmake to support nvidia hardware & open-source compiler
---------
Signed-off-by: Joe Todd <joe.todd@codeplay.com>
2024-08-08 22:48:46 +03:00
a7e6d2cd9c Vulkan IQ4_NL Support (llama/8613)
* Fix Vulkan matmul tests compile errors

* Add Vulkan IQ4_NL support

* Fix Vulkan DeepSeek-Coder-V2-Lite MoE support
2024-08-08 22:48:46 +03:00
86506b0c5c Allow all RDNA2 archs to use sdot4 intrinsic (llama/8629)
The check gating the use of `__builtin_amdgc_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.
2024-08-08 22:48:46 +03:00
11182fae34 fix scratch size of softmax (llama/8642) 2024-08-08 22:48:46 +03:00
0bc8bffe1d ggml: fix compile error for RISC-V (llama/8623) 2024-08-08 22:48:46 +03:00
8c4f30497a CUDA: MMQ code deduplication + iquant support (llama/8495)
* CUDA: MMQ code deduplication + iquant support

* 1 less parallel job for CI build
2024-08-08 22:48:46 +03:00
b1ee3a8444 gguf : handle null name during init (llama/8587) 2024-08-08 22:48:46 +03:00
be9a16fd3f ggml : fix quant dot product with odd number of blocks (llama/8549)
* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix odd blocks for ARM_NEON (llama/8556)

* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix q4_1

* ggml : fix q5_0

* ggml : fix q5_1

* ggml : fix iq4_nl metal

ggml-ci

* ggml : fix q4_0

* ggml : fix q8_0

ggml-ci

* ggml : remove special Q4_0 code for first 2 blocks

* ggml : fix sumf redefinition

---------

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-08 22:48:46 +03:00
f4d9a95b0f ggml : add friendlier error message to fopen errors (llama/8575)
* Add additional error information when model files fail to load.

* Adding additional error information to most instances of fopen.
2024-08-08 22:48:46 +03:00
a8ab3abe09 CUDA: fix partial offloading for ne0 % 256 != 0 (llama/8572) 2024-08-08 22:48:46 +03:00
65a
fb6a835938 cmake : install all ggml public headers (llama/8480)
Co-authored-by: 65a <65a@65a.invalid>
2024-08-08 22:48:46 +03:00
8923bb4292 Add Ascend NPU backend (llama/6035)
* [CANN] Add Ascend NPU backend

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developped by
Huawei, is a heterogeneous computing architecture for AI.

Co-authored-by: wangshuai09 <391746016@qq.com>

* delete trailing whitespaces

* Modify the code based on review comment

* Rename LLAMA_CANN to GGML_CANN

* Make ggml-common.h private

* add ggml_cann prefix for acl funcs

* Add logging for CANN backend

* Delete Trailing whitespace

---------

Co-authored-by: wangshuai09 <391746016@qq.com>
2024-08-08 22:48:46 +03:00
fcba6aa352 make/cmake: add missing force MMQ/cuBLAS for HIP (llama/8515) 2024-08-08 22:48:46 +03:00
8807fe608b Refactor lora adapter support (llama/8332)
* lora: load to devide buft

* add patch tensor function

* correct tensor patch

* llama_lora_adapter_apply

* correct ggml_backend_tensor_copy

* add llm_build_mm

* fix auto merge

* update based on review comments

* add convert script

* no more transpose A

* add f16 convert

* add metadata check

* add sanity check

* fix ftype

* add requirements

* fix requirements

* fix outfile

* conversion: only allow selected models

* fix types

* cuda : do not use dmmv if the tensor does not have enough cols

* llama : lora fixes

* do not disable mmap with lora

Co-authored-by: slaren <slarengh@gmail.com>

* llm_build_lora_mm_id

* convert_lora : MoE LoRA conversion support

* convert_lora : prefer safetensors, similarly to convert_hf

* convert_hf : simplify modify_tensors for InternLM2

* convert_lora : lazy conversion

* llama : load and use alpha from LoRA adapters

* llama : use llm_build_lora_mm in most model graphs

* auto scale

* Revert "auto scale"

This reverts commit 42415a4874e0f963e4aca6796ea5dfb97cd17464.

* remove redundant params

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* change kv metadata

* move add_type to __init__

* convert_hf : move add_type to main()

* convert_lora : use the GGUFWriter from Model instead of overwriting it

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-08-08 22:48:46 +03:00
3e94c7a81d add concat through dim 1/2 (llama/8483)
* add concat through dim 1/2
2024-08-08 22:48:46 +03:00
77af3254e1 Vulkan MMQ Fix (llama/8479)
* Fix incoherence by adding missing LOAD_VEC_A parameter

* Fix Vulkan op result checker build error
2024-08-08 22:48:46 +03:00
d4b3cffec4 vulkan : cmake integration (llama/8119)
* Add Vulkan to CMake pkg

* Add Sycl to CMake pkg

* Add OpenMP to CMake pkg

* Split generated shader file into separate translation unit

* Add CMake target for Vulkan shaders

* Update README.md

* Add make target for Vulkan shaders

* Use pkg-config to locate vulkan library

* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow

* Clean up tabs

* Move sudo to apt-key invocation

* Forward GGML_EXTRA_LIBS to CMake config pkg

* Update vulkan obj file paths

* Add shaderc to nix pkg

* Add python3 to Vulkan nix build

* Link against ggml in cmake pkg

* Remove Python dependency from Vulkan build

* code review changes

* Remove trailing newline

* Add cflags from pkg-config to fix w64devkit build

* Update README.md

* Remove trailing whitespace

* Update README.md

* Remove trailing whitespace

* Fix doc heading

* Make glslc required Vulkan component

* remove clblast from nix pkg
2024-08-08 22:48:46 +03:00
b852a4c5ca metal : template-ify some of the kernels (llama/8447)
ggml-ci
2024-08-08 22:48:46 +03:00
2157abaab4 ggml : minor naming changes (llama/8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-08-08 22:48:46 +03:00
68d609a12c fix the mul_mat_id ut issues (llama/8427)
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <xi2chen@intel.com>

---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Chen Xi <xi2chen@intel.com>
2024-08-08 22:48:46 +03:00
5a8ae474f0 ggml : add NVPL BLAS support (ggml/8329) (llama/8425)
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <ntukanov@nvidia.com>
2024-08-08 22:48:46 +03:00
84493d7f3e cuda : suppress 'noreturn' warn in no_device_code (llama/8414)
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compilng with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-08-08 22:48:46 +03:00
15d71189e9 CUDA: optimize and refactor MMQ (llama/8416)
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
2024-08-08 22:48:46 +03:00
37e962580f Use multi_ptr to clean up deprecated warnings (llama/8256) 2024-08-08 22:48:46 +03:00
db0ea7a2f2 ggml : move sgemm sources to llamafile subfolder (llama/8394)
ggml-ci
2024-08-08 22:48:46 +03:00
5498b0e6c0 ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (llama/5780)
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARC64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile time flags for buiilding the Q4_0_4_4 quant type
2024-08-08 22:48:46 +03:00
2af4a52c39 sycl : Reenabled mmvq path for the SYCL Nvidia Backend (llama/8372)
* SYCL : Reenabled mmvq path for the SYCL Nvidia Backend

* Reduced verbosity of comment
2024-08-08 22:48:46 +03:00
eee2fe882e sycl : fix powf call in device code (llama/8368) 2024-08-08 22:48:46 +03:00
0d1a11e5e2 ggml : loop tiling optimizations for scalar path (ggml/898)
Apply a loop tiling technique to the generic path, which provides
performance upside for ISAs with enough registers to take advantage
of it. Also helps the compiler optimize this path.
2024-08-08 22:48:46 +03:00
b2ead7d6f4 ggml: add support for float16 input tensors in pooling operations (ggml/895)
* Add support for float16 tensors in 1d pooling operations

* Add support for float16 input tensors in 2d pooling operations

* code cleanup

remove unnecessary casting during srow ptr initialization

---------

Co-authored-by: vanaka11 <vanaka1189@gmail.com>
2024-08-08 22:48:46 +03:00
8da6fd4dff vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893)
This prevents invalid frees when destroying a partially initialized
vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer
when running out of device memory.

Co-authored-by: Tony Wasserka <neobrain@users.noreply.github.com>
2024-08-08 22:48:46 +03:00
ab8ec9e940 cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885) 2024-08-08 22:48:46 +03:00
701265bf38 scripts : sync new files (#0) 2024-08-08 22:48:46 +03:00
fe36c90971 cmake : fix compile in xcode (#2311) 2024-08-05 09:48:26 +03:00
6739eb83c3 whisper : handle empty mel (#2324) 2024-07-27 20:35:04 +03:00
f68298ce06 whisper : use vulkan as gpu backend when available (#2302)
* ggml: use vulkan as gpu backend when available

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>

* whisper: enable using vk as default buffer type

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>

---------

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
2024-07-16 10:21:09 +03:00
7ae885c1ef whisper : fix DTW assert (#2299) 2024-07-15 15:50:36 +03:00
d207c68822 cmake : use WHISPER_EXTRA_FLAGS (#2294) 2024-07-09 18:54:18 +03:00
16d72504fe cmake : allow external ggml 2024-07-09 11:38:15 +03:00
1c31f9d4a8 cmake : try to fix openvino build (#2281) 2024-07-08 15:36:51 +03:00
8ecb2f1f68 cmake : remove install of llama convert script [no ci] (#2266) 2024-07-08 14:53:55 +03:00
5226c3d45c make : remove llama prints [no ci] (#2265) 2024-07-08 14:53:55 +03:00
dbf9c15e30 talk-llama : sync llama.cpp 2024-07-08 14:53:55 +03:00
d3f6c34976 examples : fix compile warnings [no ci] (#0) 2024-07-08 14:53:55 +03:00
425e2910a3 sync : ggml 2024-07-08 14:53:55 +03:00
49868aa851 ggml : sync sycl (skip) (#0) 2024-07-08 14:53:55 +03:00
ff08e30ab5 scripts : fix sync scripts 2024-07-08 14:53:55 +03:00
95f2a191c0 ggml : remove unnecessary UNUSED macro call (ggml/880)
This commit removes an UNUSED macro call that is not needed as the
variable n0 is used in the code and will not produce a warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-08 14:53:55 +03:00
00422ec3cf cmake : add GGML_BUILD and GGML_SHARED macro definitions (llama/8281) 2024-07-08 14:53:55 +03:00
c5b05321e9 Enabled more data types for oneMKL gemm_batch (llama/8236) 2024-07-08 14:53:55 +03:00
5dc636a65a CUDA: MMQ support for iq4_nl, iq4_xs (llama/8278) 2024-07-08 14:53:55 +03:00
73703a144f CUDA: revert part of the RDNA1 optimizations (llama/8309)
The change on the launch_bounds was causing a small performance drop in perplexity of 25 t/s
2024-07-08 14:53:55 +03:00
e89fdceec2 CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (llama/8311) 2024-07-08 14:53:55 +03:00
29a2739d27 Fix WARP_SIZE=16 bug of Intel GPU (llama/8266)
* fix group_norm ut

* split softmax

* fix softmax

* add concat support condition

* revert debug code

* move QK_WARP_SIZE to presets.hpp
2024-07-08 14:53:55 +03:00
ee6d17f6b4 rm get_work_group_size() by local cache for performance (llama/8286)
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-07-08 14:53:55 +03:00
95e90823d9 Define and optimize RDNA1 (llama/8085) 2024-07-08 14:53:55 +03:00
005cc45df3 fix typo (llama/8267)
Co-authored-by: Judd <foldl@boxvest.com>
2024-07-08 14:53:55 +03:00
c2c60dc9ba Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (llama/8258) 2024-07-08 14:53:55 +03:00
4af3194b7c cuda : update supports_op for matrix multiplication (llama/8245) 2024-07-08 14:53:55 +03:00
4a2ba1a065 Fix win build conflict of math library (llama/8230)
* fix win build conflict of math library

* fix the condition: !(win32 & SYCL)

* revert warp_size=16
2024-07-08 14:53:55 +03:00
f096cc6807 Fix the sub group size of Intel (llama/8106)
* use warp_size macro for all sycl kernels

* fix mask of permute_sub_group_by_xor

* fix rms_norm with correct warp number

* fix rms_norm_f32/group_norm_f32

* move norm to norm.cpp file

* fix quantize bug

* fix mmvq's batch size
2024-07-08 14:53:55 +03:00
e4bc83ab47 CUDA: refactor and optimize IQ MMVQ (llama/8215)
* CUDA: refactor and optimize IQ MMVQ

* uint -> uint32_t

* __dp4a -> ggml_cuda_dp4a

* remove MIN_CC_DP4A checks

* change default

* try CI fix
2024-07-08 14:53:55 +03:00
db7e0dbe6e Update SYCL-Rope op and Refactor (llama/8157)
* align with rope.cu and move sycl-op to a single file
2024-07-08 14:53:55 +03:00
bf88c94da9 CUDA: fix MMQ stream-k for --split-mode row (llama/8167) 2024-07-08 14:53:55 +03:00
3eea171cab feat: cuda implementation for ggml_conv_transpose_1d (ggml/854)
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d  tests

* added a test for large tensors

* removed use cuda hardcoding

* restored test-conv-transpose.c

* removed unused arugments, and fixed bug where test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-08 14:53:55 +03:00
64a56ebf13 ci : disable java build 2024-07-08 14:26:59 +03:00
bec9836849 server : add inference path to make OAI API compatible (#2270) 2024-07-08 14:24:58 +03:00
c118733a29 sync : ggml + fix sync script 2024-06-26 23:20:19 +03:00
bb3dd45524 make : disable CUDA graphs 2024-06-26 23:20:13 +03:00
04e7fa6f4f ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (llama/8140) 2024-06-26 23:18:11 +03:00
9f7f36d4c9 make : disable CUDA mel build 2024-06-26 22:25:25 +03:00
4a62efbb95 cmake : minor fixes 2024-06-26 21:42:39 +03:00
0a55a70b9b make : fix missing -O3
same as https://github.com/ggerganov/llama.cpp/pull/8143
2024-06-26 21:21:12 +03:00
dc8cc2dd6f whisper : disable CUDA mel + fix FFMPEG 2024-06-26 20:11:38 +03:00
3efedb9511 sync : ggml 2024-06-26 19:40:23 +03:00
e30c679928 whisper : reorganize source code + improve CMake (#2256)
* scripts : update sync [no ci]

* files : reorganize [no ci]

* sync : llama.cpp

* cmake : link math library

* cmake : build normal ggml library

* files : move headers to include

* objc : fix path to ggml-metal.h

* ci : fix WHISPER_CUDA -> GGML_CUDA

* scripts : sync LICENSE [no ci]
2024-06-26 19:34:09 +03:00
bf4cb4abad whisper : optimize fft() function (#2242)
Co-authored-by: Mike Fan <60965742+mike-fzy@users.noreply.github.com>
2024-06-18 18:10:33 +03:00
e293f17d34 talk-llama : sync llama.cpp 2024-06-18 09:45:37 +03:00
5d950c4b8d whisper : use ggml_backend_sched (#2239)
* whisper : use ggml_backend_sched (wip)

* use sched in whisper_allocr

* whisper : single backend in whisper_context

* whisper : remove whisper_state->backends_used

* whisper : remove whisper_context->backend

* whisper : reset scheduler after init

* whisper : fix external encoder (e.g. CoreML)

* whisper : cleanup

* whisper : handle null GPU buffer types + fix sycl

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-18 09:39:40 +03:00
820446e230 fix : remove extra files 2024-06-18 09:39:40 +03:00
54d5823ebe scripts : sync ggml-blas 2024-06-18 09:39:40 +03:00
5181494e9f build : update make / cmake 2024-06-18 09:39:40 +03:00
4a6e6e8b30 sync : ggml 2024-06-18 09:39:40 +03:00
de29b193f6 move BLAS to a separate backend (cont) (llama/6210)
ggml-ci
2024-06-18 09:39:40 +03:00
922971041b Vulkan Shader Refactor, Memory Debugging Option (llama/7947)
* Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory

* Improve debug log code

* Add memory debug output option

* Fix flake8

* Fix unnecessary high llama-3 VRAM use
2024-06-18 09:39:40 +03:00
63a767a134 scripts : stop sync whisper example from ggml 2024-06-18 09:39:40 +03:00
30841fa786 cmake : fix sycl build (#0) 2024-06-16 18:19:48 +03:00
3b1ac03828 ggml : remove OpenCL (#0) 2024-06-16 18:19:48 +03:00
990de617b5 sycl : sync (#0) 2024-06-16 18:19:48 +03:00
6975600b4b cuda : enable CUDA graphs (#0) 2024-06-16 18:19:48 +03:00
061eeb9f61 talk-llama : sync llama.cpp 2024-06-16 18:19:48 +03:00
4942b1b428 cmake : fix CUDA build (#0) 2024-06-16 18:19:48 +03:00
3c7cc5c437 sync : ggml
ggml-ci
2024-06-16 18:19:48 +03:00
5cd42ee2cc ggml : fix and optimize ppc64le (ggml/849)
* fix compile issues introduced by loongarch_asx

* restore quant changes to merge

* fix compile issues introduced by loongarch_asx

* further optimize by using vec_msum & vec_sum4s on ppc64le
2024-06-16 18:19:48 +03:00
ee718f3da6 ggml : remove duplicate include of ggml-common.h (ggml/853)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-06-16 18:19:48 +03:00
63eac1f608 remove global variables (llama/7710)
* separate DPCT helpers outside

* replace global variables with context

* remove useless extra

* update mul_mat condition

* remove duplicate buft initialization

* remove duplicate extra and global work group size

* remove useless backend check

* remove duplicated extras

* use macro for group_size and remove cuda-related
2024-06-16 18:19:48 +03:00
b17ba2815b CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)
* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores

* try CI fix

* try CI fix

* try CI fix

* fix data race

* rever q2_K precision related changes
2024-06-16 18:19:48 +03:00
7a489af2f3 metal : utilize max shared memory for mul_mat_id (llama/7935) 2024-06-16 18:19:48 +03:00
4a4ea13d6d rpc : fix ggml_backend_rpc_supports_buft() (llama/7918) 2024-06-16 18:19:48 +03:00
174a461fc6 move BLAS to a separate backend (llama/6210)
* move BLAS to a separate backend

* rename GGML_USE_OPENBLAS to GGML_USE_BLAS

* alloc : reuse same buffer when the same buffer type if used multiple times

* set number of threads automatically for openblas and blis

* sched : print assignments when GGML_SCHED_DEBUG env variable is set

* sched : allow ops with weights on an incompatible buffer type

This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
d8b7a24bc9 CUDA: fix broken oob check for FA vec f32 kernel (llama/7904) 2024-06-16 18:19:48 +03:00
acf3832c9c tests : add non-cont unary tests (llama/7857)
* tests : add non-cont unary tests

* ggml : update unary asserts and "supports_op"

ggml-ci
2024-06-16 18:19:48 +03:00
d29ac44303 ggml : improve ggml_is_contiguous logic (llama/7856)
* ggml : improve ggml_is_contiguous logic

ggml-ci

* ggml : support more contiguous cases

ggml-ci
2024-06-16 18:19:48 +03:00
12638dfef0 vulkan: select only one device for single gpu with multiple drivers (llama/7582) 2024-06-16 18:19:48 +03:00
f100b3b523 Update Vulkan RoPE implementation (llama/7818)
* Update Vulkan RoPE implementation

* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception

Minor fixes

* Fix segfault when running out of VRAM

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
a99e213a82 CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860) 2024-06-16 18:19:48 +03:00
7483d2b61c CUDA: use tensor cores for MMQ (llama/7676)
* CUDA: int8 tensor cores for MMQ (legacy quants)

* fix out-of-bounds writes

* __builtin_assume -> GGML_CUDA_ASSUME

* fix writeback returning too early
2024-06-16 18:19:48 +03:00
1fe5948227 use the correct SYCL context for host USM allocations (llama/7777)
Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>
2024-06-16 18:19:48 +03:00
760497e1ab CUDA: revise q8_1 data layout for mul_mat_q (llama/7824) 2024-06-16 18:19:48 +03:00
b172e7714c vulkan : reuse parent extra for views (llama/7806)
* vulkan : reuse parent extra for views

* Fix validation error when multiple compute contexts are used in a graph

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-06-16 18:19:48 +03:00
dc01aadb18 fix softmax r2r result wrong issue (llama/7811) 2024-06-16 18:19:48 +03:00
e08c62149b CUDA: refactor mmq, dmmv, mmvq (llama/7716)
* CUDA: refactor mmq, dmmv, mmvq

* fix out-of-bounds write

* struct for qk, qr, qi

* fix cmake build

* mmq_type_traits
2024-06-16 18:19:48 +03:00
abab4500fa ggml : refactor rope norm/neox (llama/7634)
* ggml : unify rope norm/neox (CPU)

* ggml : fix compile warning

* ggml : remove GLM rope mode

ggml-ci

* metal : better rope implementation

ggml-ci

* cuda : better rope implementation

ggml-ci

* naming : n_orig_ctx -> n_ctx_orig

ggml-ci

* dev : add reminders to update backends

ggml-ci

* vulkan : fix ggml_rope_ext() usage

* cuda : fix array size + indents

ggml-ci
2024-06-16 18:19:48 +03:00
e666315fa8 Allow number of nodes in CUDA graph to change (llama/7738)
Previously the code would have failed to cope in the case that the
number of nodes changes in an existing CUDA graph. This fixes the
issue by removing an unnecessary conditional.
2024-06-16 18:19:48 +03:00
3f869af14c ggml : remove OpenCL (llama/7735)
ggml-ci
2024-06-16 18:19:48 +03:00
cbacb7634c ggml : prevent builds with -ffinite-math-only (llama/7726)
This enforces a check that -fno-finite-math-only was set and that the operating
compiling mode is not in finite maths mode. This is because during rewriting of
silu and softmax for cpu #7154 there emerged an issue where the result that was
observed when >1 slot was nondeterministic as found by @JohannesGaessler.

@LostRuins narrowed the problem down to -ffinite-math-only which was theorised
to be due to SiLU, instead of flushing small values to 0, returns NaN or some
other garbage. @jart proposed a fix that @ggerganov then implemented in this fix

ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
2024-06-16 18:19:48 +03:00
6cc3b022ee llama : offload to RPC in addition to other backends (llama/7640)
* llama : offload to RPC in addition to other backends

* - fix copy_tensor being called on the src buffer instead of the dst buffer

- always initialize views in the view_src buffer

- add RPC backend to Makefile build

- add endpoint to all RPC object names

* add rpc-server to Makefile

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
e5e38d4920 ggml : use OpenMP as a thread pool (llama/7606)
* ggml: Added OpenMP for multi-threads processing

* ggml : Limit the number of threads used to avoid deadlock

* update shared state n_threads in parallel region

* clear numa affinity for main thread even with openmp

* enable openmp by default

* fix msvc build

* disable openmp on macos

* ci : disable openmp with thread sanitizer

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
2a6bab5655 Vulkan Mixture of Experts (MoE) support (llama/7628)
* Finish Vulkan mul_mat_id implementation

* Add Vulkan sum_rows and div ops

* Fix MUL_MAT_ID matrix matrix shader

* Fix MUL_MAT_ID matrix vector shader dispatch size

* Fix MUL_MAT_ID matrix vector shader and dispatch code

* Update Vulkan CPU offload for MUL_MAT_ID

* Fix crash when using split mode none and setting a main GPU
2024-06-16 18:19:48 +03:00
8c01c9b85c kompute : implement op_getrows_f32 (llama/6403)
op_getrows_f32 is required since https://github.com/ggerganov/llama.cpp/pull/6122
for the Vulkan w/ Kompute backend to be functional.

As such, implement this op to make this backend functional again.
2024-06-16 18:19:48 +03:00
d1123d795e fix bug introduced in using calloc (llama/7701)
compilade pointed this out on the previous MR
2024-06-16 18:19:48 +03:00
9b3d784020 Fix FlashAttention debug test, FP32 assert (llama/7684) 2024-06-16 18:19:48 +03:00
a16137d13d CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681) 2024-06-16 18:19:48 +03:00
5582039d0a CUDA: quantized KV support for FA vec (llama/7527)
* CUDA: quantized KV support for FA vec

* try CI fix

* fix commented-out kernel variants

* add q8_0 q4_0 tests

* fix nwarps > batch size

* split fattn compile via extern templates

* fix flake8

* fix metal tests

* fix cmake

* make generate_cu_files.py executable

* add autogenerated .cu files

* fix AMD

* error if type_v != FP16 and not flash_attn

* remove obsolete code
2024-06-16 18:19:48 +03:00
9a16c643e2 ggml : fix loongson compile warnings (llama/7537)
* ggml : fix loongson compile warnings

ggml-ci

* Fix loongarch quantize test fail.

Fix unexpected error introduced during rebase code.

* tests : disable json test due to lack of python on the CI node

ggml-ci

---------

Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>
2024-06-16 18:19:48 +03:00
10a8a23100 faster avx512 exp implementation (llama/7551)
* faster avx512 exp implementation

* x->r

* improve accuracy, handle special cases

* remove `e`
2024-06-16 18:19:48 +03:00
29cfeef77f ggml : fix loongarch build (O2 issue) (llama/7636) 2024-06-16 18:19:48 +03:00
e66e9ea25b metal : remove invalid asserts (llama/7617) 2024-06-16 18:19:48 +03:00
276779a849 metal : add missing asserts (llama/7617) 2024-06-16 18:19:48 +03:00
1f35ce61c1 ggml : fix YARN + add tests + add asserts (llama/7617)
* tests : add rope tests

ggml-ci

* ggml : fixes (hopefully)

ggml-ci

* tests : add non-cont tests

ggml-ci

* cuda : add asserts for rope/norm + fix DS2

ggml-ci

* ggml : assert contiguousness

* tests : reduce RoPE tests

ggml-ci
2024-06-16 18:19:48 +03:00
4b19cc3ed4 cuda : non-cont concat support (llama/7610)
* tests : add non-cont concat tests

* cuda : non-cont concat support

ggml-ci
2024-06-16 18:19:48 +03:00
a535d348dd llama-bench : add support for the RPC backend (llama/7435) 2024-06-16 18:19:48 +03:00
8f5dc729d9 ggml : use atomic_flag for critical section (llama/7598)
* ggml : use atomic_flag for critical section

* add windows shims
2024-06-16 18:19:48 +03:00
02fc147a0b examples : adapt to new ggml_concat (ggml/0) 2024-06-16 18:19:48 +03:00
109148ac84 ggml : fix typo in ggml.c (llama/7603) 2024-06-16 18:19:48 +03:00
3563473d2c Align GEMM dispatch (llama/7566)
* align GEMM dispatch
2024-06-16 18:19:48 +03:00
046834198d sycl : fix assert (llama/7563) 2024-06-16 18:19:48 +03:00
0a2ad9de06 vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE (llama/7552) 2024-06-16 18:19:48 +03:00
39b0640b09 rpc : resource management rework (llama/7562)
* rpc : resource management rework

* address review comments
2024-06-16 18:19:48 +03:00
8dca71de64 fix ggml_sycl_mul_mat_id() to match the change of api (llama/7436)
* fix mul_mat_id to match the change of api

* rm comment

* rm unused or duplicated code, rename as review comment
2024-06-16 18:19:48 +03:00
812787cbc5 ggml : generalize GGML_OP_CONCAT (llama/7563)
* ggml : generalize GGML_OP_CONCAT (WIP)

ggml-ci

* tests : add dim != 2 tests

* metal : generalize concat kernel

* tests : naming

* cuda : generalize concat kernel

ggml-ci

* sycl : add warning and assert

* ggml : fix op params handling

* metal : bugfix kernel

ggml-ci

* ggml : reimplement CPU and Metal

* cuda : add asserts

ggml-ci

* ggml : fix ptrs

ggml-ci
2024-06-16 18:19:48 +03:00
68ef10805e update HIP_UMA #7399 (llama/7414)
* update HIP_UMA #7399

add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enable.
- get x2 on prompte eval and x1.5 on token gen with rocm6.0 on ryzen 7940HX iGPU (780M/gfx1103)

* simplify code, more consistent style

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
96fdb90f5f Allow multiple copy function pointers for CUDA graph kernel param updates (llama/7565)
CUDA graphs require parameter updates to kernels associated with
GGML_OP_CPY nodes. Previously the implementation only checked for a
single CUDA kernel in such nodes, but this caused a bug in cases where
2 such kernels exist. This fixes the issue by using a vector to allow
multiple function pointers to be stored and checked against.

Fixes #7942
2024-06-16 18:19:48 +03:00
e98f9ac554 Fix q_xxs using mul_mat_q (llama/7459) 2024-06-16 18:19:48 +03:00
02d481595b Add freq factors (llama/7495) 2024-06-16 18:19:48 +03:00
7091c7ab5a metal : add GGML_OP_REPEAT kernels (llama/7557)
ggml-ci
2024-06-16 18:19:48 +03:00
d70ccb75f5 metal : disable FA kernel for HS=256 (llama/7556)
ggml-ci
2024-06-16 18:19:48 +03:00
5ee048eb67 ggml : restore ggml_rope_xpos_inplace (ggml/0)
ggml-ci
2024-06-16 18:19:48 +03:00
37ed71c964 ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (llama/7433)
* Add SVE support for q4_0_q8_0 q8_0_q8_0

* remove ifdef
2024-06-16 18:19:48 +03:00
8cd7a3df37 ggml : silence UB sanitizer error during iq2_xxs quantization (llama/0) 2024-06-16 18:19:48 +03:00
04a3279320 ggml : remove ggml_flash_attn and ggml_flash_ff (llama/7463)
ggml-ci
2024-06-16 18:19:48 +03:00
45ddda8e0c ggml : drop support for QK_K=64 (llama/7473)
* ggml : drop support for QK_K=64

ggml-ci

* opencl : restore QK_K=256 define
2024-06-16 18:19:48 +03:00
c41317fd66 Update vulkan rope implementation to support frequency factors (llama/7475) 2024-06-16 18:19:48 +03:00
96b8419b27 CUDA: fix FA out-of-bounds reads (llama/7479) 2024-06-16 18:19:48 +03:00
3c63f4cf35 CUDA: fix FA out-of-bounds writes (llama/7465) 2024-06-16 18:19:48 +03:00
5848dfd9c8 cuda : fix compile warning (llama/7454) 2024-06-16 18:19:48 +03:00
29ab5d0326 CUDA: remove incorrect precision check (llama/7454) 2024-06-16 18:19:48 +03:00
c4d6958b3e cuda : fix rope + add tests (llama/7452)
* cuda : fix rope pos data

ggml-ci

* ggml : drop mode & 1 == 1 support for ggml_rope

ggml-ci

* ggml : support freq_factors for f16 rope (CPU)

ggml-ci

* tests : add rope tests using frequency factors

ggml-ci
2024-06-16 18:19:48 +03:00
c9dcb75118 llama : add phi3 128K model support (llama/7225)
* add phi3 128k support in convert-hf-to-gguf

* add phi3 128k support in cuda

* address build warnings on llama.cpp

* adjust index value in cuda long rope freq factors

* add long rope support in ggml cpu backend

* make freq factors only depend on ctx size

* remove unused rope scaling type 'su' frin gguf converter

* fix flint warnings on convert-hf-to-gguf.py

* set to the short freq factor when context size is small than trained context size

* add one line of comments

* metal : support rope freq_factors

* ggml : update ggml_rope_ext API to support freq. factors

* backends : add dev messages to support rope freq. factors

* minor : style

* tests : update to use new rope API

* backends : fix pragma semicolons

* minor : cleanup

* llama : move rope factors from KV header to tensors

* llama : remove tmp assert

* cuda : fix compile warning

* convert : read/write n_head_kv

* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
bbdbc3fc62 metal : handle F16 inf values, fix FA partial offload (llama/7434)
ggml-ci
2024-06-16 18:19:48 +03:00
28c207a541 CUDA: fix unused warning in mmq.cu (llama/7442) 2024-06-16 18:19:48 +03:00
c23f830983 CUDA: deduplicate mmq code (llama/7397) 2024-06-16 18:19:48 +03:00
caeeb32b41 rpc : track allocated buffers (llama/7411)
* rpc : track allocated buffers

ref: #7407

* rpc : pack rpc_tensor tightly
2024-06-16 18:19:48 +03:00
584cc1177a Update SYCL upscale operation (llama/7321)
* Update SYCL upscale operation

* Formatting

* Remove messages
2024-06-16 18:19:48 +03:00
cc1ae10989 ggml-opencl, llama: using reserve() if count already known (llama/7272) 2024-06-16 18:19:48 +03:00
eb26f55b40 ggml : add loongarch lsx and lasx support (llama/6454)
* add loongarch lsx and lasx optimize code

* Add loongarch compilation support to makefile

* revert stb_image.h

* opt bytes_from_nibbles_32 and sum_i16_pairs_float

* fix undeclared

* format code

* update

* update 2

---------

Co-authored-by: Jinyang He <hejinyang@loongson.cn>
2024-06-16 18:19:48 +03:00
eb2b086584 Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (llama/7258) 2024-06-16 18:19:48 +03:00
67919cfe11 Vulkan Embedding Fix (llama/7360)
* Fix empty Vulkan host buffers

Add fp32 fp16 matmul shader

Fix matmul shader alignment

* Remove deprecated tensor->backend uses

* Fix Vulkan validation errors on embedding models with no offloaded layers

* Fix Vulkan llava segfault when not offloading layers
2024-06-16 18:19:48 +03:00
bf5fc81a8a ggml : fix another case of quants nans (llama/7387) 2024-06-16 18:19:48 +03:00
2b07dc3186 ggml: implement quantized KV cache for FA (llama/7372) 2024-06-16 18:19:48 +03:00
951c463d39 cuda : clear error after buffer allocation failure (llama/7376) 2024-06-16 18:19:48 +03:00
7f257b210f Capture CUDA logging output (llama/7298)
* logging: output capture in cuda module

* fix compile error

* fix: vsnprintf terminates with 0, string use not correct

* post review

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
705fe30a02 android : use "ci-android" branch for CI (llama/7341)
* android : use "ci-android" branch for CI

* ggml : disable SIMD exp and silu for 32-bit ARM

ggml-ci

* android : do not fetch, use add_subdirectory instead

* cmake : provide binary dir
2024-06-16 18:19:48 +03:00
45b5b95e29 CUDA: deduplicate FlashAttention code (llama/7352) 2024-06-16 18:19:48 +03:00
f2c47d1e6a cuda : add half2 __shfl_xor() for ROCm 5.5 (llama/7263) 2024-06-16 18:19:48 +03:00
b4bb9b9036 Update and fix Vulkan soft_max and argsort implementations (llama/7237)
* Update and fix Vulkan softmax implementation

* Update and fix Vulkan argsort implementation
2024-06-16 18:19:48 +03:00
2bc6483299 ggml : fix quants nans when all the group weights are very close to zero (llama/7313) 2024-06-16 18:19:48 +03:00
ec52f900e4 CUDA: faster large batch FA without tensor cores (llama/7314) 2024-06-16 18:19:48 +03:00
77d708fabb rpc : set SO_REUSEADDR for the server socket (llama/7320)
ref: #7293
2024-06-16 18:19:48 +03:00
c00149c861 ggml-quants, llama : removed excess checks (llama/7274) 2024-06-16 18:19:48 +03:00
574661f2e6 ggml : rewrite silu and softmax for cpu (llama/7154)
This change upstreams llamafile's vectorized expf() functions. This lets
us compute softmax and silu more accurately than the short[65536] lookup
table that GGML previously used to make this operation go faster. We can
support aarch64 and sse2+ with the worst case rounding error of 2ulp. It
makes make -j8 tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf
go 1.5x faster for SSE2+FMA, 1.9x faster for AVX2+FMA and 2.1x on AVX512
2024-06-16 18:19:48 +03:00
7bd69349bf rpc : add command line arg for specifying backend memory
ref: #7293
2024-06-16 18:19:48 +03:00
488ad99c13 Add support for properly optimized Windows ARM64 builds with LLVM and MSVC (llama/7191)
* logging: add proper checks for clang to avoid errors and warnings with VA_ARGS

* build: add CMake Presets and toolchian files for Windows ARM64

* matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings

* ci: add support for optimized Windows ARM64 builds with MSVC and LLVM

* matmul-int8: fixed typos in q8_0_q8_0 matmuls

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* matmul-int8: remove unnecessary casts in q8_0_q8_0

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
7178cceeaa ggml : use dynamic thread scheduling for matrix multiplication (llama/6915)
* Just reordering some structs.

* Adding in the calls to mm_pause

* Passing around the state

* Renaming and moving a bunch of variables around.

* Extracting the logic to it's own function.

* Moving some variable definitions into the chunk function.

* Moving some variables around

* moving src1_cont inside

* Moving row_size

* adding the current_chunk

* Reorg the code.

* Formatting to match the orig patch

* starting to setup the chunking variables

* Starting the buildup of the loop

* The yield shouldn't be necessary.

* adding the looping structure based on the chunk configuration.

* Add in the re-chunking code.

* Making it much more likely to rechunk.

* disable resizing if numa is enabled.

* Updating comments with what we've learned.

* Fix formatting

* Couple more formatting fixes.

* More style fixes.

* Fix Warnings

* Going with unused because there's conditional logic that needs it.

* Update ggml.c

* Update ggml.c

---------
2024-06-16 18:19:48 +03:00
8d55ccdb8c Avoid unnecessarily disabling CUDA graphs (llama/7302)
As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts.
This fixes the issue by avoiding the consective update counter from incrementing unnecessarily
for tokens in which cuda graphs are disabled due to batch size > 1.
2024-06-16 18:19:48 +03:00
37a72cb170 ggml : tag ggml_tensor::backend as deprecated (llama/7290) 2024-06-16 18:19:48 +03:00
bf9b69284f Add missing " (llama/7303) 2024-06-16 18:19:48 +03:00
c4de1e19df ggml : add ggml_upscale_ext (ggml/814)
* initial commit with CPU implementation of upscale to shape and test, cuda implementation next

* experimental commit to see if dst shape is correct

* test version

* test

* removed unnecessary params

* refactor

* fixed tests

* ggml : metal impl + cleanup + sycl dev warnings

* patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior

* metal : fix upsacle op to support nb00 + style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
5b7073cae1 scripts : update sync 2024-06-16 12:41:42 +03:00
b29b3b2924 whisper : use ggml-cuda in mel calc, set appropriate device (#2236)
* whisper : use ggml-cuda in mel calc, set appropriate device

* whisper : forbid cuda mel calc on devices with compute < 600, workaround for #2230
2024-06-13 13:16:07 +03:00
420b6abc54 cuda : fix HIPBLAS build (#2234) 2024-06-11 19:14:38 +03:00
99804b0f3e cuda : fix bounds check for src0 rows in MMVQ kernel (#2231)
* cuda : fix bounds check for src0 rows in MMVQ kernel

* Update ggml-cuda/mmvq.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-06-11 17:39:01 +03:00
c55964c956 ci : fix CUDA builds (#2232) 2024-06-11 17:21:30 +03:00
20c542c713 whisper : auto-grow working areas for mel_calc_cuda (#2227)
* whisper : auto-grow working areas for mel_calc_cuda, fixes #2226

* whisper : only calculate mel spectrogram on GPU if audio is <= 5 min
2024-06-10 21:51:32 +03:00
c2bdb960cd whisper : free whisper_mel instances (#2220) 2024-06-10 11:00:15 +03:00
87acd6d629 whisper : whisper_state/backend fixes (#2217)
* whisper : fixes

* ci : WHISPER_CUBLAS -> WHISPER_CUDA
2024-06-06 18:51:36 +03:00
f842d31171 whisper : calculate mel spectrogram directly into a ggml_tensor (#2208)
* whisper : calculate mel spectrogram directly into a ggml_tensor

* whisper : remove unused temp buffer from state

* whisper : fix not initializing wstate.embd_enc
2024-06-06 16:20:46 +03:00
ffef323c4c whisper : add CUDA-specific computation mel spectrograms (#2206)
* whisper : use polymorphic class to calculate mel spectrogram

* whisper : add cuda-specific mel spectrogram calculation

* whisper : conditionally compile cufftGetErrorString to avoid warnings

* build : add new files to makefile

* ruby : add new files to conf script

* build : fix typo in makefile

* whisper : suppress cub warning for deprecated C++ std in whisper-mel-cuda
2024-06-04 09:32:23 +03:00
af5833e298 whisper : remove speed_up and phase_vocoder* functions (#2198)
* whisper : fix cast warning

* whisper : remove phase_vocoder functions, ref #2195

* whisper : remove speed_up from whisper_full_params, closes #2195
2024-05-31 11:37:29 +03:00
b87494bb8f readme : add conan badge (#2196)
* Add conan badge

* Fix markdown formating
2024-05-30 15:43:28 +03:00
ad130431aa readme : add install instructions for Conan (#2189) 2024-05-30 15:06:15 +03:00
e130b66642 whisper: use global cache for sin/cos vals and Hann window (#2194)
- also rename Hanning to Hann as it's named after Julius von Hann
 as per Wikipedia
2024-05-29 19:09:21 +03:00
c7b6988678 release : v1.6.2 2024-05-27 10:35:09 +03:00
05042a782d Revert "whisper : remove extra backend instance (huh?)" (#2182)
This reverts commit 4caa64b73e.
2024-05-27 10:20:25 +03:00
a7dc2aab16 server : fix typo (#2181)
A simple comment typo, PR can be dismissed
2024-05-25 10:46:22 +03:00
22d46b7ba4 ruby : update bindings (#2154)
* update library files

* update whispercpp

* not needed for gem
2024-05-22 23:02:52 +03:00
c10db6ea28 release : v1.6.1 2024-05-21 18:44:37 +03:00
1b51fdf170 examples : add support for decoding input with ffmpeg (Linux) (#2133)
- search for ffmpeg libs/headers at cmake time
- added ffmpeg-transcode.cpp into libcommon if ffmpeg on
- hooked ffmpeg trancoding in common read_wav(...)
- passed test:
./main -m ggml-base.en.bin -f samples/jfk.mp3
2024-05-21 18:31:41 +03:00
adee3f9c1f node : add flash_attn param (#2170) 2024-05-20 09:08:48 +03:00
4798be1f9a ci: Update build.yml to suppress warnings about node.js versions (#2166)
* Update actions to suppress warnings about old node.js

https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/

* Update actions/upload-artifact, specify android cmdline-tools-version

* Use java 20

gradle 8.1 complains against 21
https://docs.gradle.org/current/userguide/compatibility.html
2024-05-19 11:49:26 +03:00
08981d1bac release : v1.6.0 2024-05-15 09:59:48 +03:00
7094ea5e75 whisper : use flash attention (#2152)
* whisper : use flash attention in the encoder

* whisper : add kv_pad

* whisper : remove extra backend instance (huh?)

* whisper : use FA for cross-attention

* whisper : use FA for self-attention

* whisper : simplify encoder FA

* whisper : add flash_attn runtime parameter

* scripts : add bench log

* scripts : add M1 Pro bench log
2024-05-15 09:38:19 +03:00
9d5771ae43 talk-llama : reject runs without required arguments (#2153)
* Extended talk-llama example to reject runs without required arguments.

Print warning and exit if models are not specified on the command line.

* Update examples/talk-llama/talk-llama.cpp

* Update examples/talk-llama/talk-llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-14 21:32:41 +03:00
f56b8305c4 sync : ggml 2024-05-14 19:16:32 +03:00
1056ad762c metal : support FA without mask + add asserts (llama/7278)
* ggml : fa without mask + add asserts

ggml-ci

* metal : support non-contiguous KV

ggml-ci
2024-05-14 19:16:29 +03:00
c451080c8b ggml : add RPC backend (llama/6829)
* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
2024-05-14 19:16:29 +03:00
8e7c22fbdb rm wait() (llama/7233) 2024-05-14 19:16:29 +03:00
e57e95eb0d CUDA: add FP32 FlashAttention vector kernel (llama/7188)
* CUDA: add FP32 FlashAttention vector kernel

* fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel
2024-05-14 19:16:29 +03:00
130f43e4b8 scripts : sync ggml-rpc 2024-05-14 19:15:35 +03:00
d8356a1cc2 whisper : fix model path encoding in windows (#2086)
* fix: model path encoding in windows

* fix: convert model path to wide string only for MSVC compiler
2024-05-14 09:43:41 +03:00
4ef8d9f44e server : return utf-8 (#2138) 2024-05-13 15:33:46 +03:00
3928dbd206 node : add audio_ctx and audio buffer params (#2123)
* node : add audio_ctx param

* node : support passing audio buffer directly

* node : parse audio_ctx in index.js

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-13 15:22:23 +03:00
2ced6f0742 cmake : fix HIP/ROCm build (#2102) 2024-05-13 15:18:43 +03:00
30f73109b8 node : add additional params (#2000)
* Add additional params to addon.node

* Add comma_in_time as parameter

* Fix tests
2024-05-13 15:15:43 +03:00
17fa62d3d3 js : remove un-needed request header from fetchRemote (#2119) 2024-05-13 15:13:19 +03:00
1da5edcde0 cmake : fix metal embed sources path (#2110) 2024-05-13 15:09:59 +03:00
0bb05b113d main : dont print timings with --no-prints (#2108)
Signed-off-by: Daniel Ziegenberg <daniel@ziegenberg.at>
2024-05-13 15:00:19 +03:00
f141b2b938 main : add options for temperature control (#2088)
Add two options:

```
-tp,       --temperature N     [0.00   ] The sampling temperature, between 0 and 1
-tpi,      --temperature-inc N [0.20   ] The increment of temperature, between 0 and 1
```

The sampling temperature, between 0 and 1. Higher values like 0.8 will
make the output more random, while lower values like 0.2 will make it
more focused and deterministic. If set to 0, the model will use log
probability to automatically increase the temperature until certain
thresholds are hit.

Signed-off-by: Daniel Ziegenberg <daniel@ziegenberg.at>
2024-05-13 14:59:44 +03:00
2b434c449e whisper : switch back to F32 mask (#0) 2024-05-13 14:43:43 +03:00
e93081f83f whisper.android : update example, add field to print timestamp (#2072) 2024-05-13 14:30:03 +03:00
b6bbce4ae9 cmake : fix json INTERFACE library (#2069) 2024-05-13 14:29:39 +03:00
7705dc52da main : fix double quote escaping in csv output (#2090) 2024-05-13 11:55:32 +03:00
e6acaf9d91 metal : tune soft_max number of threads (#0) 2024-05-13 11:02:26 +03:00
2c81e6fd51 whisper : remove old flash attn code (#0) 2024-05-13 11:02:26 +03:00
9506267ce5 ggml : try fix ppc64 (#0) 2024-05-13 11:02:26 +03:00
fbeb80b5f0 ggml : remove oboslete alibi code (skipme) (#0) 2024-05-13 11:02:26 +03:00
3fa7d29876 talk-llama : sync llama.cpp 2024-05-13 11:02:26 +03:00
fe179ae0cc sync : ggml 2024-05-13 11:02:26 +03:00
40aeeeecc4 ggml : optimize for ppc64le using VSX intrinsics (ggml/784)
* optimize for ppc64le using VSX intrinsics

* 1. code clean up by removing comments about overflow concern.

2. fix typo in suffix of scaling.

* Continue to fix typo in suffix of scaling for QK_K <> 256

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-13 11:02:26 +03:00
5a863fbe18 metal : fix indent (ggml/0) 2024-05-13 11:02:26 +03:00
91c646c61d ggml : restore sigmoid decl order (ggml/0) 2024-05-13 11:02:26 +03:00
accada542a ggml : resolve merge (ggml/0)
ggml-ci
2024-05-13 11:02:26 +03:00
e54329da7b ggml : full ALiBi support (llama/7192)
* ggml : full ALiBi support

* ggml : update ggml_soft_max_ext() CUDA, SYCL

* ggml : ggml_flash_attn_ext() support ALiBi (CPU)

* ggml : ggml_flash_attn_ext() support ALiBi (Metal)

* ggml : fix warning

* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)

ggml-ci

* ggml : fix assert message

* vulkan : add dev notes

* ggml : require mask when using ALiBi

ggml-ci

* convert : fix convert for refact models
2024-05-13 11:02:26 +03:00
284fac39fb metal : fix flash attention kernel requirements (llama/7169)
* metal : fix flash attention kernel requirements

ggml-ci

* metal : fix ggml_metal_supports_op

ggml-ci
2024-05-13 11:02:26 +03:00
fe454b8d9e Minor arithmetic improvement to mmvq wrapper kernel (llama/7172) 2024-05-13 11:02:26 +03:00
c114b75aee Vulkan Bugfixes and Improvements (llama/7084)
* Modify mat mat mul shader for mul_mat_id, modify mat vec mul shaders for single call batch operation

* Further work towards MoE, disabled for now

* Disable MoE code (not ready yet), fix a number of bugs in shaders and Vulkan code

* Add softmax with f16 mask and pos buffer support

* Disable mul_mat_id shaders for now

* Fix flake8

* Fix validation errors caused by empty buffers on larger batch sizes
2024-05-13 11:02:26 +03:00
4be936b88b CUDA: generalize FP16 fattn vec kernel (llama/7061)
* CUDA: generalize FP16 fattn vec kernel

* disable unsupported head sizes for AMD in test

* try AMD fix

* fix batch size 2-8

* partially revert changes
2024-05-13 11:02:26 +03:00
26c550f772 opencl : alignment size converted from bits to bytes (llama/7090)
* opencl alignment size should be converted from bits to bytes

Reference: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#CL_DEVICE_MEM_BASE_ADDR_ALIGN

> Alignment requirement (in bits) for sub-buffer offsets.

* Update ggml-opencl.cpp for readability using division instead of shift

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-05-13 11:02:26 +03:00
24f0aa460b Introduction of CUDA Graphs to LLama.cpp (llama/6766)
* DRAFT: Introduction of CUDA Graphs to LLama.cpp

* FIx issues raised in comments

* Tidied to now only use CUDA runtime (not mixed with driver calls)

* disable for multi-gpu and batch size > 1

* Disable CUDA graphs for old GPU arch and with env var

* added missing CUDA_CHECKs

* Addressed comments

* further addressed comments

* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake

* Added more comprehensive graph node checking

* With mechanism to fall back if graph capture fails

* Revert "With mechanism to fall back if graph capture fails"

This reverts commit eb9f15fb6fcb81384f732c4601a5b25c016a5143.

* Fall back if graph capture fails and address other comments

* - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS

- rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS

- updated Makefile build to enable CUDA graphs

- removed graph capture failure checking in ggml_cuda_error
  using a global variable to track this is not thread safe, but I am also not safistied with checking an error by string
  if this is necessary to workaround some issues with graph capture with eg. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context

- fixed several resource leaks

- fixed issue with zero node graphs

- changed fixed size arrays to vectors

- removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed

- removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row

- changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX

- code style fixes

- things to look into
  - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
  - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes

* fix build without cuda graphs

* remove outdated comment

* replace minimum cc value with a constant

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-13 11:02:26 +03:00
69efc39d5c metal : use vm_allocate instead of posix_memalign on macOS (llama/7078)
* fix: use `malloc` instead of `posix_memalign` in `ggml-metal.m` to make it not crash Electron proccesses

* fix: typo

* fix: use `vm_allocate` instead of `posix_memalign`

* fix: don't call `newBufferWithBytesNoCopy` with `NULL` when `ggml_metal_host_malloc` returns `NULL`

* fix: use `vm_allocate` only on macOS
2024-05-13 11:02:26 +03:00
a2ad810118 ggml : introduce bfloat16 support (llama/6412)
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16
which in practice ends up being 99.71% of Mistral 7b v0.2's weights
however there is currently no way other than fp32 to get the others

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that, by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves somewhere around -0.0024 to -0.0046 compared to using fp16

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function
2024-05-13 11:02:26 +03:00
1ae1a9cd56 metal : fix unused warning 2024-05-13 11:02:26 +03:00
b5521fea19 Add an option to build without CUDA VMM (llama/7067)
Add an option to build ggml cuda without CUDA VMM
resolves
https://github.com/ggerganov/llama.cpp/issues/6889
https://forums.developer.nvidia.com/t/potential-nvshmem-allocated-memory-performance-issue/275416/4
2024-05-13 11:02:26 +03:00
9b84195225 gguf-split: add --no-tensor-first-split (llama/7072) 2024-05-13 11:02:26 +03:00
11c1df0436 CUDA: CUDART < 11.7 workaround for __hmax, __hmax2 (llama/7019) 2024-05-13 11:02:26 +03:00
c754494fdd switch to using localizedDescription (llama/7010) 2024-05-13 11:02:26 +03:00
1bce67999d metal : remove deprecated error code (llama/7008) 2024-05-13 11:02:26 +03:00
6c39ea46b6 metal : log more info on error (llama/6987) 2024-05-13 11:02:26 +03:00
156a33a990 ggml : add Flash Attention (llama/5021)
* ggml : add ggml_flash_attn_ext API

* ggml : fix GQA support in ggml_flash_attn_ext

* ggml : online attention (CPU)

* metal : initial implementation

* metal : f16 precision

* metal : reduce branches

* metal : specialize for head size

* wip : 8 rows per simd group

* wip : 4 rows per simd group

* wip : template for rows per warp

* metal : parallelize across KV size

* metal : parallel reduce across heads

* metal : efficient flash_attn_f16 implementation

* metal : avoid redundant loads of the attention

* metal : scale and mask in matrix form

* metal : fix comment

* llama : avoid ggml_cast, use F32 query

* metal : add parallel reduce version (disabled)

* metal : move output into local memory + optimize

- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments

* metal : add tests, fix scaling, support C > 32

* metal : improve precision

* ggml : fix f16 mad

* metal : minor

* metal : support Q > 8

* tests : add ATTN tests

* metal : disable buffer allocation logs

* tests : more

* metal : faster inner loop for C == 32

* metal : fix array initialization

* tests : ifdef

* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext

* ggml : fix ggml_soft_max mask requirement

* cuda : fix soft_max to use correct mask size

* cuda : add flash_attn kernel (wip)

* metal : optimize softmax for C > 32

* metal : optimize softmax

* tests : minor fix

* cuda : avoid zeroing fragments

* tests : update dims

* cuda : fix __hisinf() result check

* cuda : avoid warp_reduce for smax

* cuda : use int instead of int64_t

Noticeably improves performance (thanks to Johannes)

* cuda : make loops use the same loop values

Thanks Johannes again for the tip

* cuda : unroll some of the loops

* cuda : avoid __hisinf branches

* cuda : use half2 in softmax

* cuda : switch to 1 warp for bs > 16

* cuda : speed-up reduce part of the kernel

* cuda : unroll Q*K^T loop

* cuda : fix -INF block check

* cuda : simplify softmax

* cuda : fix matrix names

* cuda : minor

* llama : adapt to F16 KQ_pos

* llama : adapt new models to F16 KQ_mask

* ggml : fix F16 store (ARM NEON)

* llama : fix type of KQ_mask and KQ_pos

* ggml : fix CPU soft_max

* tests : add hs=256

* cuda : fix build

* metal : improve perf via smaller int registers

* cuda : adapt soft_max to F16 mask and pos

* CUDA: faster FlashAttention, kernel for bs == 1

* 16 cols for Phi-2

* no vec for hs, no hs==256 ncols==32 for Volta

* adjust kernel selection logic

* 4 warps, 256 stride for all D

* no ncols == 64

* Multiple parallel blocks for batch size 1

* fix compile warnings

* fix excessive KQ_b loads

* fix cmake build

* fix KV cache padding, NaN from INFINITY (llama/6438)

* llama : flash_attn cparam + fix defrag

* server: support flash_attn param

* server: bench: enable flash_attn param

* CUDA: refactor host code, dyn. par. blocks

* fix flash_attn_vec_f16 race condition

* flush softmax exp below threshold to 0

* store temp KQ in registers

* Calculate KQ as FP32 if KQV has GGML_PREC_F32

* Add __hgt2_mask implementation for CUDA 11

* fix KQ FP32 precision fpr parallel_blocks > 1

* llama-bench : add -fa,--flash-attn arg

* metal : add BS=1 kernel for flash attention (llama/6508)

* metal : add BS=1 kernel for flash attention (wip)

* metal : support more than 1 warps

* metal : opts

* metal : opt

* metal : switch to parallel reduce

* metal : reduce registers

* metal : simplify

* metal : initial FA vec kernel

* metal : use F32 attention accumulators

* batched-bench : add fattn arg

* llama : simplify llama_build_kv_store

ggml-ci

* llama : adapt build_olmo to changes

* ggml : fix arm fp16 store on windows

* metal : clean-up

* metal : clean-up kernel code

* metal : minor

* tests : remove benchmarks

ggml-ci

* ggml : fix avx512 const correctness

ggml-ci

* ggml : fix soft_max with bias on CPU

ggml-ci

* common : print --flash-attn in help

* ggml : fix num dimensions in ggml_flash_attn_ext

* llama : force disable flash attention for incompatible models

* ggml : ggml_soft_max support F16/F32 mask/pos

ggml-ci

* cuda : uint -> uint32_t

* cuda : "constexpr dim3" -> "const dim3"

ggml-ci

* cuda : try to fix __hgt2_mask

ggml-ci

* ggml : add TODO's for F16/F32 mask/pos support in other backends

* llama : replace bool need_kq_pos with use_alibi

* llama : prep ALiBi support for BERT models

ggml-ci

* llama : fix n_batch requirements

ggml-ci

* cont

* server : add help for --flash-attn arg

* llama : disable FA for AMD

* tests : remove TMP_ATTN_BENCH

ggml-ci

* llama : support save/load state with FA enabled

ggml-ci

* ci : add CUDA save-load-state tests

ggml-ci

* llama : llama_kv_cache_clear zeroes data + fix save-load seq

ggml-ci

* llama : fix copy-paste errors, add TODO

* llama : disallow incompatible states

* llama : update llama_state_get_size after v_trans field

* metal : remove tmp log

* llama : add static reminder for llama_state_get_size

* metal : fix max nsg

ggml-ci

* ci : fix arg order

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
2024-05-13 11:02:26 +03:00
5167ebdfca ggml : fix __MSC_VER -> _MSC_VER (llama/6977)
ggml-ci
2024-05-13 11:02:26 +03:00
b574646d75 Fix more int overflow during quant (PPL/CUDA). (llama/6563)
* Fix more int overflow during quant.

* Fix some more int overflow in softmax.

* Revert back to int64_t.
2024-05-13 11:02:26 +03:00
388c3462a6 gguf : enforce that tensor names are unique (llama/6905)
* not allow adding duplicated tensor name

* no duplicated tensor while reading gguf

* typo

* throw exception inside llama_model_loader

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-13 11:02:26 +03:00
9ad202bee9 add device version in device list (llama/6959)
Co-authored-by: arthw <>
2024-05-13 11:02:26 +03:00
f0d3fb4a7e Reset schedule earlier to allow overlap with ggml graph computation on device (llama/6933)
* Reset schedule earlier to allow overlap with graph computation on device
2024-05-13 11:02:26 +03:00
9d4c8b8aa5 add basic tensor data validation function (llama/6884)
* add basic tensor data validation function

* add --check-tensors command line argument

tensor validation is disabled by default and can be enabled by adding
`--check-tensors` to the command line arguments.

quantize always validates tensors.
2024-05-13 11:02:26 +03:00
ecfac1e240 gguf : fix mismatch between alloc and free functions (llama/6929) 2024-05-13 11:02:26 +03:00
6f7140f568 Merge pull request from GHSA-p5mv-gjc5-mwqv
* always use calloc

clamp n_kv on failure to read a kv

* ggml : alternative ctx->header.n_kv update

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-13 11:02:26 +03:00
05b17112cf ggml : fix redefinition of vaddvq_f32 for 32-bit ARM (llama/6906) 2024-05-13 11:02:26 +03:00
a15fb5cd79 ggml : fix MIN / MAX macros (llama/6904)
ggml-ci
2024-05-13 11:02:26 +03:00
63fd148d8f ggml : move 32-bit arm compat in ggml-impl.h (llama/6865)
ggml-ci
2024-05-13 11:02:26 +03:00
6c3971b29b llamafile : improve sgemm.cpp (llama/6796)
* llamafile : improve sgemm.cpp

- Re-enable by default
- Fix issue described in #6716
- Make code more abstract, elegant, and maintainable
- Faster handling of weirdly shaped `m` an `n` edge cases

* Address review comments

* Help clang produce fma instructions

* Address review comments
2024-05-13 11:02:26 +03:00
a6d264f331 ggml : fix calloc argument ordering. (llama/6820)
Latest gcc complains here:
/home/airlied/devel/llama.cpp/ggml-alloc.c: In function ‘ggml_gallocr_new_n’:
/home/airlied/devel/llama.cpp/ggml-alloc.c:374:59: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
  374 |     ggml_gallocr_t galloc = (ggml_gallocr_t)calloc(sizeof(struct ggml_gallocr), 1);
      |                                                           ^~~~~~
/home/airlied/devel/llama.cpp/ggml-alloc.c:374:59: note: earlier argument should specify number of elements, later size of each element

and a bunch more.

calloc is specified to take nmemb first then size, so realign the code.

In a couple of places there was a * x, 1 so I fixed those to use calloc properly.
2024-05-13 11:02:26 +03:00
2959686019 ggml : fix ggml_backend_cpu_supports_op() for CPY (llama/0) 2024-05-13 11:02:26 +03:00
c96b0a938e ggml : group all experts in a single ggml_mul_mat_id (llama/6505)
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy

* cuda : fix bin bcast with non-cont src0

* test-backend-ops : only run all mul mat tests for base types

* llama : disable moe offloading with SYCL

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-13 11:02:26 +03:00
c97796aa0f ggml : fix llamafile sgemm wdata offsets (llama/6710)
ggml-ci
2024-05-13 11:02:26 +03:00
7a4f7d825e ggml : add llamafile sgemm (llama/6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-05-13 11:02:26 +03:00
fdb2c87350 llama : add qwen2moe (llama/6074)
* support qwen2moe

* fix-review

* metal : support unary ops for nelements % 4 != 0

* metal : require contiguousness for float4 unary kernels

* metal : require contiguousness for float4 unary kernels (cont)

* fix-review

* names : for brevity "SHARED_EXP" -> "SHEXP"

* llama : reuse build_moe_ffn()

* llama : add model type name

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-13 11:02:26 +03:00
98c0b77e0c fix mul_mat_id() for new input, make the ut pass (llama/6682) 2024-05-13 11:02:26 +03:00
9d6d50d933 Added support for GGML_OP_CLAMP in Metal (llama/6662)
* Added support for GGML_OP_CLAMP in Metal

* Corrected size

---------

Co-authored-by: dave-fl <dave@Davids-MacBook-Pro.local>
2024-05-13 11:02:26 +03:00
c1320c1f0c fix memcpy() crash, add missed cmd in guide, fix softmax (llama/6622)
* disable mmap to fix memcpy crash, add missed cmd in guide, fix softmax

* refactor to disable mmap for SYCL backend

* fix compile error in other os

* refactor the solution, use host buf to fix it, instead of disable mmap

* keep to support mmap()

* use host buff to reduce malloc times

* revert to malloc/free solution, for threaad safe
2024-05-13 11:02:26 +03:00
66aaf03a7a CUDA: fix matrix multiplication logic for tests (llama/6667) 2024-05-13 11:02:26 +03:00
00a0947c65 metal : unify mul_mv_id kernels (llama/6556) 2024-05-13 11:02:26 +03:00
60f3713026 llama : add gguf_remove_key + remove split meta during quantize (llama/6591)
* Remove split metadata when quantize model shards

* Find metadata key by enum

* Correct loop range for gguf_remove_key and code format

* Free kv memory

---------

Co-authored-by: z5269887 <z5269887@unsw.edu.au>
2024-05-13 11:02:26 +03:00
37e6757453 feat: implemented sigmoid function (ggml/806)
* added sigmoid function

* implemented metal kernel for sigmoid

* implemented cuda kernel for sigmoid

* added sigmoid unary op and incremented count
2024-05-13 11:02:26 +03:00
8dcefdf4a9 build: fix and ignore msvc warnings (ggml/805) 2024-05-13 11:02:26 +03:00
73d13ad19a ggml : expose SSE3 and SSSE3 for MSVC when AVX is available (#2128) 2024-05-08 18:33:43 +03:00
b6680fab50 build : improve disabling AVX-512 (#2129)
* cmake : make WHISPER_NO_AVX512=ON disable all subsets of AVX-512

Previously it happened only for MSVC, but it makes sense to have the
same behavior for other compilers too.

* make : reorder x86 ISA extensions in chronological order

And update compiler flags at the end to ease modifying conditions.

* make : support WHISPER_NO_AVX512=1 for disabling all AVX-512 subsets.

That way you do not have to override each AVX-512 subset setting
individually if it has been turned on during autodetection.
2024-05-08 18:32:43 +03:00
f760756078 minor: add CMakeSettings.json to gitignore (#2094) 2024-05-08 11:03:21 +03:00
58210d6a76 examples : fix node compilation (#2115)
* node : fix compilation and update examples

* node : fix readme

* Update addon.node test
2024-05-02 22:52:55 +01:00
8fac6455ff make : change GNU make default CXX from g++ to c++ (#2100) 2024-04-28 22:54:21 +01:00
22b6598cc9 Remove unnecessary memory reallocation in fft (#2080)
fft_out needs to be twice the frame_size, not the frame_step.  It is resized in fft() anyway, but this change prevents an unnecessary reallocation.

n_fft must match the mel filter size, so it is best not to calculate it from the framesize.

We only need to get the magnitudes for half the spectrum since the other half is a mirror and not used in the mel filter loop later.
2024-04-28 18:36:12 +01:00
858452d58d models : disable old script (#2079) 2024-04-24 14:56:30 +03:00
7f85e1d7fd whisper : more prominent log message for sub-1s audio (#2065) 2024-04-24 14:46:06 +03:00
b0c3cbf2e8 main : pass nullptr when regex is empty (#2070) 2024-04-17 12:23:47 +03:00
a750868428 readme : add up-to-date repository for Python bindings (#2063)
README
2024-04-16 14:15:52 +03:00
7395c70a74 release : v1.5.5 2024-04-16 14:08:31 +03:00
9fab28135c server : add dtw (#2044)
* server.cpp: add dtw

* Update examples/server/server.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-15 22:16:58 +03:00
08d3eef97d build : fix embedded Metal library generation (#2045) 2024-04-15 20:23:05 +03:00
1b5439a6c2 node : support no timestamps (#2048)
* fix: node: do not compute timestamps if you do not need them

* feat: add no_timestamps parameter to node addon
2024-04-15 20:03:34 +03:00
c7f95b7ca2 build : detect AVX512 in Makefile, add AVX512 option in CMake (#2043)
* make : add AVX512 detection to Makefile and CMakeLists.txt

* make : autodetect more AVX512 instruction subsets

* cmake : do not default to AVX512, must be enabled explicitly

* cmake : enable a set of AVX512 subsets, when AVX512 is turned on

* make : consolidate AVX512 subsets, add AVX512 VBMI

* cmake : revert to NO AVX512 setting, add settings for AVX512 VNNI and VBMI

* make : re-introduce AVX512VNNI back

* cmake : remove superfluous comment line
2024-04-15 20:02:09 +03:00
5c554c04ff whisper.nvim : fix missing reference to "model" variable (#2049) 2024-04-15 19:41:28 +03:00
c383f091a1 whisper : update grammar-parser.cpp (#2058)
preceeding -> preceding
2024-04-15 19:40:27 +03:00
8f253ef3af sync : ggml 2024-04-09 20:27:55 +03:00
c7dc37f97c license : update copyright notice + add AUTHORS 2024-04-09 20:27:44 +03:00
526332873b llama : add Command R Plus support (llama/6491)
* Add Command R Plus GGUF

* Add Command R Plus GGUF

* Loading works up to LayerNorm2D

* Export new tensors in 1D so they are not quantized.

* Fix embedding layer based on Noeda's example

* Whitespace

* Add line

* Fix unexpected tokens on MPS. Re-add F16 fix. ((Noeda)

* dranger003: Fix block index overflow in CUDA dequantizing.

* Reverted blocked multiplication code as it still has issues and could affect other Llama arches

* export norms as f32

* fix overflow issues during quant and other cleanup

* Type convention

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* dranger003: Fix more int overflow during quant.

---------

Co-authored-by: S <seast@Ss-Mac-Studio.local>
Co-authored-by: S <s@example.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 20:26:18 +03:00
1d2721ca72 remove row=1 cond (llama/6532) 2024-04-09 20:26:18 +03:00
219e601dab support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M (llama/6521) 2024-04-09 20:26:18 +03:00
3b8aade3c2 scripts : update sync 2024-04-09 20:25:50 +03:00
52ccd4a3a8 files : rename ./extra to ./scripts 2024-04-09 20:13:41 +03:00
5275074d37 whisper : fix DTW memory access (#2012)
* Fix DTW memory access

* Memory fix - Apply changes from denersc
2024-04-09 18:38:19 +03:00
c15b4cda7d common : fix file-handle leak in read_wav() (#2026)
Now it cleans up in case of error.
2024-04-09 18:34:34 +03:00
d3cfb6ca2b main : set stdin to binary mode on Windows (#2025) 2024-04-09 18:33:32 +03:00
956ef860bc cmake : support for CPU BLAS build via Intel MKL (#2024) 2024-04-09 18:32:46 +03:00
671b4bde6c main : allow a response-file as the sole parameter (#2019)
* The "main" example now allows a response-file as the sole parameter.

A response-file is a text file with command-line parameters, one per line.
Prefix the name of the response-file with "@" to identify it as such.
It's used under MS Windows to work around command-line length limits.
It may be useful under other platforms to simplify character-escaping.

* minor : style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 18:31:16 +03:00
c8eeb93a6a whisper : suppress tokens with a regex (#1997)
* Allow a regular expression to describe tokens to suppress.

Example: --suppress-tokens-re "[,\.]|[ ]?[0-9]+" will suppress commas, periods, and numeric tokens.

Technique inspired by https://github.com/openai/whisper/discussions/1041

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Blind change to fix Java test.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 18:27:28 +03:00
319fe5146e cmake : create solution folders (#2004)
* Create solution folders in the CMake build.

* Fixed non-SDL2 build.

* Fixed emscripten build.
2024-04-09 18:23:33 +03:00
13c22321d1 sync : ggml 2024-04-07 17:04:56 +03:00
ccbe9d5676 extra : sync grammar-parser 2024-04-07 17:04:22 +03:00
81a3c41aa0 talk-llama : sync llama.cpp 2024-04-07 16:21:08 +03:00
a50207c65d sync : ggml 2024-04-07 16:18:11 +03:00
97878e53fd sync : llama.cpp (skip)
ggml-ci
2024-04-07 16:15:57 +03:00
61b05815e0 Fixed minor bug when enabling FP16 for non intel targets (llama/6464)
* moved INTEL_MKL guard from gemm_impl to gemm (wrapper)

* Update ggml-sycl.cpp

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>

---------

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
2024-04-07 16:15:57 +03:00
1dce94cf26 ggml : mul_mat_id use the same tensor for all the experts (llama/6387)
* ggml : update mul_mat_id to use the same tensor for all the experts

* update cuda

* minor

* update metal

* update test-backend-ops

* fix cuda

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update convert.py

* update convert-hf-to-gguf.py

* update convert.py for mixtral hf models

* Update convert-hf-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* cuda : support non-pow-2 number of experts

* allow quantize to work for split and merged experts models in the same way

* cleanup + disable mmap automatically with split tensors models

* update imatrix

* test-backend-ops : test qwen argsort

* update grok model loading

* llama : add merged experts tensors to the grok tensor map

* minor

* gguf : bump version

* fix quantizing of merged experts

* convert-hf-to-gguf.py : update grok (untested)

* make linter happy

* cuda/argsort : use shared memory instead of pool memory

* convert : fix grok tensor names

* metal : add support for non-pow-2 argsort

* llama : more loader cleanup, better error checking

* cuda : fix warning

* llama : still use mmap for loading old models, but copy the data to a host buffer

* add review note

* llama : remove ffn tensor counting + add sanity check

ggml-ci

* convert : fix handling of n_experts == None

ggml-ci

* imatrix : fix ncall counters

* llama : produce error if imatrix size does not match

* quantize : terminate on errors + trace logs

ggml-ci

* metal : pad shared memory to 16 bytes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-07 16:15:57 +03:00
f12e982c0b Disable iqx on windows as WA (llama/6435)
* disable iqx on windows as WA

* array instead of global_memory
2024-04-07 16:15:57 +03:00
fa966b9b40 Vulkan k-quant mmq and ggml-backend offload functionality (llama/6155)
* Fix Vulkan no kv offload incoherence

* Add k-quant mul mat mat shaders

* Rework working buffer allocation, reduces vram use noticeably

Clean up cpu assist code, replaced with ggml-backend offload function

* Default to all dedicated GPUs

* Add fallback for integrated GPUs if no dedicated GPUs are found

* Add debug info which device is allocating memory

* Fix Intel dequant issue

Fix validation issue

* Fix Vulkan GGML_OP_GET_ROWS implementation

* Clean up merge artifacts

* Remove Vulkan warning
2024-04-07 16:15:57 +03:00
b83a9fc9d3 fix set main gpu crash (llama/6339) 2024-04-07 16:15:56 +03:00
3adbf2fb03 ggml : fix bounds checking of zero size views (llama/6347) 2024-04-07 16:15:56 +03:00
700d146127 backend : fix typo in scheduler documentation (ggml/781)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-07 16:15:56 +03:00
a74fde9b4c extra : sync ggml-cuda folder 2024-04-07 16:10:44 +03:00
1d7657f409 ggml: bypass code incompatible with CUDA < 11.1 (#2020)
`cudaHostRegisterReadOnly` parameter was only introduced in CUDA 11.1

See this issue for more details:
https://github.com/ggerganov/whisper.cpp/issues/2007
2024-04-04 14:49:24 +02:00
ac283dbce7 ci : add building in MSYS2 environments (Windows) (#1994) 2024-03-30 09:20:20 +02:00
1e8f28c42a build : use pkg-config for OpenBLAS (#1778)
* make : use pkg-config for finding CFLAGS & LDFLAGS needed by OpenBLAS

That way building on *nix like environments (including MSYS2 on Windows)
with WHISPER_OPENBLAS=1 works out of the box.

Fix handling of WHISPER_OPENBLAS, so that empty value or 0 won't be
misinterpreted by make as enabled.  Mind that it's not intended to
detect CMake false constants (OFF NO FALSE N).  make is not CMake.

By default OpenBLAS with 64-bit interface is used, but that can be
changed with `WHISPER_OPENBLAS_INTERFACE64=0` if 32-bit one is desired.

If OpenBLAS headers and library are respectively in include/ and lib/
subdirectories of given path, then you can specify it, e.g.
`OPENBLAS_PATH=/usr/local/openblas`, and this will take precedence over
any pkg-config file.

If there is no pkg-config file (.pc) for OpenBLAS and OPENBLAS_PATH is
empty, then headers are assumed to be in /usr/include/openblas and
library as assumed to be called 'openblas64' (or 'openblas' if
`WHISPER_OPENBLAS_INTERFACE64=0`).  If different headers location should
be used, then it can be done, e.g.
`WHISPER_BLAS_CFLAGS=-I/usr/local/include/openblas`.
If different library should be used, it can be specified, e.g.
`WHISPER_BLAS_LIB=openblasp64` (pthreads version as seen on Fedora), or
you can provide LDFLAGS needed to link with OpenBLAS directly:
`WHISPER_BLAS_LDFLAGS="-L/usr/local/lib/openblas -lopenblas64"`.

Current solution is flexible enough to handle most cases out there
without needlessly hardcoding possible OpenBLAS installation details.

* cmake : fix how pkg-config is used for finding include dirs and libraries needed by OpenBLAS

That way building on *nix like environments (including MSYS2 on Windows)
with -DWHISPER_OPENBLAS=ON should work out of the box as long as you
have CMake 3.25 or newer.

Make OPENBLAS_PATH environment variable supported not only on Windows.
It sets OpenBLAS include dir to ${OPENBLAS_PATH}/include and library to
${WHISPER_BLAS_LIB} (name without prefixes and suffixes) in
${OPENBLAS_PATH}/lib and avoids further package finding.

By default OpenBLAS with 64-bit interface is used (equivalent to setting
`-DWHISPER_BLAS_LIB=openblas64`), but that can be changed with
`-DWHISPER_OPENBLAS_INTERFACE64=OFF` (equivalent to setting
`-DWHISPER_BLAS_LIB=openblas`) if 32-bit one is desired.

Turn on BLA_STATIC for FindBLAS only when WHISPER_STATIC is enabled.
BLA_STATIC may not work as expected for pkg-config based operation.

Get rid of supporting BLAS_HOME environment variable.  If OPENBLAS_PATH
is insufficient in your case, there is no pkg-config file to rely on,
then you can manually specify include dir, e.g.
`-DBLAS_INCLUDE_DIRS=/usr/local/include/openblas`, and library, e.g.
`-DBLAS_LIBRARIES=/usr/local/lib/libopenblas.so`.

* make / cmake : use OpenBLAS with 32-bit interface by default.

OpenBLAS w/o INTERFACE64=1 vel USE_64BITINT=1 seems to be more common.

* cmake : hardcode "lib" prefix for OpenBLAS lib filename (even on Windows)

* cmake : hardcode OpenBLAS library name when building in MSVC (Windows)

Most *nix like environments (including MSYS2 on Windows) have OpenBLAS
packages that allow coexistence of OpenBLAS builds with 32-bit and
64-bit interface (w/o and w/ OPENBLAS_USE64BITINT defined) and they
differ by not having or having "64" suffix in their library filenames.
That's not the case for OpenBLAS prebuilt libraries for Windows.
2024-03-29 15:53:26 +02:00
fc366b807a main : add command-style grammar (#1998)
* Implemented command-style grammar in the main example.

Mostly just copied the relevant parts from the command example.

* main : code style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-28 12:02:10 +02:00
9fb308d90f make : add grammar parser to common objects 2024-03-28 11:59:48 +02:00
2948c740a2 sync : ggml (#2001)
* sync : update scripts

* sync : ggml

* talk-llama : sync llama.cpp

* make : WHISPER_CUBLAS -> WHISPER_CUDA

* ci : try to fix sycl build

* talk-llama : fix make build
2024-03-27 18:55:10 +02:00
1558ec5a16 whisper : improve handling of prompts (#1981)
* whisper : improve handling of prompts

* whisper : add whisper_token_count helper
2024-03-25 14:48:19 +02:00
fff24a0148 whisper : improve support for distil-large-v3 (#1982) 2024-03-21 18:53:30 +02:00
48a145207e ruby : fix build (#1980) 2024-03-21 07:40:09 +02:00
79d5765e7e docker : libcuda.so.1 in PATH (#1966) 2024-03-20 18:45:15 +02:00
04e48094e4 readme : add Fedora dependencies (#1970)
* README.md

fix documentaion and added fedora liunx dependencies for stream build

* fix documentaion and added fedora liunx dependencies for command build

* fix documentaion and added fedora liunx dependencies for talk build

* fix documentaion and added fedora liunx dependencies for talk-llama build

* reverted back mistakenly removed MacOS documentaion
2024-03-20 18:42:11 +02:00
741abb162c whisper : token-level timestamps with DTW (#1485)
* whisper.cpp: impl dtw algo

* WIP: producing and placing DTW timestamps on tokens

* Fix compile and assertion errors. Attempt to DTW timestamp with single_segment=false.

* Fix mistake causing incorrect alignment of dtw timestamps

* implement N_TOP_MOST and CUSTOM alignment heads setting

* whisper: fix typo on alignment heads enum

* Fix issues related to changes in whisper.cpp

* Fixed excessive memory use when using DTW timestamps. Other minor fixes to DTW timestamping function

* decoder: save cross QKs only if requested

* Calling median filter with ggml_map_custom1

* Reimpl aheads n_top_most and custom. Sanity checks on chosen aheads

* Copying cross QKs from decoder backend correctly

* dtw: cleanup

* Fix incorrect n_frames passed to dtw when near end of audio

* Fix aheads_masks_init for backend != CPU

* whisper : minor style

* main : add dtw (wip)

* whisper: fix invalid memory access in aheads_masks_init

* main : add dtw (cont)

* whisper : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-20 18:25:26 +02:00
e7794a868f examples : rename --audio-context to --audio-ctx per help text (#1953) 2024-03-18 17:53:33 +02:00
725350d4ea whisper : set outputs from conv graph (#1959) 2024-03-16 17:30:55 +02:00
906c73b219 alloc : fix allocation data of pre-allocated leafs 2024-03-16 17:15:45 +02:00
00d80ff965 cmake : copy ggml-common.h to bin 2024-03-16 17:15:44 +02:00
1b553b9817 gitignore : .vimspector.json 2024-03-16 16:26:35 +02:00
de4d067f1e talk-llama : sync llama.cpp 2024-03-15 14:21:59 +02:00
e715f6a601 sync : ggml 2024-03-15 14:12:19 +02:00
f60ccfd83b update examples and tests 2024-03-15 14:01:14 +02:00
3753a2b2a8 ggml : add ggml-common.h 2024-03-15 14:01:14 +02:00
592dd25615 ggml : designate enum vals for integer types (llama/6050) 2024-03-15 14:01:14 +02:00
c8709d4604 metal : build metallib + fix embed path (llama/6015)
* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embeded library build

ggml-ci

* ci : fix iOS builds to use embedded library
2024-03-15 14:01:14 +02:00
8932c2d6ce llama : add pipeline parallelism support (llama/6017)
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increase to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-casual attention

* llama : do not limit n_batch to n_ctx with non-casual attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-15 14:01:13 +02:00
2bddfdd7c8 Update get version (llama/6025) 2024-03-15 14:01:13 +02:00
46e3c3f112 ggml : reuse quantum structs across backends (llama/5943)
* ggml : reuse quant blocks across backends

ggml-ci

* ggml : define helper constants only for CUDA and SYCL

ggml-ci

* ggml : define helper quantum constants for SYCL

ggml-ci
2024-03-15 14:01:13 +02:00
ef24ae0c7d ggml : fix UB in IQ2_S and IQ3_S (llama/6012) 2024-03-15 14:01:13 +02:00
a753926f02 sycl : update IQ1_S kernels (WIP - not working!) (llama/5995)
* sycl : try to fix after IQ1_S changes

* sycl : iq1s_grid -> iq1s_grid_gpu

* sycl : fix grid type
2024-03-15 14:01:13 +02:00
9dc60fc02d 1.5 bit: we can do even better (llama/5999)
* iq1_s: we can do even better

Spent one of the 4 scale bits on a signs of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!

* iq1_s: make scalar and AVX2 work with the new version

* iq1_s: make Neon work with new version.

~10% drop in performance, so will need some more work.

* iq1_s: make Metal work with new version

* iq1_s: very slightly faster dequantize on Metal

* iq1_s: fix dequantize on the CPU

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-15 14:01:13 +02:00
d73a63629e ggml, ci : Windows ARM runner and build fixes (llama/5979)
* windows arm ci

* fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64

* fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned`

* fix `error C2065: '__fp16': undeclared identifier`
2024-03-15 14:01:13 +02:00
f79d0d4f74 Better 1.5 bit quantization (llama/5971)
* Trying blocvks of 16 for IQ1_S - seems slightly better

* iq1s_blocks16: Adjust scale fudge factor to 1.125

* iq1s_blocks16: going to blocks of 32

with 2048 lattice points, so same bpw.
This is even better than blocks of 16.
Should I try blocks of 64? But to keep the same
bpw, when I go to 4096 lattice points, I need to
remove blocks alltogether and just have superblocks of
256 weights.

* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment

* iq1s_blocks16: scalar and AVX2 dot products

* iq1s_blocks16: CUDA dot product

* iq1s_blocks16: Metal works, Neon does not

Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s).
Not seeing the bug in the Neon implementation for now.

* iq1s_blocks16: fixed Neon

* iq1s_blocks16: very slightly faster TG on Metal

Still pathetic at 37 t/s

* iq1s_blocks16: speedup Metal by packing codebook into uint32_t's

* Formatting

* iq1s_blocks16: uint32_t codebook is also better in CUDA

TG-128 is now 204 t/s up from 194 t/s.
PP-512 is 5890 t/s, so significantly better than other quants

* iq1s_blocks16: slightly faster Neon dot product

* iq1s_blocks16: faster AVX2 dot product

* iq1s_blocks16: adjust to ggml-common.h

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-15 14:01:12 +02:00
4f88940ff6 Add q3_s and q1_s (llama/5886)
* Add q3_s and q1_s

* fix compilation

* fix build

* fix build

* fix build

* enable ops

* rm macro

* increase grid space
2024-03-15 14:01:12 +02:00
7bdb1de9ec metal : move mm_id indices to shared mem (llama/5982) 2024-03-15 14:01:12 +02:00
653d2e8ff9 ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) (llama/5951) 2024-03-15 14:01:12 +02:00
2fef660d0a ggml : remove old quantization functions (llama/5942)
* ggml : remove old quantization functions

ggml-ci

* ggml : simplify ggml_quantize_chunk

ggml-ci

* ggml : restrict correctness

ggml-ci

* ggml : remove hist data from the quantization API

ggml-ci

* tests : remove hist usage in test-backend-ops

ggml-ci

* vulkan : remove hist and fix typo
2024-03-15 14:01:12 +02:00
24eba5a2ff ggml : add ggml-common.h to deduplicate shared code (llama/5940)
* ggml : add ggml-common.h to shared code

ggml-ci

* scripts : update sync scripts

* sycl : reuse quantum tables

ggml-ci

* ggml : minor

* ggml : minor

* sycl : try to fix build
2024-03-15 14:01:12 +02:00
6e9d3aa32d llama : support Mamba Selective State Space Models (llama/5328)
* mamba : begin working on support for Mamba SSM

* mamba : begin figuring out how to (ab)use the kv cache for Mamba

* mamba : recurrent inference almost works, but incoherent

* mamba : recurrent inference WORKS!!!

* convert : optionally use d_conv and d_state from config.json for Mamba

* mamba : refactor recurrent conv, resulting in 20% perf increase

It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.

I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.

* ggml : parallelize ggml_exp

This results in 8% faster token generation for Mamba-130M.

* mamba : simplify the conv step with a self-overlapping view

Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.

Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.

Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).

* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32

Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.

* mamba : fix self-overlapping view depth stride

* mamba : handle batches of more than 1 token

This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.

Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.

* ggml: add ggml_ssm_scan to help with parallel selective scan

If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.

* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation

This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.

* mamba : very basic quantization support

Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)

Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.

Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.

* convert : fix wrong name for layer norm weight of offical Mamba models

I was using Q-bert/Mamba-* models before, which have a slighlty different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")

* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator

This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.

However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.

* convert : for Mamba, also consider the "MambaLMHeadModel" arch name

It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json

* mamba : fix vocab size problems with official models

The perplexity was waaaay to high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.

Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.

* ggml : remove ggml_exp and ggml_soft_plus

They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.

* mamba : remove some useless comments

No code change.

* convert : fix flake8 linter errors

* mamba : apply suggestions from code review

* mamba : remove unecessary branch for row-wise ssm_state and C multiplication

It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.

* ggml : in ggml_ssm_scan, use more appropriate asserts

* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32

* mamba : multiple sequences, but one at a time

This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).

The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)

Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.

* mamba : support llama_kv_cache_seq_cp

This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.

Each KV cell is dedicated to the sequence ID corresponding to its own index.

* mamba : use a state mask

It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.

inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).

* llama : replace the usage of n_ctx with kv_self.size in many places

* mamba : use n_tokens directly instead of n_tok

* mamba : in comments, properly refer to KV cells instead of slots

* mamba : reduce memory usage of ggml_ssm_scan

From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.

The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.

* mamba : simultaneous sequence processing

A batch can now contain tokens from multiple sequences.

This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.

However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.

* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba

This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).

Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.

Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.

* llama : add inp_s_seq as a new input tensor

The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.

The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.

Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).

* mamba : support llama_kv_cache_seq_cp copy chains

* mamba : support shifting and dividing the kv cache pos

* mamba : make the server and parallel examples work with whole sequences

A seq_id is dedicated to the system prompt in both cases.

* llama : make llama_kv_cache_seq_rm return whether it succeeded or not

* mamba : dedicate an input tensor for state copy indices

This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.

* mamba : adapt perplexity, batched, and batched-bench examples

* perplexity : limit the max number of sequences

This adapts to what the loaded model can provide.

* llama : add llama_n_max_seq to get the upper limit for seq_ids

Used by the perplexity example.

* batched : pass n_parallel to the model's context params

This should have been there already, but it wasn't.

* batched-bench : reserve sequences to support Mamba

* batched-bench : fix tokens being put in wrong sequences

Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.

* mamba : stop abusing attention metadata

This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.

This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
 will not require breaking existing converted Mamba models again)

* gguf-py : add new KV metadata key-value pairs for Mamba

* llama : add new metadata key-value pairs for Mamba

* llama : guard against divisions by zero when n_head is 0

* mamba : rename "unlimited" KV cache property to "recurrent"

* mamba : more correctly update the "used" field of the KV cache

* ggml : in ggml_ssm_scan, use a threshold for soft_plus

This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.

* convert : for Mamba, fallback to internal NeoX tokenizer

The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.

* mamba : support state saving and restoring

* ggml : implicitly pass src tensors through dst for Mamba-related ops

* mamba : clarify some comments

* server : fix cache_tokens not getting correctly resized

Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.

For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.

* convert-hf : support new metadata keys for Mamba

For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406

* mamba : rename metadata to be more similar to transformers library

This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".

* mamba : support mamba-*-hf models

These models share their token_embd.weight with their output.weight

* mamba : add missing spaces

This is purely a formatting change.

* convert-hf : omit output.weight when identical with token_embd.weight

Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.

* readme : add Mamba to supported models, and add recent API changes

* mamba : move state_seq and state_mask views outside layer loop

A few tensors were also missing `struct` in front of `ggml_tensor`.
2024-03-15 14:01:12 +02:00
9ae0d18856 extra : update sync scripts after ggml-common.h 2024-03-15 14:00:53 +02:00
a56f435fd4 whisper : document whisper_batch.n_seq_id (#1942)
To prevent other people from attempting to remove it, as I did.
2024-03-10 16:55:22 +02:00
ec166499d8 whisper : improve beam search candidate diversity (#1947)
As of #1486, whisper.cpp uses a unified KV cache with KQ masking.
As a result, depending on their location in the batch,
identical sequences in a batch can have slightly different outputs
due to floating point rounding errors during reduction.
See the discussion in #1941 for more details.

The beam search code used "has identical sum of log probabilities"
as a shorthand for "is an identical token sequence". However, per above,
identical tokens do not necessarily result in identical probabilities.

Instead, explicitly compare on sequences.
This is linear in cost when they are identical,
but the lengths are always small and the comparisons are cheap.

This increases diversity during beam search.

This improves output quality for some short samples I've been working
with, at no detectable performance cost.
I haven't checked against larger corpuses.

Fixes #1941
2024-03-10 16:54:43 +02:00
ccf022f970 bindings/go : add linker flags to make metal work (#1944)
The first two are required to build.
The last one is to make it actually detect the GPU.

Fixes #1899, at least for me
2024-03-09 18:50:44 +02:00
2852e1af55 whisper : make beam candidate sort more stable (#1943)
All else being otherwise equal, this encourages the beam candidate
selection to re-use the same decoder, which slightly
reduces the cache size.

I wouldn't expect it to make much of a performance difference,
but it helps when debug printing the cache and beam.

Added as part of understanding #1941.
2024-03-09 18:50:03 +02:00
ce945b50c3 ggml : try fix 32-bit arm compat (#1938)
* ggml : try fix 32-bit arm compat

* ggml : fix cont
2024-03-08 23:45:07 +02:00
2f5a5a66dd talk-llama : use llama_decode instead of llama_eval 2024-03-08 12:04:43 +02:00
8e409d1113 talk-llama : sync llama.cpp 2024-03-08 11:55:50 +02:00
05d1b61af4 talk-llama : sync llama.cpp 2024-03-08 11:52:47 +02:00
647cae178a sync : ggml 2024-03-08 11:39:34 +02:00
bae7c23fbf Revert "[SYCL] fix error when set main gpu to non-zero (llama/5901)" (llama/5918)
This reverts commit ceca1aef0738b57951cd12c603c3477e75312dec.
2024-03-08 11:38:33 +02:00
18ea187d42 fix error when set main gpu to non-zero (llama/5901)
* fix error when set main gpu to non-zero

* fix delete condition
2024-03-08 11:38:33 +02:00
1daeffca54 ggml : use SYS_get_cpu if SYS_getcpu is not defined (llama/5906)
Fixes #5694
Fixes ggerganov/whisper.cpp#1894
2024-03-08 11:38:33 +02:00
2f6f1d4465 ggml : use uint8x16_t return type for ggml_vqtbl1q_u8 (llama/5894)
* use uint8x16_t

* Update ggml-quants.c
2024-03-08 11:38:33 +02:00
7ff1894c34 add wait() to make code stable (llama/5895) 2024-03-08 11:38:33 +02:00
8edfc54c2b quants : use MM256_SET_M128I consistently to fix gcc 7 build (llama/5889) 2024-03-08 11:38:33 +02:00
9c399689ec Vulkan Improvements (llama/5835)
* Improve dequant shaders, add fast q4_0 dequant

* Optimize dmmv non-kquants for GCN

Remove unnecessary SPIR-V shader duplication

* Fix q4_0 dequant dispatch sizes

Fix backend free bug

* Optimize dequant shaders for q4_1, q5_0, q5_1 and q8_0

* Add unary and binary op shader templates

* Fix Vulkan check results

* Enable non-contiguous support for simple ops

* Add argsort

Basic q4_0 mmq shader and unit test

* Speed up q4_0 dequant code, enable mmq for q4_0

* Rework matmul pipeline selection

* Add soft_max alibi support

* Add q4_1, q5_0, q5_1 and q8_0 dequant mat mat mul shaders

* Add environment variable GGML_VK_FORCE_MAX_ALLOCATION_SIZE to limit max buffer size

Rename GGML_VULKAN_DISABLE_F16 to GGML_VK_DISABLE_F16 for consistency
2024-03-08 11:38:33 +02:00
9d9a405cfd fix mul_mat fault in CI/unit-test (llama/5862)
* fix mul_mat fault in cpy_f32_f16

* rm unused function

* add wait() for memcpy

* restore ci/run.sh, rename struct defination, fix bug in ggml_sycl_op_mul_mat_sycl

* fix format issue

* llama : fix segfault from unknown model arch name (llama/5820)

* llama : fix segfault from unknown model arch name

* llama : make all LLM maps const

This also requires using `std::map::at` instead of its `operator[]`
which does not exist for const maps.

* llama : name LLM_ARCH_UNKNOWN to "(unknown)"

This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284

* llama : remove redundant inner const for LLM_TENSOR_NAMES

The extra const won't do anything here as const maps
return const references to values.

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : remove redundant nullptr check in llm_arch_from_string

Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : refactor internal quantization functions (llama/5830)

* scripts : add pod-llama.sh

* ggml : IQ3_S improvements (llama/5829)

* iq3_s: somewhat faster AVX2 dot product

On Ryzen a 7950X TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.

* iq3_s: somewhat faster ARM_NEON dot product

Still dog slow - 10.7 t/s up from 9.9 t/s.

* iq3_s: another small ARM_NEON improvement

10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.

* iq3_s: minor improvement on Metal

49.4 t/s -> 50.3 t/s

* iq3_s: PPL improvement

E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.

* iq3_s: use new grid everywhere

* Fix ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* convert-hf : make model class definitions self-contained (llama/5825)

* convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (llama/5821)

* ggml : fix IQ3_S AVX implementation (llama/5834)

ggml-ci

* llama : add abort_callback to interrupt computation (llama/5409)

* using abort_callback from ggml to stop llama computation

* format fix

* a brief explaining comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: tests: passkey challenge /  self-extend with context shift demo (llama/5832)

* server: tests: add models endpoint scenario

* server: /v1/models add some metadata

* server: tests: add debug field in context before scenario

* server: tests: download model from HF, add batch size

* server: tests: add passkey test

* server: tests: add group attention params

* server: do not truncate prompt tokens if self-extend through group attention is enabled

* server: logs: do not truncate log values

* server: tests - passkey - first good working value of nga

* server: tests: fix server timeout

* server: tests: fix passkey, add doc, fix regex content matching, fix timeout

* server: tests: fix regex content matching

* server: tests: schedule slow tests on master

* server: metrics: fix when no prompt processed

* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1

* server: tests: increase timeout for completion

* server: tests: keep only the PHI-2 test

* server: tests: passkey add a negative test

* flake.lock: Update (llama/5842)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
  → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* server : init http requests thread pool with --parallel if set (llama/5836)

* ci : schedule slow server tests only on Release or on demand (llama/5839)

* llama : fix llama_copy_state_data with fragmented KV cache (llama/5840)

The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.

Existing session files should still work.

* llama : fix llama_kv_cache_cell_max inability to return 1

I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.

* llama : fix state size calculation

Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.

* gguf-dump : support i-quants (llama/5841)

Co-authored-by: Black_Fox <radekliska@gmail.com>

* llama : allow for user specified embedding pooling type (llama/5849)

* allow for user specified pooling type

* llama : use enum types over int

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* readme : add API changes section

* cuda : fix data race in soft max (llama/5853)

* main : support special tokens as reverse/anti prompt (llama/5847)

* Support special tokens as reverse/anti prompt.

* Tokenize antiprompts only once.

* main : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* common : use LLAMA_DEFAULT_SEED (llama/5855)

* add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)

* cuda: fix group_norm

* cuda: add batch inference support for ggml_pad/ggml_upscale

* add ggml_arrange

* add ggml_timestep_embedding

* update ggml_arange/ggml_timestep_embedding tests

* cuda: fix im2col

* add ggml_arange/ggml_timestep_embbeding support for metal backend

* fix some bugs

* fix some bugs

* Update ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* modify according to the review comments

* ggml : fix compile warnings + code style

* ggml : normalize compute_forward calls + fix seg fault in debug

* minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>

* sync : ggml

* add alias for chat template (llama/5858)

* speculative : implement stochastic speculative sampling (llama/5625)

* (WIP) Implement stochastic speculative decoding

* sample from residual distribution on draft accept failure

* fix #5657: force greedy sampling with probs when temp is 0

* remove p_accept parameter

* fix style

* remove unused variables

* add srand() in speculative.cpp

* replace use of rand() with mt19937 sampling

* fixes based on review (@JohannesGaessler)

* fix r random generation

* randomly select next sequence to verify + fix bug in memory freeing

* fix bug in active_seqs sync

* fix uniform int distribution initialization

* remove warnings from comparison between int and size_t

* check grammar in `llama_sample_probability_distribution_impl`

* remove malloc code by utilizing vectors

* add PR link to README

* cmake : handle cases where git index is not found in .git (llama/5844)

* Update CMakeLists.txt

* Update CMakeLists.txt

* ggml : introduce ggml_status (ggml/750)

* using enum as an exit code instead of macros

* update return type from enum to unsigned int

* indentation fix

* compound update
ggml_compute_exit_code -> ggml_status
changed ggml_status from a bit-field type to simple codes
ggml_status to string cast

* ggml_status to string cast

* GGML_CALL was removed

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* sync : ggml

ggml-ci

* ggml : fix unknown status (llama/0)

* flake : fix

* llama : fix embeddings (llama/5796)

* llama : fix embeddings

ggml-ci

* llama : do not use KV cache for non-causal models

ggml-ci

* embeddings : fix llama_batch_init arg

* llama : add pooling switch

* llama : distinguish token vs sequence embeddings

ggml-ci

* llama : assert pooling tensor

* llama : simplify causal mask condition

ggml-ci

* llama : assert input batch with pooling enabled

* readme : update API changes list

* nix: static build (llama/5814)

* fix speculative decoding build on windows (llama/5874)

* rebase and rm tailing space

---------

Co-authored-by: LiangtaoJin <liang-tao.jin@intel.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>
Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Nindaleth <Nindaleth@users.noreply.github.com>
Co-authored-by: Black_Fox <radekliska@gmail.com>
Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
Co-authored-by: Minsoo Cheong <54794500+mscheong01@users.noreply.github.com>
Co-authored-by: Dane Madsen <dane_madsen@hotmail.com>
Co-authored-by: hutli <6594598+hutli@users.noreply.github.com>
Co-authored-by: Jeffrey Quesnelle <emozilla@nousresearch.com>
2024-03-08 11:38:32 +02:00
edd8b38a75 ggml : fix unknown status (llama/0) 2024-03-08 11:38:32 +02:00
ed76818700 whisper : fix compute helper return (ggml/750) 2024-03-08 11:38:32 +02:00
9a0b59d990 ggml : introduce ggml_status (ggml/750)
* using enum as an exit code instead of macros

* update return type from enum to unsigned int

* indentation fix

* compound update
ggml_compute_exit_code -> ggml_status
changed ggml_status from a bit-field type to simple codes
ggml_status to string cast

* ggml_status to string cast

* GGML_CALL was removed

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-08 11:38:32 +02:00
93a84a143b cuda : fix data race in soft max (llama/5853) 2024-03-08 11:38:32 +02:00
bd26876267 ggml : fix IQ3_S AVX implementation (llama/5834)
ggml-ci
2024-03-08 11:38:32 +02:00
21d295180d ggml : IQ3_S improvements (llama/5829)
* iq3_s: somewhat faster AVX2 dot product

On Ryzen a 7950X TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.

* iq3_s: somewhat faster ARM_NEON dot product

Still dog slow - 10.7 t/s up from 9.9 t/s.

* iq3_s: another small ARM_NEON improvement

10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.

* iq3_s: minor improvement on Metal

49.4 t/s -> 50.3 t/s

* iq3_s: PPL improvement

E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.

* iq3_s: use new grid everywhere

* Fix ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-08 11:38:32 +02:00
c3bfc9bfda Support multiple GPUs (split mode) on SYCL backend (llama/5806)
* suport multiple cards: split-mode - layer|row

* rm warning

* rebase with master, support tow new OPs, close feature for -sm=row, fix for unit test

* update news

* fix merge error

* update according to review comments
2024-03-08 11:38:32 +02:00
422a6b16fc ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (llama/5813) 2024-03-08 11:38:32 +02:00
11dd0d4482 Use batched mul_mat pathway (llama/5591)
* Use batched mul_mat pathway

* rm extra line

* Explicitly state scaled data type

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-03-08 11:38:31 +02:00
Eve
26dd2f06ac make portability_enumeration_ext apple only (llama/5757) 2024-03-08 11:38:31 +02:00
8cee7c08b6 add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)
* cuda: fix group_norm

* cuda: add batch inference support for ggml_pad/ggml_upscale

* add ggml_arrange

* add ggml_timestep_embedding

* update ggml_arange/ggml_timestep_embedding tests

* cuda: fix im2col

* add ggml_arange/ggml_timestep_embbeding support for metal backend

* fix some bugs

* fix some bugs

* Update ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* modify according to the review comments

* ggml : fix compile warnings + code style

* ggml : normalize compute_forward calls + fix seg fault in debug

* minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-03-08 11:38:31 +02:00
2e2626b167 examples : Auto lowercase language parameter in main.cpp (#1928)
* Auto lowercase language parameter

* Update examples/main/main.cpp

Co-authored-by: bobqianic <129547291+bobqianic@users.noreply.github.com>

---------

Co-authored-by: bobqianic <129547291+bobqianic@users.noreply.github.com>
2024-03-06 22:25:10 +00:00
c0c0ae2dea examples : fix typo in bench.cpp (#1933) 2024-03-06 22:21:44 +00:00
897412b5b6 whisper : fix typo (#1925) 2024-03-05 17:06:31 +02:00
f22d27a385 whisper.android.java : fix returns in JNI (#1929) 2024-03-05 15:59:26 +02:00
ccd7c1d2da cmake : add library versioning (#1352)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-04 21:17:48 +02:00
c713eb5e2a readme : recommend MacOS Sonoma for Core ML (#1917) 2024-03-04 21:16:13 +02:00
25d313b38b talk-llama : sync llama.cpp 2024-02-28 13:04:05 +02:00
3168dbf23b sync : ggml 2024-02-28 13:01:33 +02:00
1711bb3881 sync : llama.cpp (ggml/0) 2024-02-28 13:00:30 +02:00
2533305596 ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (llama/5760)
* WIP: make i-quants work for QK_K = 64

* iq2_xs: attempt to fix AVX dot product for QK_K = 64

Tests pass, but I get gibberish.

* QK_K = 64 tests pass on ARM_NEON and Metal

Sadly, that does not mean it actually works.

* Make CUDA compile with QK_K = 64

Tests don't pass, plus we get misaligned access

* Q2_K: fixed bug in imatrix quantization for QK_K = 64

* iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-28 13:00:30 +02:00
0eca512ac8 Attempt to fix android build (llama/5752)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-28 13:00:30 +02:00
013e394a4b IQ4_XS: a 4.25 bpw quantization (llama/5747)
* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-28 13:00:29 +02:00
d83f371b5f cuda : replace remaining shfl_xor with calls to warp_reduce functions (llama/5744) 2024-02-28 13:00:29 +02:00
1c71816eab ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (llama/5742) 2024-02-28 13:00:29 +02:00
7b1d8ea7e0 Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (llama/5721)
* Adding IQ2_S and IQ2_M as a single cumulative commit

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-28 13:00:29 +02:00
b1f7223a0a CUDA: fix DEBUG_CUDA_MALLOC (llama/5729) 2024-02-28 13:00:29 +02:00
8408a4be8e Add support for soft_max ALiBi (llama/5639)
* Add support for bias

* Update pre-processor

* rm commented code

* fix format

* fix CI

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-28 13:00:29 +02:00
72849c24ba ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (llama/5711)
* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility

vqtbl1q_u8 is not part of arm v7 neon library

* [android-example] Remove abi filter after arm v7a fix

* [github-workflows] Do not skip Android armeabi-v7a build
2024-02-28 13:00:28 +02:00
c19c28be71 add google magika inference example (ggml/748)
* add magika inference example

* ggml : fix unaligned accesses in custom ops

* ggml : fix FP32 GELU for values that exceed the FP16 range

* use ggml_pool_1d

* add README

* Update README.md

* pad inputs if the files are too small

* cleanup

ggml-ci
2024-02-28 13:00:28 +02:00
0d8fd8483a stream.wasm : fix invalid memory access when no segments (#1902)
No segments may be returned when a smaller sample buffer (EG 2048 samples) is sent to the worker.
2024-02-26 10:12:35 +02:00
3170841ed9 talk-llama : sync llama.cpp 2024-02-25 20:00:10 +02:00
7a6e385c1b sync : ggml 2024-02-25 19:59:34 +02:00
578e47e70c sync : llama.cpp (ggml/0) 2024-02-25 19:58:46 +02:00
fac5b43830 code : normalize enum names (llama/5697)
* coda : normalize enum names

ggml-ci

* code : cont

* code : cont
2024-02-25 19:58:46 +02:00
9e7c5212a1 IQ3_S: a much better alternative to Q3_K (llama/5676)
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* Resurrecting iq3_xs

After all the experimentation, nothing was better than this.

* Minor PPL improvement via a block scale fudge factor

* Minor improvement via 3 neighbours

* iq3_xs: working scalar and AVX2 dot products

* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)

* iq3_xs: working Metal implementation

* Adding IQ3_M - IQ3_XS mix with mostly Q4_K

* iiq3_xs: a 3.4375 bpw variant

* iq3_xs: make CUDA work for new version

* iq3_xs: make scalar and AVX2 work for new version

* iq3_s: make ARM_NEON work with new version

* iq3_xs: make new version work on metal

Performance is very similar to Q3_K_S

* iq3_xs: tiny Metal speed improvement

* iq3_xs: tiny Metal speed improvement

* Fix stupid warning

* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS

* iq3_xs: rename to iq3_s

* iq3_s: make tests pass

* Move Q3_K_XS mix to 3.25 bpw

* Attempt to fix failing tests

* Another attempt to fix the Windows builds

* Attempt to fix ROCm

* ROCm again

* iq3_s: partial fix for QK_K = 64

* iq3_s: make it work on metal for QK_K = 64

Pleasent surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.

* Will this fix ROCm?

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-25 19:58:46 +02:00
1cb64f7368 Introduce backend GUIDs (ggml/743)
* Introduce backend GUIDs

Initial proposed implementation of backend GUIDs
(Discussed in https://github.com/ggerganov/ggml/pull/741)

Hardcoded CPU backend GUID (for now)
Change ggml_backend_is_cpu logic to use GUID

* Remove redundant functions

Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion

* Add spaces to match style

Co-authored-by: slaren <slarengh@gmail.com>

* Fix brace style to match

Co-authored-by: slaren <slarengh@gmail.com>

* Add void to () in function signature

Co-authored-by: slaren <slarengh@gmail.com>

* Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid

* add guids to all backends

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-25 19:58:45 +02:00
f18738f247 talk, talk-llama : pass text_to_speak as a file (#1865)
* talk-llama: pass file instead of arg

it is too hard to quote text in a portable way

* talk-llama: pass heard_ok as a file

* talk-llama: let eleven-labs.py accept options

Options: -v voice, -s savefile, -p (--play)

* talk-llama: check installed commands in "speak"

Pass "-q" to eleven-labs.py to skip checking whether elevenlabs is installed

* talk-llama: pass voice_id again

in order to sync talk with talk-llama

* talk: sync with talk-llama

Passing text_to_speak as a file is safer and more portable
cf. https://stackoverflow.com/a/59036879/45375

* talk and talk-llama: get all installed voices in speak.ps1

* talk and talk-llama: get voices from api

* talk and talk-llama: add more options to eleven-labs.py

and remove DEFAULT_VOICE because it is deprecated (https://www.reddit.com/r/ElevenLabs/comments/1830abt/what_happened_to_bella/)

```
usage: eleven-labs.py [-q] [-l] [-h] [-n NAME | -v NUMBER] [-f KEY=VAL] [-s FILE | -p] [TEXTFILE]

options:
  -q, --quick           skip checking the required library

action:
  TEXTFILE              read the text file (default: stdin)
  -l, --list            show the list of voices and exit
  -h, --help            show this help and exit

voice selection:
  -n NAME, --name NAME  get a voice object by name (default: Arnold)
  -v NUMBER, --voice NUMBER
                        get a voice object by number (see --list)
  -f KEY=VAL, --filter KEY=VAL
                        filter voices by labels (default: "use case=narration")
                        this option can be used multiple times
                        filtering will be disabled if the first -f has no "=" (e.g. -f "any")

output:
  -s FILE, --save FILE  save the TTS to a file (default: audio.mp3)
  -p, --play            play the TTS with ffplay
```

* examples: add speak_with_file()

as suggested in the review

* talk and talk-llama: ignore to_speak.txt
2024-02-24 09:24:47 +02:00
a0ddd8392c whisper : add SYCL support (#1863)
* add changes from llama upstream

* add sycl abstraction

* add sycl build

* update cmake

* add sycl build config

* fix bug

* fix bug

* refactor build

* fix bug

* update build

* call build

* use sycl header

* add examples

* add target

* fix typecast in quant.c

* readd fp16 and readme

* fix quant typecast

* add sample

* add readme

* remove cxx file check
2024-02-23 09:22:24 +02:00
a2506909b1 talk-llama : sync llama.cpp 2024-02-22 23:30:53 +02:00
7b1ff212d9 sync : ggml 2024-02-22 23:25:38 +02:00
e5d06cfc0f ggml : always define ggml_fp16_t as uint16_t (llama/5666)
* ggml : always define ggml_fp16_t as uint16_t

ggml-ci

* ggml : cont

ggml-ci

* ggml : cont

* ggml : cont

ggml-ci

* ggml : cont

ggml-ci

* cuda : no longer ggml headers last

ggml-ci

* ggml : fix q6_K FP16 -> FP32 conversion

ggml-ci

* ggml : more FP16 -> FP32 conversion fixes

ggml-ci
2024-02-22 23:25:33 +02:00
31891db2e3 ci : fix whitespace 2024-02-22 20:20:34 +02:00
5fdb27ff80 ggml : 32-bit arm compat (#1891)
* ggml : 32-bit arm compat

* ggml : add ggml_vqtbl1q_s8 impl

* ggml : cont
2024-02-22 18:31:40 +02:00
6b16927d18 sync : ggml 2024-02-22 15:15:38 +02:00
ce411498f6 sync : llama.cpp (ggml/0)
ggml-ci
2024-02-22 15:12:36 +02:00
208de95ac7 conext add name (llama/5624)
* [SYCL] conext add name

* name should start with SYCL*
2024-02-22 15:12:36 +02:00
c2ce39c795 Update ggml_sycl_op_mul_mat_vec_q (llama/5502)
* Update ggml_sycl_op_mul_mat_vec_q

* Apply suggestions from code review

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* revert suggestion on macro

* fix bug

* Add quant type GGML_TYPE_IQ1_S to unsupported

* fix format

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-22 15:12:36 +02:00
8daa534818 Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init() 2024-02-22 15:12:36 +02:00
9fca69b410 Add check for VK_KHR_portability_enumeration for MoltenVK support 2024-02-22 15:12:36 +02:00
b26c645420 Add preprocessor checks for Apple devices.
Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files
2024-02-22 15:12:36 +02:00
1879ec556e Resolve ErrorIncompatibleDriver with Vulkan on MacOS.
Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476
2024-02-22 15:12:35 +02:00
c6e53cfc46 Allow for Vulkan build with Accelerate.
Closes #5304
2024-02-22 15:12:35 +02:00
b19f2fb815 cuda : ignore peer access already enabled errors (llama/5597)
* cuda : ignore peer access already enabled errors

* fix hip
2024-02-22 15:12:35 +02:00
a6b0950916 ggml : compute forward no longer pass src tensors (ggml/729)
* refactored compute forward to not pass in the src tensors each time

* fix merge issues with flags

* missed one place in the last commit to fix the is_param / flags issue

* minor spacing fix

* fixed some variable assignments so all tests locally are passing

* new change after merge fix

---------

Co-authored-by: siddharthvader <siddharth@coinlist.co>
2024-02-22 15:12:35 +02:00
d352dbd163 ggml : fix conv_2d batch mode (ggml/737)
Co-authored-by: bssrdf <bssrdf@gmail.com>
2024-02-22 15:12:32 +02:00
eb23f4ef16 openvino : fix convert-whisper-to-openvino.py (#1890)
Fix issue: Conversion from Whisper to OpenVino failed #1870

convert-whisper-to-openvino.py stopped working with OpenVINO version 2023.0.0-10926-b4452d56304-releases/2023/0 .

Error was: TypeError: load(): incompatible function arguments. The following argument types are supported:
    1. (self: openvino._pyopenvino.FrontEnd, path: object) -> ov::frontend::InputModel

Tested successfully with a large-v3 conversion.

Co-authored-by: Stefan Grundmann <grundmanns@sandiego.gov>
2024-02-22 15:11:35 +02:00
c56344b509 main : fix file existence check in main.cpp (#1889)
In commit dda4b0e of PR #1872, I've introduced a check for the
existence of files before loading the model. However, I haven't
considered the case where whisper.cpp might read from stdin as well,
and in such cases, the checks should ignore the "-" argument as it
does not represent a regular file.

Additionally, this commit removes the usage of 'stat()' in favor of
the recently introduced function 'is_file_exist()' in common.cpp from
PR #1871.

Apologies for the bug introduced in the previous PR and any
inconvenience it may have caused.
2024-02-22 15:01:08 +02:00
59119f4f20 talk-llama : sync llama.cpp 2024-02-20 12:09:57 +02:00
276615d708 make : fix CUBLAS link with WSL (#1878) 2024-02-20 12:05:38 +02:00
b602819b6e sync : ggml 2024-02-19 15:54:25 +02:00
c2c606f05b ggml : resolve merge conflicts (ggml/0)
ggml-ci
2024-02-19 15:53:25 +02:00
83afebe872 common : add IQ1_S (ggml/0)
ggml-ci
2024-02-19 15:53:25 +02:00
a4d8f9d559 ci : enable -Werror for CUDA builds (llama/5579)
* cmake : pass -Werror through -Xcompiler

ggml-ci

* make, cmake : enable CUDA errors on warnings

ggml-ci
2024-02-19 15:53:24 +02:00
5ec1e0edfa cuda, metal : fix nans in soft_max (llama/5574)
* cuda : fix nans in soft_max

* metal : fix nans in soft_max

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-19 15:53:24 +02:00
30a11b1ab8 ggml : android and old glibc NUMA incompatibility bugfixes (llama/5557)
* #ifdef out some code NUMA blocks for Android due to lack of support

* added in some __ANDROID__ if def gates around numa code and forced GLIBC prior to 2.29 to use a syscall for getcpu instead of the wrapper

* Changed gates on numa platform specific stuff to __gnu_linux__ to skip any platforms without glibc

* harmonizing #if defined blocks for numa code to __gnu_linux__ since that's the only model that's being followed anyways

---------

Co-authored-by: root <root@nenya.lothlorien.ca>
2024-02-19 15:53:24 +02:00
f04e6b87d7 ggml : restore vec dot stride arg names (llama/5453) 2024-02-19 15:53:24 +02:00
0c33928b55 ci : fix wikitext url + compile warnings (llama/5569)
ggml-ci
2024-02-19 15:53:24 +02:00
0775374750 metal : fix unused warnings (llama/0) 2024-02-19 15:53:24 +02:00
7d90bb035b ggml, common, examples, tests : fixed type arguments in printf (llama/5528) 2024-02-19 15:53:24 +02:00
2c1ad21ba8 1.5 bit quantization (llama/5453)
* iq1_s: WIP basics

* iq1_s: CUDA is working

* iq1_s: scalar CPU dot product

* iq1_s: WIP AVX2 dot product - something is not right

* Fix tests

* Fix shadow warnings

* Fix after merge with latest master

* iq1_s: AVX2 finally works

* iq1_s: ARM_NEON dot product. Works, but not very fast

* iq1_s: better grid

* iq1_s: use IQ2_XXS for attn_output

At a cost of 0.04 extra bpw this gives a big improvement in PPL.

* iq1_s: Metal basics

Dequantize works, but not dot product

* iq1_s: Metal works, but quite slow

As usual, Apple Silicon does not like the code I write.

* iq1_s: Tests

* iq1_s: slightly faster dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-19 15:53:23 +02:00
eca5ff9868 ggml : add ALiBi support for ggml_soft_max_ext (llama/5488) 2024-02-19 15:53:23 +02:00
1b25d2fa0a ci : add an option to fail on compile warning (llama/3952)
* feat(ci): add an option to fail on compile warning

* Update CMakeLists.txt

* minor : fix compile warnings

ggml-ci

* ggml : fix unreachable code warnings

ggml-ci

* ci : disable fatal warnings for windows, ios and tvos

* ggml : fix strncpy warning

* ci : disable fatal warnings for MPI build

* ci : add fatal warnings to ggml-ci

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-19 15:53:23 +02:00
74a6acc999 cmake : fix VULKAN and ROCm builds (llama/5525)
* cmake : fix VULKAN and ROCm builds

* cmake : fix (cont)

* vulkan : fix compile warnings

ggml-ci

* cmake : fix

ggml-ci

* cmake : minor

ggml-ci
2024-02-19 15:53:23 +02:00
a4ed8a0821 ggml : add numa options (llama/5377)
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverted Makefile

* Fixed include

* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables

* removed trailing whitespace

* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverting Makefile

* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet

* Removing MIRROR_MODE code for this PR

* Removing last bit of MIRROR_MODE code for this PR

* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static

* Fixed lingering init_llama_backend() bool calls in tests and examples

* Remote enum llama_numa_strategies

* Revert bad merge with dynatemp flags

* add missing enum ggml_numa_strategies declaration and revert sync problem with master

* add missing enum ggml_numa_strategies declaration

* fixed ggml_init_numa variable

* Update ggml.h

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges

* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples

* Fix up some boolean vs enum comparisons

* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype

* Update ggml.h

Align enum values

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml.c

Remove whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml.c

align paremeters

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/server.cpp

remove whitespace and align brace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/common.cpp

Remove whitespace and align brace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example

* Update ggml.c

simplified return for platforms without NUMA support

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* removed redundant else from cli argument processing of --numa

* whitespace

---------

Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-19 15:53:23 +02:00
9f675e021c cuda : print message when initialization fails (llama/5512)
* cuda : print message when initialization fails

* use CUDA_NAME both times
2024-02-19 15:53:23 +02:00
a38efcb9fd vulkan: Find optimal memory type but with fallback (llama/5381)
* @0cc4m feedback

* More feedback @0cc4m
2024-02-19 15:53:22 +02:00
AT
31591649a0 Early return for zero size calls to get_tensor. (llama/5482)
* Early return for zero size calls to get_tensor.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

* Update ggml-kompute.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-kompute.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add an early return to the get/set tensor when the size is null.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

* Early return after the assertions.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

* Since we do the early return in the generic backend now no reason to do so here as well.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

---------

Signed-off-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-19 15:53:22 +02:00
4f5c46a84f ggml-quants : fix compiler warnings (shadow variable) (llama/5472)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-19 15:53:22 +02:00
462ffc58db ggml-sycl: Replace 3d ops with macro (llama/5458)
* use macro

* use macro

* fix format
2024-02-19 15:53:21 +02:00
65faae0b6a build : update CBLAS flags + fix unused var warning (#0) 2024-02-19 14:44:46 +02:00
dda4b0ed06 main : check if input files exist before proceeding (#1872)
Until the most recent commit (3d42463), the main.cpp sample file does
not check whether the input files exist or not. Consequently, the
model is loaded first before reporting whether there was a failure or
not when processing a file. In environments with HDD, this can take
about 50 seconds or more, depending on the loaded model.

This commit addresses this issue by checking in advance whether the
input files exist or not.
2024-02-19 10:51:26 +02:00
07d04280be examples : clean up common code (#1871)
move some utility functions into common.h
2024-02-19 10:50:15 +02:00
917c56ded4 models : fix openvino setup info (#1874) 2024-02-19 02:19:47 +00:00
3d42463845 models : add update py requirements 2024-02-13 11:51:32 +02:00
3ffc83d90a swift : package no longer use ggml dependency (#1861)
* Revert "swift : update Package.swift to use ggml as package dependency (#1701)"

This reverts commit 993acb5d41.

* spm : add ggml.h
2024-02-12 19:54:11 +02:00
e3c5e2cba8 whisper : fix external encoder (#1860) 2024-02-12 19:53:51 +02:00
b742f13e70 sync : ggml 2024-02-12 19:07:56 +02:00
52c529eeb1 ggml-alloc : allocate all leafs as if they were inputs (ggml/731)
* ggml-alloc : allocate all leafs as if they were inputs

* ensure static leafs are allocated

* gpt-2-backend : remove unnecesary ggml_new_tensor

* update other gpt-2 examples to remove ggml_new_tensor calls in the graph
2024-02-12 19:07:38 +02:00
551529290d talk-llama : sync llama.cpp 2024-02-12 10:39:58 +02:00
25a90ffa38 sync : ggml 2024-02-12 09:32:15 +02:00
866b67ca93 ggml-backend : sync remnant 2024-02-12 09:31:12 +02:00
d7e9f58f7f CUDA: mul_mat_vec_q tiling, refactor mul mat logic (llama/5434)
* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-12 09:31:12 +02:00
04839bae22 vulkan: only use M-sized matmul on Apple GPUs (llama/5412)
* vulkan: refactor guess_matmul_pipeline for vendor

Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.

Signed-off-by: Sergio Lopez <slp@redhat.com>

* vulkan: only use M-sized matmul on Apple GPUs

L-sized and S-sized matmuls are broken on Apple GPUs, force using
M-size with this vendor.

Signed-off-by: Sergio Lopez <slp@redhat.com>

---------

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-02-12 09:31:12 +02:00
3cc6e04a52 ggml : fix compile warnings (unused vars) (llama/4966) 2024-02-12 09:31:11 +02:00
b7ef178b9c ggml : add mmla kernels for quantized GEMM (llama/4966)
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info
2024-02-12 09:31:11 +02:00
47dfe9d4db metal : use autoreleasepool to avoid memory leaks (llama/5437)
There appears to be a known memory leak when using the
`MLTCommandBuffer`. It is suggested to use `@autoreleasepool` in
[1,2]

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change-set wraps the `ggml_metal_graph_compute` in a
`@autoreleasepool`.

This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436
2024-02-12 09:31:11 +02:00
1d3270cc8f ggml-alloc : v3 (ggml/727)
* ggml-alloc v3

ggml-ci

* fix ci

ggml-ci

* whisper : check for backend buffer allocation failures

* whisper : avoid leaks when initialization fails

* cleanup

ggml-ci

* style fixes

ggml-ci
2024-02-12 09:31:11 +02:00
a6fb6ab597 examples : added audio_ctx argument to main and server (#1857)
* added audio_ctx argument to main and server examples

* Better default value

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* better default value (again)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-12 09:19:07 +02:00
163e74b6c3 metal : option to embed MSL source into compiled binary (#1842)
* ggml : embed Metal library source (ggml-metal.metal) into binary

enable by setting WHISPER_EMBED_METAL_LIBRARY

* rename the build option

* rename the preprocessor directive

* generate Metal library embedding assembly on-fly during build process
2024-02-11 16:41:41 +02:00
f273e66dc6 examples : initialize context params properly (#1852) 2024-02-11 16:39:12 +02:00
02b4c52c12 talk-llama : sync llama.cpp 2024-02-10 10:10:59 +02:00
518199c09e sync : ggml 2024-02-10 09:56:47 +02:00
8b17a2f776 src : relocate new backend sources 2024-02-10 09:55:47 +02:00
b6d2827914 ggml : fix error C2078: too many initializers for MSVC ARM64 (llama/5404) 2024-02-10 09:55:47 +02:00
9711bae0b3 CUDA: more warps for mmvq on NVIDIA (llama/5394) 2024-02-10 09:55:47 +02:00
eec38f63bd CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (llama/5386) 2024-02-10 09:55:47 +02:00
ef5e6b746f Basic Vulkan Multi-GPU implementation (llama/5321)
* Initial Vulkan multi-gpu implementation

Move most global variables into backend context

* Add names to backend device functions

* Add further missing cleanup code

* Reduce code duplication in tensor split layer assignment

* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h

* Only do device info print in the beginning and initialize one backend for cpu assist

Add missing cleanup code

* Rework backend memory management to make sure devices and buffers get properly allocated and freed

* Rename cpu assist free function

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-10 09:55:47 +02:00
77bf6b5f56 CUDA: mul_mat_vec_q max. batch size 8 -> 4 (llama/5370) 2024-02-10 09:55:47 +02:00
b562fff9d0 Slight quantization improvement for Q4_K and Q5_K (llama/5361)
* Q4_K: slightly better quantization

* Q5_K: slightly better quantization

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-10 09:55:47 +02:00
b5dec374f4 CUDA: mul_mat_vec_q for batch sizes > 1 (llama/5351) 2024-02-10 09:55:47 +02:00
fa0dc6167c ggml : make use of ggml-quants.h possible in C++ code (llama/5338)
* Make use of ggml-quants.h possible in C++ code

* One cannot possibly be defining static_assert in a C++ compilation

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-10 09:55:47 +02:00
55bcd62a4b ggml : avoid duplicating function calls using MIN/MAX macros (llama/5325)
* Avoid duplicating function calls when using MIN/MAX macros.

Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice.
By explicitly evaluating at the expression we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:

https://godbolt.org/z/Ee4KMrvKh

Code behaves exactly the same.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-10 09:55:46 +02:00
0ed762d691 iq2_xxs: tune quantization (llama/5320)
We get slightly better PPL, and we cut quantization time in
nearly half.

The trick is to 1st quantize without forcing points onto the E8-lattice.
We can then use a narrower search range around the block scale that we
got that way.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-10 09:55:46 +02:00
1b5bb7792e cuda : fix LLAMA_CUDA_F16 (llama/5262) 2024-02-10 09:55:46 +02:00
9b735cea77 metal : add im2col F32 dst support (llama/5132) 2024-02-10 09:55:46 +02:00
12c462d656 llava : add MobileVLM support (llama/5132)
* New Feature:
    1. Sum_Rows:
        fix cuda kernel overflow
        fix block shape error when nrows too big
    2. Im2Col:
        Support Batch in cuda
        Support f32 to f32 both in cpu && cuda
    3. DepthWiseConv:
        Support by Im2Col && MulMat
    4. Pool_2d:
        Supoort avg pooling in cuda
    5. HardSigmoid:
        Imp in cuda
    6. HardSwish:
        Imp in cuda

* fix tabs instead of spaces

* code clean

* CUDA POOL2D

* ADD POOL2D test case in test-backend-ops.cpp

* code clean

* fix pool2d_kernel

nits

* fix bug in pool2d kernel

* fix avg pooling, count_include_pad

nits

* test-backend-ops : add more pool_2d tests

* cuda : fix warnings and formatting

* ggml : check types in release builds too in pool_2d

* test-backend-ops : remove f16 pool_2d tests

* cuda : more style fixes

* Add assert in ggml_cuda_op_pool2d

* pool2d float padding fallback

* test-backend-ops : add dst_type to im2col

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-10 09:55:46 +02:00
fc7b0e2c28 ggml : limit n_threads to the max n_tasks (llama/5238) 2024-02-10 09:55:46 +02:00
f850a067ed kompute : llama-bench support and ggml_cpu_has_kompute() (llama/5226) 2024-02-10 09:55:46 +02:00
f75e1197f1 ggml : add abort_callback for cpu backend (ggml/725)
* a way to use abort_callback with the cpu backend

* whisper update
2024-02-10 09:55:46 +02:00
aa8a75e287 extra : update sync scripts 2024-02-10 09:55:19 +02:00
80e8a2ea39 server : allow CORS request with authorization headers (#1850)
Whisper plugin in Obsidian requires an API key which is
then sent as an authorization header.
However, the presence of an authorization header requires
a CORS Preflight, so both the OPTIONS method and
the Access-Control-Allow-Headers: authorization must be
handled.
2024-02-09 17:42:41 +02:00
19f8048139 whisper.android : how to build with CLBlast (#1809)
* FetchContent

* OpenCL

* Documentation and make optional

* Specify GGML build options in build.gradle

* Use gradle properties

* @ggerganov

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* @gpokat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-09 17:39:05 +02:00
0f80e5a80a whisper : expose CUDA device setting in public API (#1840)
* Makefile : allow to override CUDA_ARCH_FLAG

* whisper : allow to select GPU (CUDA) device from public API
2024-02-09 17:27:47 +02:00
b6559333ff make : add macOS deployment target option (#1839) 2024-02-09 17:26:29 +02:00
434b8f3b96 talk-llama : stream response (#1121) 2024-02-06 19:56:12 +02:00
7a74e929c8 sync : ggml (#0) 2024-01-30 21:30:26 +02:00
361ecebe90 ggml : fix IQ3_XXS on Metal (llama/5219)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30 21:28:00 +02:00
807cbc672e sync : ggml (llama/0) 2024-01-30 21:27:59 +02:00
98ae5276b7 Faster AVX2 dot product for IQ2_XS (llama/5187)
* iq2xs: faster AVX2 dot product

* iq2xs: small AVX2 imrovement

* Speed up computing sign bits in AVX2 iq2_xs dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Peter Reid <peter@peterreid.net>
2024-01-30 21:27:59 +02:00
6adb969b09 SOTA 3-bit quants (llama/5196)
* iq3_xxs: quantize/dequantize

RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent, ARM_NEON is pathetic

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy did find an actual bug
in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30 21:27:59 +02:00
8a7d6ff51a ggml alloc: Fix for null dereference on alloc failure (llama/5200)
* Fix for a null pointer dereference if a metal GGML buffer fails to be allocated

* Freeing the allocated buffers rather than the pointer in ggml-alloc.c

* Fixed the fix of the fix
2024-01-30 21:27:59 +02:00
25f650a8e8 Nomic Vulkan backend (llama/4456)
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: niansa <anton-sa@web.de>
Co-authored-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Aaron Miller <apage43@ninjawhale.com>
Co-authored-by: ToKiNoBug <tokinobug@163.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-01-30 21:27:59 +02:00
44e517f074 ggml : add max buffer sizes to opencl and metal backends (llama/5181) 2024-01-30 21:27:59 +02:00
cb9de61659 metal : free metal objects (llama/5161)
* Releasing MTLFunction references after Metal pipeline construction

* Keeping the `ggml_metal_kernel` structure

* Spacing fix

* Whitespace fix
2024-01-30 21:27:59 +02:00
a2ef80d66f gguf : fix comparison (ggml/715)
ggml-ci
2024-01-30 21:27:59 +02:00
baa190446a ggml_cuda_cpy support for 4d tensors and float16->float32 upcasting (ggml/686)
* added cuda float16->float32 upcasting to ggml_cuda_cpy

* added ability to copy 4d tensors with the cuda backend

* added tests for float16_>float32 upcast and 4d tensor cuda copys

* added 4d copy test for float32->float16 copy

* applied patch suggested by @iamlemec

* simplify cpy tests

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-30 21:27:59 +02:00
8f5220d81f gguf : add input validation, prevent integer overflows (ggml/709)
* gguf : add input validation, prevent integer overflows

ggml-ci

* gguf : fix switch default case

* gguf : sanitize info->n_dims and info->type

ggml-ci

* gguf : assert GGUF_TYPE_SIZE access

ggml-ci

* ggml : assert mallocs are successful

ggml-ci

* gguf : prevent integer overflow

* gguf : sanitize tensor info

ggml-ci

* gguf : stricter limit on the number of items

ggml-ci
2024-01-30 21:27:58 +02:00
8e391fcf3a ci : fix yolo URLs + fix metal capture (ggml/712) 2024-01-30 21:27:58 +02:00
593657054e metal : add debug capture backend function (ggml/694)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-30 21:27:58 +02:00
ae5c4f7340 common : fix wav buffer detection (#1819) 2024-01-30 19:35:08 +02:00
baa30bacdb server : add fields to verbose_json response (#1802)
* server: include additional fields in the verbose_json response as OpenAI does

* server: show request examples on home page

* server: todo note for compression_ratio and no_speech_prob

* server: add simple demo form to the homepage
2024-01-30 14:15:55 +02:00
3e6fad07aa make : update MSYS_NT (#1813)
I just upgraded the R wrapper at https://github.com/bnosac/audio.whisper to use whisper.cpp 1.5.4
I'm working on Windows and noticed while doing that that it did not pick up the relevant CFLAGS/CXXFLAGS as my system showed

```
I whisper.cpp build info: 
I UNAME_S:  MSYS_NT-10.0-19045
I UNAME_P:  unknown
I UNAME_M:  x86_64
```

Many thanks for all the tremendous hard work on maintaining whisper.cpp!
2024-01-30 14:13:49 +02:00
e72e4158de talk-llama : sync llama.cpp 2024-01-28 19:44:10 +02:00
bd41733db2 sync : ggml 2024-01-28 19:30:32 +02:00
23c648e98d ggml : add Vulkan backend (llama/2059)
* Vulkan loader code

* Fix matmul kernel, continue implementation

* Continue implementation

* Vulkan memory management

* Vulkan development

* Matmul call

* Add aligned malloc and free for VMA

* Continue implementation

* First matmul success

* GEMM Kernel optimization

* 1D Blocktiling

* 2D Blocktiling

* Write coalescing

* Continue vulkan implementation and optimization

* First FP16 attempt, disabled for now

* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel

* Enable device extensions properly, restore fp16 matmul op

* Fix mulmat_f16

* Output FP32 in fp16 matmul shader

* Fix f16_to_f32 kernel

* dequant_q4_0 kernel

* Add VMA library

* Avoid requesting dedicated memory, VMA can decide that by itself

* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly

* add cmake commands

* Add 2d write operation, profiling code

* Fix 2d write

* Fix queue selection for AMD RADV

* Fix trailing whitespace in vk_mem_alloc.h

* Add WIP warp tile mat mul shaders

* Disable glslc optimization

* Disable glslc optimization for CMake

* Optimize warptile matmul shader, replace blocktile with it

* Add split-k optimization for small matrix multiplication

Use semaphores for synchronization instead of fences or waitidle

Rework async write/read for synchronization

* Fix validation errors, improve compatibility with AMD GPUs

* Rework command buffer handling

* Variable matmul kernel using specialization constants

* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints

* Reuse semaphores

* Handle stage flags during command buffer submission properly

* Increase matmul test runs for consistent results

* Fix F32 matmul

* Add vectorized loading and zeropadding for matrix multiplication

* Use pinned memory for f16 preprocessing

* Don't force aligned matmul

* Don't free before queue done

* Replace VMA library with native Vulkan buffer management

* Basic offloading support with mul_f32 and dmmv for q4_0

* Run glslc commands in parallel

* Unroll loops in dmmv shader

* Reduce usage of waitIdle

* Reuse pinned allocation for f16 conversion

* Handle devices with only a single queue

* Fix trailing whitespace in CMakeLists.txt

* Allow parallel execution of kernels, parallelize third and fourth dimension calls

* Add fallback for devices only supporting one DescriptorSet per DescriptorPool

* Move to graph function similar to CUDA implementation

* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function

* Add F32 dmmv shaders

* Batch submissions

* Add .spv to gitignore

* Split off matrix vector multiplication for separate optimization

* Use single command buffer for matrix vector multiplication ops

* Reduce overhead of mul_f32 calls by using a single command buffer

* Add submission batching to mul_f32

* Fix tests

* Add missing barrier

* Add further missing barrier

* Add further ops

* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions

* Remove unnecessary cblas link

* Fix descriptor set pre-allocation assert

* Add runtime shader compilation, start transferring shaders to this approach

* Transfer remaining shaders to header and compile on runtime

* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16

* Add support for q4_1, q5_0, q5_1 and q8_0

* Remove unnecessary scalar layout extension

* Parse graph early to pre-record command buffers

* Add q6_k support

* Add multi-submit for command buffers

* Fix q6_k dequant shader for AMD

* Fix q6_k for GPUs without fp16 support

* Simplify q6_k fp16 fix

* Minor fixes

* Fix wg_denom of m-mulmat shaders

* Add Python-based Vulkan shader generator

* Replace shaderc dependency with precompiled shaders

Fix python script to generate shaders

* Clean up code

* Fix shader generator script Windows compatibility

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>

* Close file before deletion

* Fix vulkan shader fp32 name

* Add q2_k and q3_k support

Add validation check to compare shader results to cpu results

* Add q4_k support

* Add q5_k support

* Bake SPIR-V bytecode into the library instead of loading shaders from file

* Switch to signal semaphores for flexibility

Prepare broadcasting support for mul mat

* Finish broadcasting mul mat support for GQA

* Clean up unused functions

Add repeat op

* Add further ops, not yet enabled. Improve semaphore code

* Reduce number of used semaphores by utilizing timelines more properly

* Remove queue information

* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations

* Add Vulkan to llama-bench

* Remove cblas dependency

* Fix matmul k-split bug

* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader

* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug

* Fix issues with float16 overflows in shaders

* Fix issues with older Vulkan headers on Ubuntu 22.04

* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers

* Implement further ops, rework op_f32 calls, fix bugs

* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code

* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders

* Merge upstream changes, fix conflicts, adapt soft_max op

* Fix Python and shader header format

* Free model gpu buffers on exit

* Use single queue per device to simplify code

* Add matmul shader support for running multiple calculations in parallel

* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible

* Fix missing event cast

* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity

* Fix warning about empty C function parameters

* Fix compiler warnings

* Properly implement Vulkan backend buffer handling

* Fix oversized host staging buffers

* Simplify barrier synchronization calls

* Fix gcc warnings

* Implement max_size for backend buffer types to limit the size of a single allocation

* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size

* refactor multi buf

* Disable unsupported ops to fix tests

* Check for maintenance4 support before using it

* Handle devices with only a single queue

* Fix single queue logic

* propagate buffer usage in multi buffers

* Implement rope_neox op

* Cleanup header and other files

* Simplify gpu_extras by removing events and putting staging memcpys into contexts

* Move queue into context

Add not-yet-enabled async backend ops

* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization

* Add get_max_size to SYCL backend.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 19:30:20 +02:00
75ab2d06f5 ggml : add unified SYCL backend for Intel GPUs (llama/2690)
* first update for migration

* update init_cublas

* add debug functio, commit all help code

* step 1

* step 2

* step3 add fp16, slower 31->28

* add GGML_LIST_DEVICE function

* step 5 format device and print

* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue

* support main device is non-zero

* step7 add debug for code path, rm log

* step 8, rename all macro & func from cuda by sycl

* fix error of select non-zero device, format device list

* ren ggml-sycl.hpp -> ggml-sycl.h

* clear CMAKE to rm unused lib and options

* correct queue: rm dtct:get_queue

* add print tensor function to debug

* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481

* summary dpct definition in one header file to replace folder:dpct

* refactor device log

* mv dpct definition from folder dpct to ggml-sycl.h

* update readme, refactor build script

* fix build with sycl

* set nthread=1 when sycl, increase performance

* add run script, comment debug code

* add ls-sycl-device tool

* add ls-sycl-device, rm unused files

* rm rear space

* dos2unix

* Update README_sycl.md

* fix return type

* remove sycl version from include path

* restore rm code to fix hang issue

* add syc and link for sycl readme

* rm original sycl code before refactor

* fix code err

* add know issue for pvc hang issue

* enable SYCL_F16 support

* align pr4766

* check for sycl blas, better performance

* cleanup 1

* remove extra endif

* add build&run script, clean CMakefile, update guide by review comments

* rename macro to intel hardware

* editor config format

* format fixes

* format fixes

* editor format fix

* Remove unused headers

* skip build sycl tool for other code path

* replace tab by space

* fix blas matmul function

* fix mac build

* restore hip dependency

* fix conflict

* ren as review comments

* mv internal function to .cpp file

* export funciton print_sycl_devices(), mv class dpct definition to source file

* update CI/action for sycl code, fix CI error of repeat/dup

* fix action ID format issue

* rm unused strategy

* enable llama_f16 in ci

* fix conflict

* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml

* fix ci cases for unsupported data type

* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL

* revert hip cmake changes

* fix indent

* add prefix in func name

* revert no mmq

* rm cpu blas duplicate

* fix no_new_line

* fix src1->type==F16 bug.

* pass batch offset for F16 src1

* fix batch error

* fix wrong code

* revert sycl checking in test-sampling

* pass void as arguments of ggml_backend_sycl_print_sycl_devices

* remove extra blank line in test-sampling

* revert setting n_threads in sycl

* implement std::isinf for icpx with fast math.

* Update ci/run.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add copyright and MIT license declare

* update the cmd example

---------

Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 19:30:20 +02:00
adc099edee ggml : minor type fix (int64_t -> size_t) 2024-01-28 19:30:17 +02:00
52cce82493 common : fix input buffer check (#1812) 2024-01-27 17:33:09 +02:00
ef3c9ed9eb talk-llama : sync llama.cpp 2024-01-27 17:24:53 +02:00
7fe3ed5e00 sync : ggml 2024-01-27 17:23:25 +02:00
6061241292 Add OpenCL add kernel (llama/5151)
* Add OpenCL add kernel

* Put add kernel into different string to stay within MSVC string length limit, disable float16 support due to bad results
2024-01-27 17:19:52 +02:00
0878ab7c15 cuda : fix tensor size calculation for non-split buffer (llama/5145) 2024-01-27 17:19:52 +02:00
c65edd5b64 ggml-alloc : add 10% margin to the buffer sizes (llama/5149) 2024-01-27 17:19:52 +02:00
3c8d14e9c5 ggml : update softmax n_task calculation (llama/5126)
updated the n_task calculation to use max number of
threads possible. This has improved the prompt eval
performance by around 5% for DOT kernels and by
around 10% for MMLA kernels on AWS Graviton3.
2024-01-27 17:19:52 +02:00
c3977cb2ce metal : remove unused n_buffers and buffers (llama/5129) 2024-01-27 17:19:52 +02:00
6da1661bc2 metal : show compile log messages 2024-01-27 17:19:51 +02:00
cc56540661 cuda : fix 2-bit quants on amd hip (llama/5105)
* cuda : fix 2-bit quants on amd hip

* use __low2float intrinsic function for new quants
2024-01-27 17:19:51 +02:00
94c1ae8668 llama : pre-allocate input tensors in a separate buffer (llama/5100) 2024-01-27 17:19:51 +02:00
55d54359e0 metal : disable support for MUL_MAT F32 x F16 2024-01-27 17:19:51 +02:00
d33c2ad354 CUDA: more info when no device code (llama/5088) 2024-01-27 17:19:51 +02:00
9afa7ff624 minor : clean-up some warnings and style (llama/5094)
* minor : clean-up some warnings and style

ggml-ci

* ggml : add comment
2024-01-27 17:19:51 +02:00
0649289f02 ggml : parallelize FP32 conversion when using BLAS (llama/5045)
* make GGML_TASK_INIT phase can be run in multithread

* multithreaded dequantize in mul_mat when using blas library

* minor fixes

* update outdated comment
* fix coding style

* simplify code

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-27 17:19:51 +02:00
aaeaa43878 llava : MobileVLM support (llama/4954)
* MobileVLM native implementation

* delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake

* move android script to example/llava directory

* Fix the editor config checks

---------

Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
2024-01-27 17:19:51 +02:00
078b8e23bf llama : run all KQV ops on the CPU with no KV offload (llama/5049)
ggml-ci
2024-01-27 17:19:51 +02:00
74da3e1757 cuda : fix compile error in jetson platform (llama/4975)
* cuda: fix compile error in jetson platform

* cuda: update comment in ggml-cuda.cu

* cuda: update ggml-cuda.cu comment
2024-01-27 17:19:50 +02:00
2d2c93a798 ggml : check ggml_add src1 type (ggml/708)
Co-authored-by: Judd <foldl@boxvest.com>
2024-01-27 17:19:50 +02:00
4bbb60efce docs : make model options / model install methods clearer (#1806)
* Make models more "discoverable"

* Clean up code block language identifiers

* make 3 options clearer

* undo Prettier formatter change

* docs: `$` shell prompt, consistently

* docs: minor changes
2024-01-26 17:39:54 +02:00
1cf679dec4 cmake : make libwhisper.so position independent (#1792)
This is similar to how libllama.so is built.

Signed-off-by: Tom Rix <trix@redhat.com>
2024-01-22 15:02:35 +02:00
41026c1e4b cmake : temporary remove VLA check (#1795) 2024-01-22 14:51:42 +02:00
d6b9be21d7 whisper.android : return output from benchmarks (#1785)
Benchmarks are failing because JNI expects a jstring and the benchmarks
are missing a return statement (i.e., returning null). The functions
actually build a jstring but don't return it, so this seems to have been
an oversight.

This patch returns the jstring and now the benchmarks run successfully.

Fixes #1783.
2024-01-19 16:17:38 +02:00
c0329acde8 server : implement "verbose_json" format with token details (#1781)
* examples/server: implement "verbose_json" format with token details.

This is intended to mirror the format of openai's Python
whisper.transcribe() return values.

* server: don't write WAV to a temporary file if not converting

* server: use std::lock_guard instead of manual lock/unlock
2024-01-18 22:58:42 +02:00
fb466b3417 ggml : sync ggml-metal.m 2024-01-18 11:03:13 +02:00
1f50a7d29f sync : llama.cpp 2024-01-17 21:23:33 +02:00
1de21b913d sync : ggml 2024-01-17 21:22:38 +02:00
4aea058e5a ggml : add IQ2 to test-backend-ops + refactoring (llama/4990)
* ggml : add IQ2 to test-backend-ops + refactoring

ggml-ci

* cuda : update supports_op for IQ2

ggml-ci

* ci : enable LLAMA_CUBLAS=1 for CUDA nodes

ggml-ci

* cuda : fix out-of-bounds-access in `mul_mat_vec_q`

ggml-ci

* tests : avoid creating RNGs for each Q tensor

ggml-ci

* tests : avoid creating RNGs for each tensor

ggml-ci
2024-01-17 21:21:10 +02:00
fd10234363 imatrix : offload to GPU support (llama/4957)
* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when user don't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* imatrix : offload to GPU support

* imatrix : fix ggml_mul_mat_id hanlding

ggml-ci

* ci : add imatrix test

ggml-ci

* ci : rearrange output

ggml-ci
2024-01-17 21:21:10 +02:00
8fb5c6a409 backend : add eval callback (llama/4935)
* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when user don't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* simple : no need for ggml_is_contiguous + fix bool parse

* llama : fix callback placement in llama_context_params

* backend : avoid double-ask callback calls

* simple : restore examples, imatrix will serve as a demo
2024-01-17 21:21:10 +02:00
2fe5fbfcc2 metal : create autorelease pool during library build (llama/4970)
* metal : create autorelease pool during library build

ggml-ci

* test : simplify

ggml-ci
2024-01-17 21:21:10 +02:00
01637e1a4c ggml : importance matrix support for legacy quants (llama/4969)
* imatrix: adding support for legacy quants

* imatrix: guard Q4_0/Q5_0 against ffn_down craziness

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-17 21:21:10 +02:00
1b349eb1f9 metal : log recommendedMaxWorkingSetSize on iOS 16+ (llama/4936)
* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+

* Only log on iOS and macOS, ignoring tvOS and other platforms

* Check for Xcode version before using recommendedMaxWorkingSetSize

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-17 21:21:10 +02:00
138eaebead ggml : introduce GGML_CALL function annotation (llama/4850)
This change makes it possible to build ggml-cuda.cu and ggml-metal.m as
independent dynamic shared objects, that may be conditionally linked at
runtime in a multiplatform binary. It introduces a GGML_CALL annotation
that documents which functions have a cyclic call relationship, between
the application code and GPU modules.

This change does nothing, unless the build defines -DGGML_MULTIPLATFORM
which causes back-references and function pointers to conform to MS ABI
which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms
2024-01-17 21:21:09 +02:00
61b9192f27 cuda : fix dequantize kernel names (llama/4938) 2024-01-17 21:21:09 +02:00
161b51d91a CUDA: faster dequantize kernels for Q4_0 and Q4_1 (llama/4938)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-17 21:21:09 +02:00
f904b31a7d Add ability to use importance matrix for all k-quants (llama/4930)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-17 21:21:09 +02:00
f6614155e4 talk-llama : optional wake-up command and audio confirmation (#1765)
* talk-llama: add optional wake-word detection from command

* talk-llama: add optional audio confirmation before generating answer

* talk-llama: fix small formatting issue in output

* talk-llama.cpp: fix Windows build
2024-01-16 15:52:01 +02:00
f5f159c320 server : fix building and simplify lib deps on Windows (#1772)
* make : fix server example building on MSYS2 environments (Windows)

It was not working since commit eff3570f78
when server was introduced.

* cmake : simplify server example lib deps on Windows

server uses httplib::Server, not httplib::SSLServer, so there is no need
to mention cryptographic libraries in target_link_libraries.
Winsock (ws2_32) suffices here.

Also use plain library names like we use in other places.
2024-01-15 15:48:13 +02:00
6ebba525f1 talk-llama : sync llama.cpp 2024-01-14 18:08:20 +02:00
2a5874441d talk-llama : llama.cpp 2024-01-14 11:06:28 +02:00
d08445c9ad sync : ggml 2024-01-14 10:55:18 +02:00
4a945696cb metal : correctly set SIMD support flags on iOS (llama/4923)
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad

* log a little bit more info on iOS
2024-01-14 10:54:09 +02:00
dabc964d83 2-bit quantizations (llama/4897)
* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 10:54:09 +02:00
654baf693d scripts : sync-ggml-am.sh add option to skip commits 2024-01-14 10:53:19 +02:00
f001a3b7b6 talk-llama : sync llama.cpp 2024-01-14 00:13:17 +02:00
c615f2c335 sync : ggml 2024-01-14 00:12:17 +02:00
d839dd0242 examples : adapt to metal API 2024-01-14 00:11:45 +02:00
435847891c ggml: cache sin/cos for RoPE (llama/4908) 2024-01-14 00:11:45 +02:00
182f290808 metal : remove old API (llama/4919)
ggml-ci
2024-01-14 00:11:45 +02:00
447dfc11fc metal : disable log for loaded kernels (llama/4794) 2024-01-14 00:11:45 +02:00
9aa9f3b84e gguf : fix potential infinite for-loop (llama/4600)
Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>
2024-01-14 00:11:44 +02:00
396ebd1e80 metal : refactor kernel loading code (llama/4794)
* metal : detect more GPU families

* metal : refactor kernel loading

* metal : set kernel family requirements

* metal : fix kernel init + fix compile options

* metal : take into account simdgroup reduction support

* metal : print only skipped kernels

* metal : fix check for simdgroup reduction support

* metal : check for Metal 3

* metal : free allocations

* metal : normalize encoder:setComputePipelineStatus calls

ggml-ci

* metal : fix Metal3 family check

ggml-ci

* metal : check for simdgroup matrix mul. feature

ggml-ci
2024-01-14 00:11:44 +02:00
12490f4398 CUDA: faster q8_0 -> f16 dequantization (llama/4895) 2024-01-14 00:11:44 +02:00
db078a9ba8 talk-llama : add optional CLI arg to set the bot name (#1764) 2024-01-13 20:51:35 +02:00
a13a7da5ad examples : add python example for transcription (#1744)
* rebase and add simple python interface

* moved python files to examples/python
2024-01-13 19:37:18 +02:00
519f8e8684 whisper : load the model into multiple buffers of max size 1GB (#1763) 2024-01-13 17:47:40 +02:00
40ae0962f4 talk-llama : sync llama.cpp 2024-01-12 22:04:51 +02:00
1560288048 sync : ggml 2024-01-12 21:56:50 +02:00
1ad6fafd91 backend_sched : fix assignments
ggml-ci
2024-01-12 21:55:42 +02:00
70840aed5f llama : ggml-backend integration (llama/4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (llama/4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (llama/4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 21:55:42 +02:00
b24d18feb9 CUDA: fix softmax compile for old CUDA versions (llama/4862) 2024-01-12 21:55:41 +02:00
3fa98f4395 Importance Matrix calculation (llama/4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 21:55:41 +02:00
d05b7ee90e models : make all scripts to be POSIX Compliant (#1725)
* download-coreml-model: make it POSIX-compliant

* download-ggml-model: posix compliant (2nd)

* minor edit

* forgot to add newline

* generate-coreml-interface: far more straightforward

* generate-coreml-model: done with the posix thingy

* typo

* Update download-ggml-model.sh

* fix

* fix typo

* another fix

* Update download-coreml-model.sh

* Update download-ggml-model.sh

* Update download-coreml-model.sh
2024-01-12 14:11:04 +02:00
6dcee35129 ggml : fix 32-bit ARM compat for IQ2_XS (#1758)
* ggml : fix 32-bit ARM compat

* ggml : fix fix

* ggml : fix fix fix
2024-01-12 14:02:30 +02:00
5cb345f5e9 go : add SetInitialPrompt method to bindings (#1753) 2024-01-12 13:44:50 +02:00
fbcb52d3cd server : add more parameters to server api (#1754)
* feat(server): add more parameters to server api

* fix(server): reset params to original parsed values for each request
2024-01-12 13:42:52 +02:00
6b01e3fedd whisper : fix segment length with params.no_timestamps == true 2024-01-12 13:37:38 +02:00
f7908f9bb8 params : don't compute timestamps when not printing them (#1755) 2024-01-12 13:24:38 +02:00
00b7a4be02 talk-llama : sync llama.cpp 2024-01-11 22:10:10 +02:00
04b0a768b8 swift : remove local ggml.h reference 2024-01-11 22:00:12 +02:00
87670425f2 swift : track ggml release branch 2024-01-11 21:57:40 +02:00
32e71a1861 sync : ggml 2024-01-11 21:54:17 +02:00
9c857cf280 sync : llama.cpp 2024-01-11 21:50:01 +02:00
97b12212dd ggml : SOTA 2-bit quants (add IQ2_XS) (llama/4856)
* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dit product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:50:01 +02:00
9fa34d79ec metal : put encoder debug group behind a define (llama/4873) 2024-01-11 21:50:01 +02:00
a0a64a19dd metal : improve dequantize precision to match CPU (llama/4836)
ggml-ci
2024-01-11 21:50:01 +02:00
bbc23611fa ggml : fix vld1q_s8_x4 32-bit compat (llama/4828)
* ggml : fix vld1q_s8_x4 32-bit compat

ggml-ci

* ggml : fix 32-bit ARM compat (cont)

ggml-ci
2024-01-11 21:50:01 +02:00
e9783a1fb4 CUDA: faster softmax via shared memory + fp16 math (llama/4742) 2024-01-11 21:50:01 +02:00
9e0cc28792 metal : fix deprecation warning (ggml/690) 2024-01-11 21:50:00 +02:00
73072a7c73 ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693) 2024-01-11 21:50:00 +02:00
a8ba1262ff metal : wrap each operation in debug group (ggml/690) 2024-01-11 21:50:00 +02:00
e66a9a7806 ggml : change GGML_MAX_NAME at compile time (ggml/682)
* change GGML_MAX_NAME to 128

* allow controlling the value of GGML_MAX_NAME through external macro definitions
2024-01-11 21:50:00 +02:00
338442d773 Fix execlp call (ggml/689)
NULL can be an integer constant expression with the value zero, in this case the behavior would be undefined because of an incorrect type being passed to the variable arguments.
2024-01-11 21:50:00 +02:00
10651bddf6 SOTA 2-bit quants (llama/4773)
* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change later, for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slighty faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slighty faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in qequantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:50:00 +02:00
53d4d0b30d CUDA: fixed redundant value dequantization (llama/4809) 2024-01-11 21:50:00 +02:00
2865e4710b ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (llama/4787) 2024-01-11 21:50:00 +02:00
c46a74a19d ggml : do not sched_yield when calling BLAS (llama/4761)
* ggml : do not sched_yield when calling BLAS

ggml-ci

* ggml : fix do_yield logic

ggml-ci

* ggml : simplify do_yield logic

ggml-ci
2024-01-11 21:50:00 +02:00
46dc49a6a1 ggml : include stdlib.h before intrin.h (llama/4736) 2024-01-11 21:49:59 +02:00
cc7f872131 swift : checkout ggml commit instead of branch (#1750) 2024-01-10 18:12:06 +02:00
bcc1658cd0 talk-llama : add optional Piper TTS support (#1749)
Add commented-out command to optionally use Piper (https://github.com/rhasspy/piper) as text-to-speech solution for the talk-llama example. Piper voices sound almost like real people which is a big improvement (e.g.) from something like espeak.
2024-01-10 16:15:28 +02:00
c46886f599 server : add request path option(#1741) 2024-01-08 22:39:51 +00:00
29f78392c1 main : add cli option to disable system prints (#1740) 2024-01-08 16:41:28 +02:00
022756a872 server : fix server temperature + add temperature_inc (#1729)
* server : fix server temperature + add temperature_inc

* server : change dashes to underscores in parameter names
2024-01-07 13:35:14 +02:00
3b8c2dff57 talk-llama : sync latest llama.cpp 2024-01-06 17:22:57 +02:00
0b9af32a8b release : v1.5.4 2024-01-05 17:11:27 +02:00
11b1b63b14 fix : cuda order of synchronization when setting a buffer (ggml/679)
* fix : cuda order of synchronization when setting a buffer

* also sync before memcpy

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-05 17:01:59 +02:00
0e26a6c92e metal : switch back to default.metallib (ggml/681)
ggml-ci
2024-01-05 16:31:30 +02:00
66d8f0b7f1 ggml : fix q2_k bpw in comments (ggml/680) 2024-01-05 16:31:20 +02:00
ba5bcde874 coreml : fix ANE optimized encoder (#1716) 2024-01-04 16:28:30 +02:00
ab0a8593c5 whisper.swiftui : add .gitignore 2024-01-04 15:00:27 +02:00
668ffc9b23 whispser : reset the "batched" timings (#1721) 2024-01-04 13:38:39 +02:00
9962371f71 release : v1.5.3 2024-01-03 19:36:33 +02:00
993acb5d41 swift : update Package.swift to use ggml as package dependency (#1701)
* updates Package.swift to use ggml as dependency

* cleans up the Package.swift file by removing redundant source files

* updates ggml url src to ggerganov
2024-01-03 19:30:26 +02:00
a3d0aa73d1 ggml : add error handling to graph_compute (#1714) 2024-01-03 15:39:43 +02:00
14c57952f7 cuda : simplify expression
Co-authored-by: slaren <slarengh@gmail.com>
2024-01-03 14:43:51 +02:00
6c369d6788 cuda : mark I16 and I32 ops as unsupported
ggml-ci
2024-01-03 14:43:51 +02:00
4cdd9aad9b metal : add kernel_get_rows_i32
ggml-ci
2024-01-03 14:43:51 +02:00
f38c057503 metal : optimize ggml_mul_mat_id (faster Mixtral PP) (llama/4725)
* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (llama/4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

* metal : optimizing ggml_mul_mat_id (wip)

* metal : minor fix

* metal : opt mul_mm_id
2024-01-03 14:43:51 +02:00
1e5544b39b metal : enable shader debugging (cmake option) (llama/4705)
* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (llama/4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

ggml-ci
2024-01-03 14:43:51 +02:00
d5673af79f ggml : add ggml_vdotq_s32 alias (llama/4715)
ggml-ci
2024-01-03 14:43:51 +02:00
a28dacec65 CUDA: fixed tensor cores not being used on RDNA3 (llama/4697) 2024-01-03 14:43:51 +02:00
dbe29d4e33 ggml : add ggml_cpu_has_avx_vnni() (llama/4589)
* feat: add avx_vnni based on intel documents

* ggml: add avx vnni based on intel document

* llama: add avx vnni information display

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* Update ggml.c

Fix indentation upgate

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03 14:43:51 +02:00
fe3a67c546 CUDA: fix tensor core logic for Pascal and HIP (llama/4682) 2024-01-03 14:43:51 +02:00
b138ff2be3 cuda: fix vmm oom issue on NVIDIA AGX Orin (llama/4687)
Signed-off-by: hydai <hydai@secondstate.io>
2024-01-03 14:43:51 +02:00
cf6f1e4181 ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (ggml/639)
* add more int ops

* ggml_compute_forward_dup_bytes

* add tests

* PR comments

* tests : minor indentations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03 14:43:51 +02:00
620a223814 scripts : fix sync order + metal sed 2024-01-03 14:43:51 +02:00
f39f9690ec examples : fix WASM Stack Overflow (#1713)
Fix for problem:

"""
RuntimeError: Aborted(Stack overflow! Stack cookie has been overwritten at 0x12be2b10, expected hex dwords 0x89BACDFE and 0x2135467, but received 0x00000000 0x00000000)
"""

That appears when executing the WASM example with the newer versions.
2024-01-02 16:50:04 +00:00
f9ca90256b docker : fix the publishing of the CUDA Docker image (#1704) 2023-12-30 23:12:31 +02:00
2623640cd6 scripts : do not sync commits from this repo 2023-12-29 15:03:08 +02:00
d87de61ae6 ci : build with CLBlast + ggml-opencl use GGML_API (#1576)
* Build with CLBlast

* Declare GGML_API

After rebasing, examples/talk-llama failed:

"D:\a\whisper.cpp\whisper.cpp\build\ALL_BUILD.vcxproj" (build target) (1) ->
"D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj" (default target) (14) ->
(Link target) ->
  llama.obj : error LNK2019: unresolved external symbol ggml_cl_free_data referenced in function "public: __cdecl llama_model::~llama_model(void)" (??1llama_model@@QEAA@XZ) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
  llama.obj : error LNK2019: unresolved external symbol ggml_cl_transform_tensor referenced in function "public: void __cdecl llama_model_loader::load_all_data(struct ggml_context *,void (__cdecl*)(float,void *),void *,struct llama_mlock *)" (?load_all_data@llama_model_loader@@QEAAXPEAUggml_context@@P6AXMPEAX@Z1PEAUllama_mlock@@@Z) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
  D:\a\whisper.cpp\whisper.cpp\build\bin\Release\talk-llama.exe : fatal error LNK1120: 2 unresolved externals [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
2023-12-29 12:23:27 +02:00
f5f485f899 whisper : replace tensor->n_dims with ggml_n_dims(tensor) (#1694) 2023-12-29 11:38:35 +02:00
e77b27c331 sync : ggml (VMM, sync-ggml-am, dotprod ARM fixes, CUDA fixes) (#1691)
* scripts : add sync-ggml-am.sh

* sync : ggml (VMM, ARM dot prod fix, etc.)

* build : fix CUDA build

* ggml : fix some mul mat cases + add tests for src1 F16

dbd02958fa
2023-12-29 11:30:47 +02:00
a5cc3dc8a2 download : fix large q5 model name (#1695)
fixed typo in large-v3-q5-0 model name to match HF link
2023-12-29 11:14:32 +02:00
37a709f655 whisper : Replace WHISPER_PRINT_DEBUG with WHISPER_LOG_DEBUG (#1681) 2023-12-23 12:02:58 +00:00
3a5302108d sync : ggml (ggml_scale, ggml_row_size, etc.) (#1677)
* sync : ggml

* sync : llama.cpp

* talk-llama : fix obsolete param

* ggml-alloc : fix ggml_tallocr_is_own

* talk.wasm : update to new ggml

* ggml : fix type punning in ggml_scale

* ggml : cuda jetson + arm quants warnings
2023-12-22 17:53:39 +02:00
d2ee117a0a docker : Dockerize whisper.cpp (#1674)
* build: add dockerfile for ci

* ci: add action to build/push docker image

* fix: lowercase repository to fix ci

* ci: update cuBLAS flag

* build: install curl and ffmped in image

* docs: add docker section

* fix: improve args check when download model
2023-12-22 11:16:02 +00:00
db8ccdb850 CI : Add coverage for talk-llama when WHISPER_CUBLAS=1 (#1672) 2023-12-21 22:39:46 +00:00
d2419030b0 examples : Revert CMakeLists.txt for talk-llama (#1669) 2023-12-21 22:48:52 +02:00
8986690c2a cmake : set default CUDA architectures (#1667) 2023-12-21 15:44:04 +02:00
9286d3f584 bench.py : add different large models (#1655)
Amend different large v1,v2,v3 models to benchmark.
2023-12-19 12:40:14 +02:00
940de9dbe9 wchess : update README.md 2023-12-14 22:00:47 +02:00
88112c8afb release : v1.5.2 2023-12-14 17:56:39 +02:00
375585c07c wchess : update readme 2023-12-14 17:51:14 +02:00
fd99ece8e3 wchess : whisper assisted chess (#1595)
* wchess: whisper assisted chess

* wchess: fix allowed moves in check

* wchess: touchstart, touchend events

* wchess: css, disabled button

* wchess : html touches

* wchess : minor fixes and code style

* wchess : bump encoder context to 1280

* wchess : index.html

* wchess : fix CI warnings

* wchess : add array header

* wchess : build static library

* wchess : display grammar

* wchess : update UX

* wchess : add comment

* wchess : add README

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-14 15:58:26 +02:00
8171e621fc sync : ggml (Metal fixes, new ops, tests) (#1633)
* sync : ggml (Metal fixes, new ops, tests)

* cuda : fix bin bcast when src1 and dst have different types
2023-12-13 21:55:03 +02:00
ec03661b20 cmake : target windows 8 or above for prefetchVirtualMemory in llama-talk (#1617)
Since we use prefetchVirtualMemory we specify we target win 8 or above, otherwise other compilers will refuse to use the prefetchVirtualMemory api, (I understand you are loading it dynamically but the header definition has this limitation)
2023-12-12 11:35:00 +00:00
6335933a5b cmake : Fix bug in httplib.h for mingw (#1615)
Fix bug in httlib.h for mingw, please see https://github.com/yhirose/cpp-httplib/issues/1669
2023-12-10 17:47:52 +00:00
885b5563d0 metal : fix ggml_metal_log vargs (#1606) 2023-12-08 13:50:50 +02:00
9521ba6801 whisper.objc : disable timestamps for real-time transcription 2023-12-08 13:43:37 +02:00
29511d33c7 whisper : more debug messages + fix fallback logic 2023-12-08 13:43:12 +02:00
7bc4d22337 metal : fix soft_max kernel src1 argument (#1602) 2023-12-08 13:39:32 +02:00
afce6fa113 sync : ggml (new ops, new backend, etc) (#1602)
* sync : ggml (new ops, new backend, etc)

* whisper : remove obsolete broadcasting code

* ggml : remove backend self-registers + fix ggml_concat + n_task logic

* metal : fix assert

* metal : print resource path

* whisper : fix bug if metal init fails
2023-12-07 22:27:19 +02:00
3163090d89 server : pass max-len argument to the server (#1574)
This commit fixes the missing parameter binding for max-len between the input
arguments and wparams.
2023-12-05 23:01:45 +02:00
f0efd0202d ios : Remove #if arch(arm) check for using Metal (#1561) 2023-12-05 01:14:26 +00:00
3c28d1a571 ggml : Fix 32-bit compiler warning (#1575)
Warning about %lu on 32-bit targets. Updated to %zu.
2023-12-03 14:15:28 +00:00
e369243ebd ggml : re-enable blas for src0 != F32 (#1583) 2023-12-01 23:57:52 +02:00
a0ec3fac54 Server : Add support for .vtt format to Whisper server (#1578)
- The code comes from examples/main
- The output mimetype is set to text/vtt

Example usage:
```shell
curl 127.0.0.1:8080/inference \
-H "Content-Type: multipart/form-data" \
-F file="@samples/jfk.wav" \
-F temperature="0.2" \
-F response-format="vtt"
```
2023-11-30 23:44:26 +00:00
6559b538e5 server : backport .srt output format (#1565)
This commit adds a support of .srt format to Whisper server. The code is
effectively backported from examples/main. The output mimetype is set to
application/x-subrip as per https://en.wikipedia.org/wiki/SubRip.

Example usage:

  curl 127.0.0.1:8080/inference \
    -H "Content-Type: multipart/form-data" \
    -F file="@<file-path>" \
    -F temperature="0.2" \
    -F response-format="srt"
2023-11-28 15:42:58 +02:00
73d5005880 cmake : install required ggml.h header (#1568) 2023-11-28 15:41:49 +02:00
6b094b6dfe server : set default CORS headers to allow all (#1567) 2023-11-28 11:55:20 +02:00
641f2f4282 readme : update help (#1560) 2023-11-27 12:04:08 +02:00
bfacd9f8ce CI : Add CUDA 11.8.0 support (#1554)
* try to fix cublas build in CI

* add multiple cuda-toolkit version

* Update build.yml

* Disable CUDA-toolkit 10.2.89
2023-11-27 12:03:16 +02:00
f52e74d4dc CI : Rectify the Clang-Related workflow issues (#1551)
* fix bugs in workflow

* fix missing clang in workflow

* Update build.yml
2023-11-27 11:35:37 +02:00
23c21e92eb server : automatically convert audio on the server (#1539)
* server : automatically convert audio on the server

* server : remove rebundant comments

* server : automatic conversion refactor

* server : update server readme

* server : remove unnecessary comments and tabs

* server : put back remove calling

* server : apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : check ffmpeg before the server lunch

* server : fix indentation

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : fix function typo calling

* server : fix function typo calling

* server : add warning in readme

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-27 11:28:34 +02:00
447d49530c whisper : remove trailing whitespaces 2023-11-24 13:13:21 +02:00
9d6ebd877c release : v1.5.1 2023-11-24 12:41:55 +02:00
0ba365f958 metal : add backend function to check device family support (#1547) 2023-11-24 12:37:08 +02:00
010c8ec3ab cuda : sync some minor stuff from llama.cpp (#1548) 2023-11-24 12:36:21 +02:00
ffdb5c4735 whisper : fix typo 2023-11-24 09:45:10 +02:00
a5881d619c server : add --print-realtime param (#1541)
* server : add --print-realtime param

* Fix duplicate realtime output
2023-11-24 09:35:02 +02:00
34f70b3a56 whisper : add whisper_lang_str_full (#1546)
* Update whisper.h

add whisper_lang_fullstr to retrieve the full language name

* Update whisper.cpp

add whisper_lang_fullstr to return the full language name

* fullstr -> str_full

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-24 09:33:13 +02:00
8328d1900f fix(server): typo in temperature parameter (#1545)
Also fixed another typo in comments.
2023-11-23 20:59:36 +02:00
d2bd5f0bdc metal : fix build (#1544) 2023-11-23 20:20:53 +02:00
34209a37a2 readme : add server example 2023-11-23 17:20:33 +02:00
180e062eda go : fixed Makefile for MacOS ARM 64 (#1530)
* Fixed Makefile for MacOS ARM 64 based on https://github.com/ggerganov/whisper.cpp/issues/1344 + proper ggml-metal env var setting

* conditional to fix broken non-macos compilation

* spaces -> tab

* make : fix whitespaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-22 18:08:11 +02:00
5c7be85fdc Change temp file name for server application (#1535)
Avoid issue of removing file if it exists in the current working
directory
2023-11-22 09:23:36 +01:00
146169ec38 bench : pass memcpy threads from cli 2023-11-21 22:27:22 +02:00
9befab5ab9 bench : multi-thread memcpy (#1534) 2023-11-21 22:07:30 +02:00
9ac88f2b57 Close file after writing in server application (#1533)
Fix of mistake leaving file open while reading it again as wav
2023-11-21 20:36:10 +01:00
46f5b6cb08 server : add video to readme 2023-11-21 17:30:43 +02:00
eff3570f78 server : add a REST Whisper server example with OAI-like API (#1380)
* Add first draft of server

* Added json support and base funcs for server.cpp

* Add more user input via api-request

also some clean up

* Add reqest params and load post function

Also some general clean up

* Remove unused function

* Add readme

* Add exception handlers

* Update examples/server/server.cpp

* make : add server target

* Add magic curl syntax

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-20 21:40:24 +02:00
fa19bc4195 whisper : update example in whisper.h (#1529)
update the example in the header, previous examples deprecated.
2023-11-20 20:52:27 +02:00
a01b2e0971 sdl : fix audio callback (#1523) 2023-11-20 13:16:38 +02:00
8159a9ab99 whisper : reuse whisper_decode_with_state (#1521) 2023-11-20 13:16:11 +02:00
7516d9c16d ci : redistribute CUDA DLLs (#1522)
see https://docs.nvidia.com/cuda/eula/index.html#attachment-a
2023-11-19 12:43:22 +02:00
46cc26d1b9 whisper : fix with_state methods to use the correct state (#1519)
Co-authored-by: Sandro Hanea <sandrohanea@microsoft.com>
2023-11-19 11:25:30 +02:00
f784f9fa12 whisper : fix overriding the audio context 2023-11-19 10:32:32 +02:00
ca23f8ee6d cuda : assert ggml_add sources to be contiguous 2023-11-19 10:32:08 +02:00
e2f0eba2d4 ios : sync submodule 2023-11-17 10:42:04 +02:00
d4353e48f7 sync : ggml (ggml-alloc + linker + gguf fixes) (#1501) 2023-11-17 10:00:07 +02:00
bebf0da983 quantize : add support for K-quant types 2023-11-16 16:18:24 +02:00
848e54f3ad bench : fix memcpy bench size 2023-11-16 10:59:32 +02:00
7883d1cae4 talk-llama : improve quote and backtick handling (#1364)
* ISSUE-1329: replace " with ' so it doesn't try to execute code in backticks.

* Typo

* Update to keep possessives in the output

Closes the ' then puts a ' in quotes then reopens the ' to escape the ' characters.
2023-11-16 10:34:05 +02:00
ccc85b4ff8 talk-llama : enable GPU by default 2023-11-15 21:33:00 +02:00
c7606b47df models : add info about distilled models 2023-11-15 21:10:13 +02:00
d38af151a1 release : v1.5.0 2023-11-15 21:02:52 +02:00
94267df08e bench-all : add distil models 2023-11-15 20:49:12 +02:00
8713c67133 js : latest whisper.js 2023-11-15 20:10:16 +02:00
57a60639bb bench-all : indentations 2023-11-15 20:01:15 +02:00
bfbaa4dce5 whisper : make large version explicit + fix data size units (#1493) 2023-11-15 19:42:25 +02:00
1d79e78402 java : fix test (#1492) 2023-11-15 17:42:53 +02:00
b6c5f49b78 whisper : add batched decoding (#1486)
* whisper : add whisper_batch

* whisper : move kv_self to whisper_state

* whisper : full batched decoding support

* whisper : fix memory leak in whisper_batch

* whisper : fix mem leak again + remove oboslete function

* whisper : clear kv cache when using whisper_decode API

* whisper : speed-up sampling

* whisper : fix decoders initializer

* bench : add batch size 5 bench

* whisper : add comment about the KV cache size

* whisper : add check for max number of decoders

* whisper : avoid starting sampling threads with bs=1

* whisper : enable beam-search by default

* cuda : sync llama.cpp fixes
2023-11-15 16:12:52 +02:00
d4231649e6 java : use tiny.en for tests (#1484)
* java : use tiny.en for tests

* java : try to fix full params struct
2023-11-13 16:53:55 +02:00
3e5c7feeff whisper : add grammar-based sampling (#1229)
* whisper : add grammar-based sampling

* build : fix after master merge

* command : fix exception when recognizing the command

* whisper : fine-tuning grammar functionality

* command : grammar-related improvements

- option to read grammar from file
- add sample grammars for colors and chess moves
- fine-tune the performance further

* grammars : add assistant + update comments

* command : enable beam-search, add "no_timestamps", add "context", add p

* whisper : remove comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-13 10:51:34 +02:00
c23598e4ca talk-llama : add n_gpu_layers parameter (#1475) 2023-11-13 10:04:16 +02:00
54a08bde29 examples : add whisper.android.java for compatibility with older Android versions using Java (#1382)
* save the recorded audio to a file

* Alignment -help

* Save the correct audio

* chage to a consistent coding style

* Correct typo

* Update examples/stream/stream.cpp

* Update examples/stream/stream.cpp

* Correct variable misuse

* Update examples/stream/stream.cpp

* Update examples/stream/stream.cpp

* Update examples/stream/stream.cpp

* Update examples/stream/stream.cpp

* add *.bin .cxx/ .gradle/ cmake-build-debug/ to gitignore

* add whisper.android.java

* Added support for older versions of Android of Java

* add examples for android java

* add README.md for android java

* add fullTranscribeWithTime

* 增加 toString()方法和测试

* change return type to void

* update to v1.4.1

* add WhisperService

* chage to whisper_full_get_segment_t1

* add method transcribeDataWithTime

* modified toString
```
return "[" + start + " --> " + end + "]:" + sentence;
```

* Optimize code logic

* update text view on handle

* set max lines

* change Chinese to English

* Update bindings/java/build.gradle

* Update .gitignore

* add android.java to github action

* chage android.java to   android_java in build.yml

* remove gradle

* chage jdk to temurin in android_java of CI

* chage jdk to temurin 11 in android_java of CI

* add x to gradlew

* set api-level for android_java of CI

* Update examples/whisper.android.java/app/src/main/jni/whisper/CMakeLists.txt

* add ndk version in build.gradle

* remove local.properties

* add testFullTranscribeWithTime

---------

Co-authored-by: litongmacos <litongjava@qq.com>
Co-authored-by: bobqianic <129547291+bobqianic@users.noreply.github.com>
2023-11-12 18:31:58 +02:00
9f8bbd3fee readme : update comment about source code 2023-11-12 17:47:37 +02:00
3172006a24 ggml : fix some compile warnings 2023-11-12 16:36:20 +02:00
684bc8bd70 readme : update GPU / CUDA 2023-11-12 15:40:37 +02:00
b0502836b8 whisper : add full CUDA and Metal offloading (#1472)
* whisper : migrate to ggml-backend

* whisper : fix logit reading

* whisper : fix tensor allocation during load

* whisper : fix beam-search with CUDA

* whisper : free backends + fix compile warning

* whisper : print when CUDA is enabled

* whisper : fix CoreML

* make : clean-up

* talk : fix compile warning

* whisper : support ggml_conv with CUDA and Metal (#1473)

* ggml : add CUDA support for ggml_conv

* whisper : remove ggml_repeat for conv bias + single backend

* cuda : fix im2col kernel

* metal : add im2col support + mul mat-vec f16 x f16

* bench-all : add q4 models

* whisper : clean-up

* quantize-all : fix

* ggml : im2col opts

* whisper : avoid whisper_model_data wrapper

* whisper : add note that ggml_mul_mat_pad does not work with CUDA

* whisper : factor out graph compute in common function

* whisper : fixes

* whisper : fix UB with measure buffers

* whisper : try to fix the parallel whisper_state functionality (#1479)

* whisper : try to fix the parallel whisper_state functionality

* whisper : fix multi-state Metal

* whisper : free backend instances in whisper_state
2023-11-12 15:31:08 +02:00
ec7a6f04f9 whisper : return with error from whisper_encode_internal and whisper_decode_internal when abort callback is true (#1456)
Co-authored-by: Ben Nortier <ben@bjnortier.com>
2023-11-10 13:51:16 +02:00
37947203e6 talk-llama : add language auto detect (#1467)
* Add '-l auto' to talk-llama example

* Update examples/talk-llama/talk-llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-09 19:21:44 +02:00
953419c69a openvino : update convert-whisper-to-openvino.py to support v3 (#1459) 2023-11-09 12:42:39 +02:00
0de8582f65 coreml : use the correct n_mel value (#1458) 2023-11-08 20:01:41 +00:00
baeb733691 whisper : reset mel time when resetting timings (#1452)
Co-authored-by: Ben Nortier <ben@bjnortier.com>
2023-11-08 15:52:23 +02:00
d03c60dd7f ios : add support for Swift Package Manager (#1370)
* Add support for Swift

* Make it build in Xcode

* Use the SPM package in the SwiftUI example app
2023-11-07 23:53:31 +02:00
6a5d195109 release : v1.4.3 2023-11-07 16:15:48 +02:00
0cbef75422 ggml : fix MIN / MAX macro re-definition 2023-11-07 16:08:46 +02:00
2cdfc4e025 whisper : add support for large v3 (#1444)
* whisper : add support for large v3

* bench : fix build + fix go bindings

* bench : fix n_mels

* models : update readme
2023-11-07 15:30:18 +02:00
973111088b android : decouple example into a library and app module (#1445) 2023-11-07 14:27:33 +02:00
11b503055e whisper : reset ctx->t_start_us when calling whisper_reset_timings() (#1434)
Co-authored-by: Ben Nortier <ben@bjnortier.com>
2023-11-07 11:04:32 +02:00
718 changed files with 200323 additions and 65378 deletions

View File

@ -12,7 +12,7 @@ FROM ${BASE_CUDA_DEV_CONTAINER} as build
ARG CUDA_DOCKER_ARCH=all
RUN apt-get update && \
apt-get install -y build-essential git cmake
apt-get install -y build-essential git cmake libsdl2-dev
WORKDIR /app
@ -21,7 +21,7 @@ COPY . .
# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable cuBLAS
ENV WHISPER_CUBLAS=1
ENV GGML_CUDA=1
RUN make

View File

@ -0,0 +1,40 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=12.3.1
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the CUDA runtime image
ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}
FROM ${BASE_CUDA_DEV_CONTAINER} AS build
WORKDIR /app
# Unless otherwise specified, we make a fat build.
ARG CUDA_DOCKER_ARCH=all
# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable cuBLAS
ENV GGML_CUDA=1
RUN apt-get update && \
apt-get install -y build-essential libsdl2-dev \
&& rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*
# Ref: https://stackoverflow.com/a/53464012
ENV CUDA_MAIN_VERSION=12.3
ENV LD_LIBRARY_PATH /usr/local/cuda-${CUDA_MAIN_VERSION}/compat:$LD_LIBRARY_PATH
COPY .. .
RUN make
FROM ${BASE_CUDA_RUN_CONTAINER} AS runtime
ENV CUDA_MAIN_VERSION=12.3
ENV LD_LIBRARY_PATH /usr/local/cuda-${CUDA_MAIN_VERSION}/compat:$LD_LIBRARY_PATH
WORKDIR /app
RUN apt-get update && \
apt-get install -y curl ffmpeg \
&& rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*
COPY --from=build /app /app
ENTRYPOINT [ "bash", "-c" ]

19
.devops/main.Dockerfile Normal file
View File

@ -0,0 +1,19 @@
FROM ubuntu:22.04 AS build
WORKDIR /app
RUN apt-get update && \
apt-get install -y build-essential \
&& rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*
COPY .. .
RUN make
FROM ubuntu:22.04 AS runtime
WORKDIR /app
RUN apt-get update && \
apt-get install -y curl ffmpeg libsdl2-dev \
&& rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*
COPY --from=build /app /app
ENTRYPOINT [ "bash", "-c" ]

View File

@ -13,10 +13,10 @@ jobs:
ubuntu-latest:
runs-on: ubuntu-latest
steps:
- uses: actions/setup-go@v3
- uses: actions/setup-go@v5
with:
go-version: '^1.19'
- uses: actions/checkout@v1
go-version: '^1.23'
- uses: actions/checkout@v4
- run: |
cd bindings/go
make test

View File

@ -3,20 +3,73 @@ on:
push:
paths:
- bindings/ruby/**
- whisper.h
- src/whisper.cpp
- include/whisper.h
- ggml/src/ggml.c
- ggml/src/ggml-impl.h
- ggml/src/ggml-aarch64.h
- ggml/src/ggml-aarch64.c
- ggml/src/ggml-alloc.c
- ggml/src/ggml-backend-impl.h
- ggml/src/ggml-backend.cpp
- ggml/src/ggml-common.h
- ggml/src/ggml-quants.h
- ggml/src/ggml-quants.c
- ggml/src/ggml-cpu-impl.h
- ggml/src/ggml-metal.m
- ggml/src/ggml-metal.metal
- ggml/src/ggml-blas.cpp
- ggml/include/ggml.h
- ggml/include/ggml-alloc.h
- ggml/include/ggml-backend.h
- ggml/include/ggml-cuda.h
- ggml/include/ggml-kompute.h
- ggml/include/ggml-metal.h
- ggml/include/ggml-sycl.h
- ggml/include/ggml-vulkan.h
- ggml/include/ggml-blas.h
- scripts/get-flags.mk
- examples/dr_wav.h
pull_request:
paths:
- bindings/ruby/**
- whisper.h
- src/whisper.cpp
- include/whisper.h
- ggml/src/ggml.c
- ggml/src/ggml-impl.h
- ggml/src/ggml-aarch64.h
- ggml/src/ggml-aarch64.c
- ggml/src/ggml-alloc.c
- ggml/src/ggml-backend-impl.h
- ggml/src/ggml-backend.cpp
- ggml/src/ggml-common.h
- ggml/src/ggml-quants.h
- ggml/src/ggml-quants.c
- ggml/src/ggml-cpu-impl.h
- ggml/src/ggml-metal.m
- ggml/src/ggml-metal.metal
- ggml/src/ggml-blas.cpp
- ggml/include/ggml.h
- ggml/include/ggml-alloc.h
- ggml/include/ggml-backend.h
- ggml/include/ggml-cuda.h
- ggml/include/ggml-kompute.h
- ggml/include/ggml-metal.h
- ggml/include/ggml-sycl.h
- ggml/include/ggml-vulkan.h
- ggml/include/ggml-blas.h
- scripts/get-flags.mk
- examples/dr_wav.h
jobs:
ubuntu-latest:
runs-on: ubuntu-latest
defaults:
run:
working-directory: bindings/ruby
steps:
- uses: ruby/setup-ruby@v1
with:
ruby-version: '3.0'
- uses: actions/checkout@v1
- run: |
cd bindings/ruby/ext
ruby extconf.rb && make
- uses: actions/checkout@v4
- run: rake test

View File

@ -3,6 +3,7 @@ on: [push, pull_request]
env:
ubuntu_image: "ubuntu:22.04"
VCPKG_BINARY_SOURCES: "clear;x-gha,readwrite"
jobs:
ubuntu-latest:
@ -15,16 +16,17 @@ jobs:
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
uses: docker/setup-qemu-action@v3
- name: Build ${{ matrix.arch }}
run: |
docker run --platform ${{ matrix.arch }} --rm \
-v ${{ github.workspace }}:/workspace \
-w /workspace ${{ env.ubuntu_image }} /bin/sh -c '
set -e
apt update
apt install -y build-essential libsdl2-dev
make
@ -35,7 +37,7 @@ jobs:
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Dependencies
run: |
@ -52,13 +54,13 @@ jobs:
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Build
uses: cross-platform-actions/action@v0.15.0
uses: cross-platform-actions/action@v0.24.0
with:
operating_system: freebsd
version: '13.2'
version: '13.3'
run: |
sudo pkg update
sudo pkg install -y gmake sdl2
@ -76,16 +78,17 @@ jobs:
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
uses: docker/setup-qemu-action@v3
- name: Build ${{ matrix.arch }}
run: |
docker run --platform ${{ matrix.arch }} --rm \
-v ${{ github.workspace }}:/workspace \
-w /workspace ${{ env.ubuntu_image }} /bin/sh -c '
set -e
apt update
apt install -y build-essential cmake libsdl2-dev
cmake . -DWHISPER_SDL2=ON -DCMAKE_BUILD_TYPE=${{ matrix.build }}
@ -99,22 +102,26 @@ jobs:
fail-fast: false
matrix:
build: [Debug, Release]
arch: [linux/amd64, linux/arm64, linux/arm/v7, linux/ppc64le]
#arch: [linux/amd64, linux/arm64, linux/arm/v7, linux/ppc64le]
# TODO: arm/v7 disabled due to clang bug
# https://github.com/ggerganov/whisper.cpp/actions/runs/9657764109/job/26637633042?pr=2256#step:4:1990
arch: [linux/amd64, linux/arm64, linux/ppc64le]
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
uses: docker/setup-qemu-action@v3
- name: Build ${{ matrix.arch }}
run: |
docker run --platform ${{ matrix.arch }} --rm \
-v ${{ github.workspace }}:/workspace \
-w /workspace ${{ env.ubuntu_image }} /bin/sh -c '
set -e
apt update
apt install -y build-essential cmake libsdl2-dev
apt install -y clang build-essential cmake libsdl2-dev
cmake . -DWHISPER_SDL2=ON -DCMAKE_BUILD_TYPE=${{ matrix.build }} -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang
make
ctest -L gh --output-on-failure'
@ -130,22 +137,181 @@ jobs:
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
uses: docker/setup-qemu-action@v3
- name: Build ${{ matrix.arch }}
run: |
docker run --platform ${{ matrix.arch }} --rm \
-v ${{ github.workspace }}:/workspace \
-w /workspace ${{ env.ubuntu_image }} /bin/sh -c '
set -e
apt update
apt install -y build-essential cmake
cmake . -DCMAKE_BUILD_TYPE=Debug -DWHISPER_SANITIZE_${{ matrix.sanitizer }}=ON
make
ctest -L gh --output-on-failure'
ubuntu-22-cmake-sycl:
runs-on: ubuntu-22.04
strategy:
fail-fast: false
matrix:
dwhisper_sycl: [ON]
dcmake_c_compiler: [icx]
dcmake_cxx_compiler: [icpx]
arch: [linux/amd64, linux/arm64, linux/arm/v7, linux/ppc64le]
continue-on-error: true
steps:
- name: Clone
uses: actions/checkout@v4
- name: add oneAPI to apt
shell: bash
run: |
cd /tmp
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
rm GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo add-apt-repository "deb https://apt.repos.intel.com/oneapi all main"
- name: install oneAPI dpcpp compiler
shell: bash
run: |
sudo apt update
sudo apt install intel-oneapi-compiler-dpcpp-cpp
- name: install oneAPI MKL library
shell: bash
run: |
sudo apt install intel-oneapi-mkl-devel
- name: Clone
id: checkout
uses: actions/checkout@v4
- name: Build
id: cmake_build
run: |
source /opt/intel/oneapi/setvars.sh
mkdir build
cd build
cmake -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx ..
cmake --build . --config Release -j $(nproc)
ubuntu-22-cmake-sycl-fp16:
runs-on: ubuntu-22.04
strategy:
fail-fast: false
matrix:
dwhisper_sycl: [ON]
dcmake_c_compiler: [icx]
dcmake_cxx_compiler: [icpx]
arch: [linux/amd64, linux/arm64, linux/arm/v7, linux/ppc64le]
continue-on-error: true
steps:
- name: Clone
uses: actions/checkout@v4
- name: add oneAPI to apt
shell: bash
run: |
cd /tmp
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
rm GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo add-apt-repository "deb https://apt.repos.intel.com/oneapi all main"
- name: install oneAPI dpcpp compiler
shell: bash
run: |
sudo apt update
sudo apt install intel-oneapi-compiler-dpcpp-cpp
- name: install oneAPI MKL library
shell: bash
run: |
sudo apt install intel-oneapi-mkl-devel
- name: Clone
id: checkout
uses: actions/checkout@v4
- name: Build
id: cmake_build
run: |
source /opt/intel/oneapi/setvars.sh
mkdir build
cd build
cmake -DGGML_SYCL_F16=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx ..
cmake --build . --config Release -j $(nproc)
windows-msys2:
runs-on: windows-latest
strategy:
fail-fast: false
matrix:
include:
- { sys: UCRT64, env: ucrt-x86_64, build: Release }
- { sys: CLANG64, env: clang-x86_64, build: Release }
steps:
- name: Clone
uses: actions/checkout@v4
- name: Setup ${{ matrix.sys }}
uses: msys2/setup-msys2@v2
with:
update: true
msystem: ${{matrix.sys}}
install: >-
base-devel
mingw-w64-${{matrix.env}}-toolchain
mingw-w64-${{matrix.env}}-cmake
mingw-w64-${{matrix.env}}-SDL2
mingw-w64-${{matrix.env}}-openblas
- name: Build using make
shell: msys2 {0}
run: |
make -j $(nproc)
- name: Clean after building using make
shell: msys2 {0}
run: |
make clean
- name: Build using make w/ OpenBLAS
shell: msys2 {0}
run: |
make GGML_OPENBLAS=1 -j $(nproc)
- name: Build using CMake
shell: msys2 {0}
run: |
cmake -B build
cmake --build build --config ${{ matrix.build }} -j $(nproc)
- name: Clean after building using CMake
shell: msys2 {0}
run: |
rm -rf build
- name: Build using CMake w/ OpenBLAS
shell: msys2 {0}
run: |
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config ${{ matrix.build }} -j $(nproc)
windows:
runs-on: windows-latest
@ -162,14 +328,14 @@ jobs:
s2arc: x64
jnaPath: win32-x86-64
- sdl2: ON
s2ver: 2.26.0
s2ver: 2.28.5
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Add msbuild to PATH
uses: microsoft/setup-msbuild@v1
uses: microsoft/setup-msbuild@v2
- name: Fetch SDL2 and set SDL2_DIR
if: matrix.sdl2 == 'ON'
@ -194,14 +360,14 @@ jobs:
run: copy "$env:SDL2_DIR/../lib/${{ matrix.s2arc }}/SDL2.dll" build/bin/${{ matrix.build }}
- name: Upload dll
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
with:
name: ${{ matrix.jnaPath }}_whisper.dll
path: build/bin/${{ matrix.build }}/whisper.dll
- name: Upload binaries
if: matrix.sdl2 == 'ON'
uses: actions/upload-artifact@v1
uses: actions/upload-artifact@v4
with:
name: whisper-bin-${{ matrix.arch }}
path: build/bin/${{ matrix.build }}
@ -217,29 +383,31 @@ jobs:
sdl2: [ON]
include:
- arch: Win32
obzip: https://github.com/OpenMathLib/OpenBLAS/releases/download/v0.3.24/OpenBLAS-0.3.24-x86.zip
s2arc: x86
- arch: x64
obzip: https://github.com/OpenMathLib/OpenBLAS/releases/download/v0.3.24/OpenBLAS-0.3.24-x64.zip
s2arc: x64
- sdl2: ON
s2ver: 2.26.0
s2ver: 2.28.5
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Export GitHub Actions cache environment variables
uses: actions/github-script@v7
with:
script: |
core.exportVariable('ACTIONS_CACHE_URL', process.env.ACTIONS_CACHE_URL || '');
core.exportVariable('ACTIONS_RUNTIME_TOKEN', process.env.ACTIONS_RUNTIME_TOKEN || '');
- name: Add msbuild to PATH
uses: microsoft/setup-msbuild@v1
uses: microsoft/setup-msbuild@v2
- name: Fetch OpenBLAS
- name: Install OpenBLAS and pkgconfiglite
if: matrix.blas == 'ON'
run: |
C:/msys64/usr/bin/wget.exe -qO blas.zip ${{ matrix.obzip }}
7z x blas.zip -oblas -y
copy blas/include/cblas.h .
copy blas/include/openblas_config.h .
echo "OPENBLAS_PATH=$env:GITHUB_WORKSPACE/blas" >> $env:GITHUB_ENV
vcpkg install --triplet=${{ matrix.s2arc }}-windows openblas
choco install pkgconfiglite
- name: Fetch SDL2 and set SDL2_DIR
if: matrix.sdl2 == 'ON'
@ -251,9 +419,10 @@ jobs:
- name: Configure
run: >
cmake -S . -B ./build -A ${{ matrix.arch }}
-DCMAKE_TOOLCHAIN_FILE="$env:VCPKG_INSTALLATION_ROOT/scripts/buildsystems/vcpkg.cmake"
-DCMAKE_BUILD_TYPE=${{ matrix.build }}
-DWHISPER_OPENBLAS=${{ matrix.blas }}
-DCMAKE_LIBRARY_PATH="$env:OPENBLAS_PATH/lib"
-DGGML_BLAS=${{ matrix.blas }}
-DGGML_BLAS_VENDOR=OpenBLAS
-DWHISPER_SDL2=${{ matrix.sdl2 }}
- name: Build
@ -261,9 +430,9 @@ jobs:
cd ./build
msbuild ALL_BUILD.vcxproj -t:build -p:configuration=${{ matrix.build }} -p:platform=${{ matrix.arch }}
- name: Copy libopenblas.dll
- name: Copy openblas.dll
if: matrix.blas == 'ON'
run: copy "$env:OPENBLAS_PATH/bin/libopenblas.dll" build/bin/${{ matrix.build }}
run: copy "C:/vcpkg/packages/openblas_${{ matrix.s2arc }}-windows/bin/openblas.dll" build/bin/${{ matrix.build }}
- name: Copy SDL2.dll
if: matrix.sdl2 == 'ON'
@ -271,13 +440,13 @@ jobs:
- name: Upload binaries
if: matrix.blas == 'ON' && matrix.sdl2 == 'ON'
uses: actions/upload-artifact@v1
uses: actions/upload-artifact@v4
with:
name: whisper-blas-bin-${{ matrix.arch }}
path: build/bin/${{ matrix.build }}
windows-cublas:
runs-on: windows-latest
runs-on: windows-2019
strategy:
matrix:
@ -285,22 +454,25 @@ jobs:
arch: [x64]
cublas: [ON]
sdl2: [ON]
cuda-toolkit: [12.2.0, 11.8.0]
include:
- arch: x64
s2arc: x64
- sdl2: ON
s2ver: 2.26.0
s2ver: 2.28.5
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Add msbuild to PATH
uses: microsoft/setup-msbuild@v1
uses: microsoft/setup-msbuild@v2
- name: Install CUDA Toolkit
id: cuda-toolkit
uses: Jimver/cuda-toolkit@v0.2.10
uses: Jimver/cuda-toolkit@v0.2.15
with:
cuda: '${{ matrix.cuda-toolkit }}'
- name: Fetch SDL2 and set SDL2_DIR
if: matrix.sdl2 == 'ON'
@ -313,12 +485,20 @@ jobs:
run: >
cmake -S . -B ./build -A ${{ matrix.arch }}
-DCMAKE_BUILD_TYPE=${{ matrix.build }}
-DWHISPER_CUBLAS=1
-DGGML_CUDA=${{ matrix.cublas }}
-DWHISPER_SDL2=${{ matrix.sdl2 }}
- name: Build
- name: Build ${{ matrix.cuda-toolkit }}
run: |
cd ./build
msbuild ALL_BUILD.vcxproj -t:build -p:configuration=${{ matrix.build }} -p:platform=${{ matrix.arch }}
cmake --build . --config ${{ matrix.build }}
- name: Copy CUDA DLLs
run: >
Copy-Item -PassThru
-Path "${{ steps.cuda-toolkit.outputs.CUDA_PATH }}/bin/*.dll"
-Include cudart64_*,cublas64_*,cublasLt64_*
-Destination build/bin/${{ matrix.build }}
- name: Copy SDL2.dll
if: matrix.sdl2 == 'ON'
@ -326,9 +506,9 @@ jobs:
- name: Upload binaries
if: matrix.sdl2 == 'ON'
uses: actions/upload-artifact@v1
uses: actions/upload-artifact@v4
with:
name: whisper-cublas-bin-${{ matrix.arch }}
name: whisper-cublas-${{ matrix.cuda-toolkit }}-bin-${{ matrix.arch }}
path: build/bin/${{ matrix.build }}
emscripten:
@ -340,10 +520,10 @@ jobs:
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup emsdk
uses: mymindstorm/setup-emsdk@v12
uses: mymindstorm/setup-emsdk@v14
- name: Verify
run: emcc -v
@ -362,7 +542,7 @@ jobs:
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Configure
run: |
@ -380,70 +560,106 @@ jobs:
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
with:
path: whisper
- name: Install Java
uses: actions/setup-java@v3
uses: actions/setup-java@v4
with:
distribution: zulu
java-version: 17
java-version: 21
- name: Setup Android SDK
uses: android-actions/setup-android@v2
uses: android-actions/setup-android@v3
- name: Build
run: |
cd examples/whisper.android
cd whisper/examples/whisper.android
./gradlew assembleRelease --no-daemon
java:
needs: [ 'windows' ]
runs-on: windows-latest
steps:
- uses: actions/checkout@v3
- name: Install Java
uses: actions/setup-java@v1
with:
java-version: 17
- name: Download Windows lib
uses: actions/download-artifact@v3
with:
name: win32-x86-64_whisper.dll
path: bindings/java/build/generated/resources/main/win32-x86-64
- name: Build
- name: Build with external ggml
run: |
models\download-ggml-model.cmd tiny.en
cd bindings/java
chmod +x ./gradlew
./gradlew build
export PATH_TO_GGML=$PWD/ggml
cd whisper/examples/whisper.android
./gradlew assembleRelease --no-daemon
- name: Upload jar
uses: actions/upload-artifact@v3
with:
name: whispercpp.jar
path: bindings/java/build/libs/whispercpp-*.jar
# TODO: disable because of following fail: https://github.com/ggerganov/whisper.cpp/actions/runs/11019444420/job/30627193602
# android_java:
# runs-on: ubuntu-latest
#
# steps:
# - name: Clone
# uses: actions/checkout@v4
#
# - name: set up JDK 11
# uses: actions/setup-java@v4
# with:
# java-version: '11'
# distribution: 'temurin'
# cache: gradle
#
# - name: Setup Android SDK
# uses: android-actions/setup-android@v3
# with:
# cmdline-tools-version: 9.0
#
# - name: Build
# run: |
# cd examples/whisper.android.java
# chmod +x ./gradlew
# ./gradlew assembleRelease
- name: Publish package
if: ${{ github.ref == 'refs/heads/master' }}
uses: gradle/gradle-build-action@v2.4.2
with:
arguments: publish
build-root-directory: bindings/java
env:
MAVEN_USERNAME: ${{ secrets.JIRA_USER }}
MAVEN_PASSWORD: ${{ secrets.JIRA_PASS }}
PGP_SECRET: ${{ secrets.GPG_PRIVATE_KEY }}
PGP_PASSPHRASE: ${{ secrets.GPG_PASSPHRASE }}
# TODO: disabled because of following fail: https://github.com/ggerganov/whisper.cpp/actions/runs/9686220096/job/26735899598
# java:
# needs: [ 'windows' ]
# runs-on: windows-latest
# steps:
# - uses: actions/checkout@v4
#
# - name: Install Java
# uses: actions/setup-java@v4
# with:
# distribution: zulu
# java-version: 20
#
# - name: Download Windows lib
# uses: actions/download-artifact@v4
# with:
# name: win32-x86-64_whisper.dll
# path: bindings/java/build/generated/resources/main/win32-x86-64
#
# - name: Build
# run: |
# models\download-ggml-model.cmd tiny.en
# cd bindings/java
# chmod +x ./gradlew
# ./gradlew build
#
# - name: Upload jar
# uses: actions/upload-artifact@v4
# with:
# name: whispercpp.jar
# path: bindings/java/build/libs/whispercpp-*.jar
#
# - name: Publish package
# if: ${{ github.ref == 'refs/heads/master' }}
# uses: gradle/gradle-build-action@v2.4.2
# with:
# arguments: publish
# build-root-directory: bindings/java
# env:
# MAVEN_USERNAME: ${{ secrets.JIRA_USER }}
# MAVEN_PASSWORD: ${{ secrets.JIRA_PASS }}
# PGP_SECRET: ${{ secrets.GPG_PRIVATE_KEY }}
# PGP_PASSPHRASE: ${{ secrets.GPG_PASSPHRASE }}
quantize:
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Test quantize
run: |

59
.github/workflows/docker.yml vendored Normal file
View File

@ -0,0 +1,59 @@
name: Publish Docker image
on:
pull_request:
push:
branches:
- master
jobs:
push_to_registry:
name: Push Docker image to Docker Hub
if: github.event.pull_request.draft == false
runs-on: ubuntu-latest
env:
COMMIT_SHA: ${{ github.sha }}
strategy:
matrix:
config:
- { tag: "main", dockerfile: ".devops/main.Dockerfile", platform: "linux/amd64,linux/arm64" }
#TODO: the cuda image keeps failing - disable for now
# https://github.com/ggerganov/whisper.cpp/actions/runs/11019444428/job/30602020339
#- { tag: "main-cuda", dockerfile: ".devops/main-cuda.Dockerfile", platform: "linux/amd64" }
steps:
- name: Check out the repo
uses: actions/checkout@v3
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.repository_owner }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push Docker image (versioned)
if: github.event_name == 'push'
uses: docker/build-push-action@v5
with:
context: .
push: true
platforms: ${{ matrix.config.platform }}
tags: "ghcr.io/${{ github.repository }}:${{ matrix.config.tag }}-${{ env.COMMIT_SHA }}"
file: ${{ matrix.config.dockerfile }}
- name: Build and push Docker image (tagged)
uses: docker/build-push-action@v4
with:
context: .
push: ${{ github.event_name == 'push' }}
platforms: ${{ matrix.config.platform }}
tags: "ghcr.io/${{ github.repository }}:${{ matrix.config.tag }}"
file: ${{ matrix.config.dockerfile }}

View File

@ -37,7 +37,7 @@ jobs:
run: npm install
- name: Compile addon.node
run: npx cmake-js compile -T whisper-addon -B Release
run: npx cmake-js compile -T addon.node -B Release
- name: Download test model
run: |

24
.gitignore vendored
View File

@ -3,20 +3,21 @@
.cache/
.coreml/
.test/
.venv/
.vs/
.vscode/
.DS_Store
.vimspector.json
/CMakeSettings.json
/talk-llama.dSYM/
build/
build-em/
build-debug/
build-release/
build-rwdi/
build-static/
build-cublas/
build-no-accel/
build-sanitize-addr/
build-sanitize-thread/
build-*/
# SPM
.build/
.swiftpm
*.metallib
/main
/stream
@ -25,6 +26,7 @@ build-sanitize-thread/
/talk-llama
/bench
/quantize
/server
/lsp
arm_neon.h
@ -48,3 +50,7 @@ bindings/java/.idea/
.idea/
benchmark_results.csv
cmake-build-debug/
.cxx/
.gradle/
local.properties

3
.gitmodules vendored
View File

@ -1,3 +0,0 @@
[submodule "bindings/ios"]
path = bindings/ios
url = https://github.com/ggerganov/whisper.spm

301
AUTHORS Normal file
View File

@ -0,0 +1,301 @@
# date: Tue Apr 9 20:27:03 EEST 2024
# this file is auto-generated by scripts/gen-authors.sh
0/0 <zero@imaskeleton.me>
0cc4m <picard12@live.de>
0xsourcecode <134374803+0xsourcecode@users.noreply.github.com>
AT <manyoso@users.noreply.github.com>
Aarni Koskela <akx@iki.fi>
Aaron Pham <29749331+aarnphm@users.noreply.github.com>
Aaron Taylor <aaron@exphat.com>
Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Abitofevrything <54505189+abitofevrything@users.noreply.github.com>
AfryMask <AfryMask@163.com>
Ahmad Bilal <ahmad.bilal@empglabs.com>
AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
Akash Mahajan <akash7190@gmail.com>
Akash Mahajan <akashmjn@stanford.edu>
Al Hoang <3811822-hoanga@users.noreply.gitlab.com>
Alan <unknown>
Aleksander Andrzejewski <18704749+aleksanderandrzejewski@users.noreply.github.com>
Alex Azarov <alex@azarov.by>
Alex Bacart <13940752+alex-bacart@users.noreply.github.com>
Alex Evgrashin <aevgrashin@yandex.ru>
Alexandr Graschenkov <alexandr.graschenkov91@gmail.com>
Alexandru Mariuti <alex@mariuti.com>
Alexey Kharlamov <alexey@kharlamov.biz>
Alfredo Montesinos <alfredo.montesinos@g.austincc.edu>
Ali Alameh <ali.alameh@isae.edu.lb>
Ananta Bastola <anantarajbastola@gmail.com>
Andreu Huguet <andreuhuguet@gmail.com>
Andrew Huynh <a5thuynh@gmail.com>
Andrew S <andrews54757@gmail.com>
Andy Maloney <asmaloney@gmail.com>
Anton Kostin <masguit42@users.noreply.github.com>
Artyom Mezin <psycho.fading@gmail.com>
Asad Memon <asad.lionpk@gmail.com>
Ashraful Islam <ashraful.meche@gmail.com>
AsukaMinato <asukaminato@nyan.eu.org>
AustinMroz <austinmroz@utexas.edu>
Avik Sengupta <avik@sengupta.net>
Bader-eddine Ouaich <49657842+baderouaich@users.noreply.github.com>
Baffin Lee <baffinlee@gmail.com>
Ben Nortier <bjnortier@gmail.com>
Benjamin Heiniger <benjamin.heiniger@bluewin.ch>
Bo-Yi Wu <appleboy.tw@gmail.com>
Boris Bliznioukov <blib@mail.com>
Borislav Stanimirov <b.stanimirov@abv.bg>
Brad Murray <59848399+bradmurray-dt@users.noreply.github.com>
Brian Murray <brian@bmurray.ca>
CRD716 <crd716@gmail.com>
Canis Lupus <Canis-UK@users.noreply.github.com>
Carolinabanana <140120812+Carolinabanana@users.noreply.github.com>
ChangSeok Oh <shivamidow@users.noreply.github.com>
Chaoqun <27287694+OpenWaygate@users.noreply.github.com>
Chia-Hsiang Cheng <88014292+garychia@users.noreply.github.com>
Chidi Williams <williamschidi1@gmail.com>
Christian <12550267+iceychris@users.noreply.github.com>
Clifford Heath <clifford.heath@gmail.com>
Colin <github@whoisc.cc>
DGdev91 <DGdev91@users.noreply.github.com>
Damian Czaja <trojan295@protonmail.com>
Daniel Bevenius <daniel.bevenius@gmail.com>
David <dnhkng@gmail.com>
David Thorpe <djt@mutablelogic.com>
Davidson Francis <davidsondfgl@gmail.com>
Dener Stassun <denerstassun@gmail.com>
Didzis Gosko <didzis@users.noreply.github.com>
Digipom <admin@digipom.com>
Dimo <dimo@ieee.org>
Dody Suria Wijaya <dodysw@gmail.com>
Dr. Tom Murphy VII Ph.D <499244+tom7@users.noreply.github.com>
Duncan McConnell <ddmcconnell4@gmail.com>
Egor Egorov <me@egorfine.com>
Elkana Bardugo <ttv200@gmail.com>
Emmanuel Schmidbauer <eschmidbauer@gmail.com>
Engininja2 <139037756+Engininja2@users.noreply.github.com>
Eric Swanson <eswanson@alloscomp.com>
Eric Tendian <erictendian@gmail.com>
Erik Scholz <Green-Sky@users.noreply.github.com>
Evan Jones <evan.q.jones@gmail.com>
Evan Martin <evan.martin@gmail.com>
Eve <139727413+netrunnereve@users.noreply.github.com>
Evgeny Kuznetsov <evgeny@kuznetsov.md>
F1L1P <78918286+F1L1Pv2@users.noreply.github.com>
Fangjun Kuang <csukuangfj@gmail.com>
Felix <stenbackfelix@gmail.com>
Finn Voorhees <finnvoorhees@gmail.com>
FlippFuzz <41221030+FlippFuzz@users.noreply.github.com>
Gang Chen <goncha@gmail.com>
Gavin Cai <gavin1818@hotmail.com>
George Hindle <george@georgehindle.com>
Georgi Gerganov <ggerganov@gmail.com>
GitAritron <103900385+GitAritron@users.noreply.github.com>
GiviMAD <GiviMAD@users.noreply.github.com>
Gleicon Moraes <gleicon@gmail.com>
Gregor Jasny <gjasny@googlemail.com>
Guillaume Wenzek <gwenzek@users.noreply.github.com>
HY. Kelvin Lee <34256578+hykelvinlee42@users.noreply.github.com>
Halalaluyafail3 <55773281+Halalaluyafail3@users.noreply.github.com>
Hang <bebound@gmail.com>
Herman Semenov <GermanAizek@yandex.ru>
Hrishikesh Barman <geekodour@users.noreply.github.com>
Ian Bicking <ian@ianbicking.org>
Ian Bull <irbull@eclipsesource.com>
Ikko Ashimine <eltociear@gmail.com>
InconsolableCellist <23345188+InconsolableCellist@users.noreply.github.com>
Ismatulla Mansurov <47342870+sapoepsilon@users.noreply.github.com>
Ivan Gorin <ivangorin21@gmail.com>
JJ <103335846+computerscienceiscool@users.noreply.github.com>
Jack Mousseau <jmousseau@users.noreply.github.com>
JacobLinCool <jacoblincool@gmail.com>
Jakub Ráček <blizzcz@gmail.com>
Jared Van Bortel <jared@nomic.ai>
Jay Binks <jaybinks@gmail.com>
Jhen-Jie Hong <developer@jhen.me>
Jhen-Jie Hong <iainst0409@gmail.com>
JidongZhang-THU <1119708529@qq.com>
Jo Liss <joliss42@gmail.com>
Johan <jr.raffin@gmail.com>
Johannes Gäßler <johannesg@5d6.de>
John Balis <phobossystems@gmail.com>
Jonathan Soo <jcsoo@agora.com>
Jonno <1160532+razodactyl@users.noreply.github.com>
Joonas Pihlajamaa <joonas.pihlajamaa@iki.fi>
Jose <34888496+Jerry-Master@users.noreply.github.com>
Josh Bleecher Snyder <josharian@gmail.com>
Judd <foldl@users.noreply.github.com>
Jumper775 <78500318+jumpers775@users.noreply.github.com>
Justine Tunney <jtunney@gmail.com>
KP Kaiser <kirk@zothcorp.com>
Kamilake <exjang0@gmail.com>
Kartik Saranathan <278928+Kartiku@users.noreply.github.com>
Kasumi <90275229+kasumi-1@users.noreply.github.com>
Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Kevin Brothaler <admin@digipom.com>
Konstantin Zhuravlyov <konstantin.zhuravlyov@amd.com>
Kreijstal <rainb@tfwno.gf>
Kylin <56434533+KyL0N@users.noreply.github.com>
LBlue <153975653+lbluep@users.noreply.github.com>
Larry Battle <larry.battle.tech@gmail.com>
Laytan Laats <laytanlaats@hotmail.com>
Leo Moll <leo.moll@yeasoft.com>
Lexevolution <31176843+Lexevolution@users.noreply.github.com>
LittleLoli <26589867+WhichWho@users.noreply.github.com>
Lucas Zanek <57494138+LucasZNK@users.noreply.github.com>
Luis Herrera <herrera-luis@users.noreply.github.com>
Lukas Rist <glaslos@gmail.com>
M. A. Ali <73258591+MightyStud@users.noreply.github.com>
M. Eren Akbiyik <erenakbiyik@gmail.com>
Maciek <maciek.mab122@gmail.com>
Marcin Mielniczuk <marmistrz.dev@zoho.eu>
Martin Warnaar <martinwarnaar@gmail.com>
Matheus de Sousa <23645013+keyehzy@users.noreply.github.com>
Mathijs de Bruin <mathijs@mathijsfietst.nl>
Matija Pevec <mightymatth@users.noreply.github.com>
Maximiliano Levi <8160966+maxilevi@users.noreply.github.com>
Meng, Hengyu <hengyu.meng@intel.com>
Michael Podvitskiy <podvitskiymichael@gmail.com>
Michael Rienstra <mrienstra@gmail.com>
Mikhail Grigorev <sleuthhound@gmail.com>
Mohammadreza Hendiani <hendiani.mohammadreza@gmail.com>
Mohit Agarwal <mohit@sdf.org>
Murilo Santana <mvrilo@gmail.com>
Neil Chudleigh <nchudleigh@users.noreply.github.com>
Neo Zhang Jianyu <jianyu.zhang@intel.com>
Neuman Vong <neuman.vong@gmail.com>
Nicholas Albion <nalbion@yahoo.com>
Niels Mayer <Niels.Mayer@gmail.com>
Okabintaro <103938900+Okabintaro@users.noreply.github.com>
Oleg Sidorov <me@whitebox.io>
Oleg Sidorov <oleg@sidorov.nl>
Ondrej Kokes <ondrej.kokes@gmail.com>
Ouadie EL FAROUKI <ouadie.elfarouki@codeplay.com>
Paul Tsochantaris <ptsochantaris@icloud.com>
Philipp Zabel <philipp.zabel@gmail.com>
Philippe Normand <phil@base-art.net>
Przemysław Pawełczyk <przemoc@gmail.com>
Qianhe Chen <54462604+chenqianhe@users.noreply.github.com>
Radosław Gryta <radek.gryta@gmail.com>
Reinforce-II <fate@eastal.com>
Reinis Muiznieks <muiznieks.reinis@gmail.com>
RelatedTitle <r3latedtitle@gmail.com>
RhinoDevel <RhinoDevel@users.noreply.github.com>
Rich Jones <miserlou@gmail.com>
Robin <robin.xw@hotmail.com>
Roddur Dasgupta <roddurd@gmail.com>
Roland Rabien <figbug@gmail.com>
Rotem Dan <rotemdan@gmail.com>
Ryan Hitchman <hitchmanr@gmail.com>
Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
RyanChang <ftes90015@gmail.com>
Sam <49637763+Onlyartist9@users.noreply.github.com>
Sam Pullara <spullara@gmail.com>
Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Sergio López <slp@sinrega.org>
Siddharth Ramakrishnan <srr2141@columbia.edu>
Simon Moisselin <simon.moisstoll@gmail.com>
Sindre Sorhus <sindresorhus@gmail.com>
Slava Primenko <primenko.s@gmail.com>
Syahmi Azhar <prsyahmi@gmail.com>
Syed Jafri <syedjafri97@gmail.com>
Sơn Phan Trung <phantrungson17@gmail.com>
Taisei Mima <bhbstar.me@gmail.com>
Takeshi Inoue <inoue.takeshi@gmail.com>
Tamotsu Takahashi <ttakah+github@gmail.com>
Taras Glek <taras@thegp.com>
Tauseef Mohiuddin <35351464+tauseefmohammed2@users.noreply.github.com>
Thijs Raymakers <thijs@raymakers.nl>
Thomas Fitzsimmons <fitzsim@fitzsim.org>
Tiago Fassoni <tiagofassoni@users.noreply.github.com>
Tienshiao Ma <tienshiao@tienshiao.org>
Timothy Cronin <40186632+4imothy@users.noreply.github.com>
Tobrun <tobrun.van.nuland@gmail.com>
Todd <taf2@users.noreply.github.com>
Tong Li <31761981+litongjava@users.noreply.github.com>
Topping1 <78745143+Topping1@users.noreply.github.com>
Travis Cline <travis.cline@gmail.com>
UEXTM.com <84163508+uextm@users.noreply.github.com>
Vadim Peretokin <vperetokin@hey.com>
Valentin Gosu <1454649+valenting@users.noreply.github.com>
Vulcan <93451215+trholding@users.noreply.github.com>
WhiteOlivierus <36532695+WhiteOlivierus@users.noreply.github.com>
Xiang (Kevin) Li <kevinli020508@gmail.com>
Xiao-Yong Jin <jinxiaoyong@gmail.com>
XiaotaoChen <chenxiaotao1234@gmail.com>
Yajing Tang <phillis@google.com>
Yang Shen <aplshenyang@gmail.com>
Yunès <jean.baptiste.yunes@free.fr>
ZaBlazzingZephyrus <119159668+blazingzephyr@users.noreply.github.com>
Zigfrid Zvezdin <ziggerZZ@gmail.com>
Zollner <24618122+Zolliner@users.noreply.github.com>
ai-at-home <149282006+ai-at-home@users.noreply.github.com>
alonfaraj <alonfaraj@gmail.com>
andypayne <apayne@gmail.com>
ardfork <134447697+ardfork@users.noreply.github.com>
automaticcat <daogiatuank54@gmail.com>
be-next <jerome.ramette@gmail.com>
bert hubert <bert@hubertnet.nl>
bmwl <brian.marshall@tolko.com>
bobqianic <129547291+bobqianic@users.noreply.github.com>
bocytko <bocytko+github@gmail.com>
boolemancer <48014766+boolemancer@users.noreply.github.com>
boolemancer <boolemancer@gmail.com>
bradmit <151883577+bradmit@users.noreply.github.com>
brunofaustino <b.fa.amorim@gmail.com>
bssrdf <merlintiger@hotmail.com>
byte-6174 <88070277+byte-6174@users.noreply.github.com>
cdosoftei <ciprian.dosoftei@gmail.com>
clach04 <Chris.Clark@actian.com>
compilade <113953597+compilade@users.noreply.github.com>
conradg <conradjgodfrey@gmail.com>
ddpasa <112642920+ddpasa@users.noreply.github.com>
denersc <denerstassun@gmail.com>
dscripka <dscripka@users.noreply.github.com>
duthils <duthils@duthils.net>
ecneladis <ecneladis@users.noreply.github.com>
faker <nspyia2002@gmail.com>
fitzsim <fitzsim@fitzsim.org>
fraxy-v <65565042+fraxy-v@users.noreply.github.com>
genevera (she/her) <genevera@users.noreply.github.com>
geniusnut <geniusnut@gmail.com>
greeshmay <greeshmay@gmail.com>
hydai <z54981220@gmail.com>
iamthad <thadeus.j.fleming@gmail.com>
james wolf <contractorwolf@hotmail.com>
joecryptotoo <80373433+joecryptotoo@users.noreply.github.com>
jorismertz <35079666+jorismertz@users.noreply.github.com>
junkfood <69683722+JunkFood02@users.noreply.github.com>
jwijffels <jwijffels@bnosac.be>
kamranjon <kamranjon@gmail.com>
katsu560 <katsu560oo-@docomo.ne.jp>
kennethge <57784063+kenneth-ge@users.noreply.github.com>
keyehzy <msamuel@aluno.puc-rio.br>
leejet <leejet714@gmail.com>
litong <31761981+litongjava@users.noreply.github.com>
lnyan <lkwq007@gmail.com>
m.bell <m.bell@techsmith.com>
mkiol <mkiol@users.noreply.github.com>
novag <7754358+novag@users.noreply.github.com>
pajowu <pajowu@pajowu.de>
polarmoon <90010972+polarmoon@users.noreply.github.com>
rlapray <lapray.romain@gmail.com>
sandrohanea <40202887+sandrohanea@users.noreply.github.com>
semiformal-net <84111142+semiformal-net@users.noreply.github.com>
shibukazu <61775791+shibukazu@users.noreply.github.com>
shikokuchuo <53399081+shikokuchuo@users.noreply.github.com>
slaren <slarengh@gmail.com>
slashlib <slashlib@users.noreply.github.com>
snadampal <87143774+snadampal@users.noreply.github.com>
st-gr <38470677+st-gr@users.noreply.github.com>
texmex76 <40733439+texmex76@users.noreply.github.com>
thefinaldegree <thefinaldegree@gmail.com>
trixirt <trix@redhat.com>
ulatekh <ulatekh@yahoo.com>
undef <undefdev@gmail.com>
venkr <venkateshrameshkumar+1@gmail.com>
vicalloy <zbirder@gmail.com>
xdrudis <xavierdrudis@yahoo.es>
zhouwg <6889919+zhouwg@users.noreply.github.com>
布客飞龙 <562826179@qq.com>
Артём Земляк <azemlyak@smart-consulting.ru>

View File

@ -1,21 +1,31 @@
cmake_minimum_required (VERSION 3.5)
cmake_minimum_required(VERSION 3.5) # for add_link_options and implicit target directories.
project("whisper.cpp" C CXX)
project("whisper.cpp" VERSION 1.7.2)
include(CheckIncludeFileCXX)
project(whisper.cpp VERSION 1.4.2)
set(SOVERSION 1)
#set(CMAKE_WARN_DEPRECATED YES)
set(CMAKE_WARN_UNUSED_CLI YES)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
if (NOT XCODE AND NOT MSVC AND NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo")
endif()
# Add path to modules
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/")
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
if(CMAKE_SOURCE_DIR STREQUAL CMAKE_CURRENT_SOURCE_DIR)
if (CMAKE_SOURCE_DIR STREQUAL CMAKE_CURRENT_SOURCE_DIR)
set(WHISPER_STANDALONE ON)
include(GitVars)
include(BuildTypes)
include(git-vars)
# configure project version
if (EXISTS "${CMAKE_SOURCE_DIR}/bindings/ios/Makefile-tmpl")
configure_file(${CMAKE_SOURCE_DIR}/bindings/ios/Makefile-tmpl ${CMAKE_SOURCE_DIR}/bindings/ios/Makefile @ONLY)
endif()
configure_file(${CMAKE_SOURCE_DIR}/bindings/javascript/package-tmpl.json ${CMAKE_SOURCE_DIR}/bindings/javascript/package.json @ONLY)
else()
set(WHISPER_STANDALONE OFF)
@ -25,6 +35,11 @@ if (EMSCRIPTEN)
set(BUILD_SHARED_LIBS_DEFAULT OFF)
option(WHISPER_WASM_SINGLE_FILE "whisper: embed WASM inside the generated whisper.js" ON)
# TODO: without these, we get the following error:
# wasm-ld: error: --shared-memory is disallowed by whisper.cpp.o because it was not compiled with 'atomics' or 'bulk-memory' features.
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -pthread -s TOTAL_STACK=5242880")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread -s TOTAL_STACK=5242880")
else()
if (MINGW)
set(BUILD_SHARED_LIBS_DEFAULT OFF)
@ -33,531 +48,136 @@ else()
endif()
endif()
# options
option(BUILD_SHARED_LIBS "build shared libraries" ${BUILD_SHARED_LIBS_DEFAULT})
if (APPLE)
set(WHISPER_METAL_DEFAULT ON)
else()
set(WHISPER_METAL_DEFAULT OFF)
endif()
#
# option list
#
option(BUILD_SHARED_LIBS "whisper: build shared libs" ${BUILD_SHARED_LIBS_DEFAULT})
# general
option(WHISPER_CCACHE "whisper: use ccache if available" ON)
# debug
option(WHISPER_ALL_WARNINGS "whisper: enable all compiler warnings" ON)
option(WHISPER_ALL_WARNINGS_3RD_PARTY "whisper: enable all compiler warnings in 3rd party libs" OFF)
option(WHISPER_SANITIZE_THREAD "whisper: enable thread sanitizer" OFF)
option(WHISPER_SANITIZE_ADDRESS "whisper: enable address sanitizer" OFF)
option(WHISPER_SANITIZE_UNDEFINED "whisper: enable undefined sanitizer" OFF)
option(WHISPER_BUILD_TESTS "whisper: build tests" ${WHISPER_STANDALONE})
option(WHISPER_BUILD_EXAMPLES "whisper: build examples" ${WHISPER_STANDALONE})
option(WHISPER_SDL2 "whisper: support for libSDL2" OFF)
option(WHISPER_NO_AVX "whisper: disable AVX" OFF)
option(WHISPER_NO_AVX2 "whisper: disable AVX2" OFF)
option(WHISPER_NO_FMA "whisper: disable FMA" OFF)
option(WHISPER_NO_F16C "whisper: disable F16c" OFF)
option(WHISPER_OPENVINO "whisper: support for OpenVINO" OFF)
if (APPLE)
option(WHISPER_NO_ACCELERATE "whisper: disable Accelerate framework" OFF)
option(WHISPER_METAL "whisper: use Metal" ${WHISPER_METAL_DEFAULT})
option(WHISPER_METAL_NDEBUG "whisper: disable Metal debugging" OFF)
option(WHISPER_COREML "whisper: enable Core ML framework" OFF)
option(WHISPER_COREML_ALLOW_FALLBACK "whisper: allow non-CoreML fallback" OFF)
else()
option(WHISPER_BLAS "whisper: use BLAS libraries" OFF)
option(WHISPER_BLAS_VENDOR "whisper: BLAS library vendor" Generic)
option(WHISPER_OPENBLAS "whisper: prefer OpenBLAS" OFF)
option(WHISPER_CUBLAS "whisper: support for cuBLAS" OFF)
option(WHISPER_HIPBLAS "whisper: support for hipBLAS" OFF)
option(WHISPER_CLBLAST "whisper: use CLBlast" OFF)
endif()
option(WHISPER_PERF "whisper: enable perf timings" OFF)
# build
option(WHISPER_FATAL_WARNINGS "whisper: enable -Werror flag" OFF)
# sanitizers
option(WHISPER_SANITIZE_THREAD "whisper: enable thread sanitizer" OFF)
option(WHISPER_SANITIZE_ADDRESS "whisper: enable address sanitizer" OFF)
option(WHISPER_SANITIZE_UNDEFINED "whisper: enable undefined sanitizer" OFF)
if (NOT MSVC)
if (WHISPER_SANITIZE_THREAD)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fsanitize=thread")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread")
endif()
# extra artifacts
option(WHISPER_BUILD_TESTS "whisper: build tests" ${WHISPER_STANDALONE})
option(WHISPER_BUILD_EXAMPLES "whisper: build examples" ${WHISPER_STANDALONE})
option(WHISPER_BUILD_SERVER "whisper: build server example" ${WHISPER_STANDALONE})
if (WHISPER_SANITIZE_ADDRESS)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fsanitize=address -fno-omit-frame-pointer")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address -fno-omit-frame-pointer")
endif()
# 3rd party libs
option(WHISPER_CURL "whisper: use libcurl to download model from an URL" OFF)
option(WHISPER_SDL2 "whisper: support for libSDL2" OFF)
if (WHISPER_SANITIZE_UNDEFINED)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fsanitize=undefined")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=undefined")
endif()
endif()
#set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -ffast-math")
#set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=native")
# dependencies
find_package(Threads REQUIRED)
# on APPLE
if (APPLE)
# include Accelerate framework
if (NOT WHISPER_NO_ACCELERATE)
find_library(ACCELERATE_FRAMEWORK Accelerate)
if (ACCELERATE_FRAMEWORK)
message(STATUS "Accelerate framework found")
set(WHISPER_EXTRA_LIBS ${WHISPER_EXTRA_LIBS} ${ACCELERATE_FRAMEWORK})
set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DGGML_USE_ACCELERATE)
else()
message(FATAL_ERROR "Accelerate framework not found")
endif()
endif()
if (WHISPER_METAL)
find_library(FOUNDATION_LIBRARY Foundation REQUIRED)
find_library(METAL_FRAMEWORK Metal REQUIRED)
find_library(METALKIT_FRAMEWORK MetalKit REQUIRED)
if (METAL_FRAMEWORK)
message(STATUS "Metal framework found")
set(WHISPER_EXTRA_LIBS ${WHISPER_EXTRA_LIBS}
${FOUNDATION_LIBRARY}
${METAL_FRAMEWORK}
${METALKIT_FRAMEWORK}
)
set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DGGML_USE_METAL)
if (WHISPER_METAL_NDEBUG)
set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DGGML_METAL_NDEBUG)
endif()
else()
message(FATAL_ERROR "Metal framework not found")
endif()
set(GGML_SOURCES_METAL ggml-metal.m ggml-metal.h)
# copy ggml-metal.metal to bin directory
configure_file(ggml-metal.metal bin/ggml-metal.metal COPYONLY)
endif()
if (WHISPER_COREML)
find_library(FOUNDATION_FRAMEWORK Foundation)
find_library(COREML_FRAMEWORK CoreML)
if (COREML_FRAMEWORK)
message(STATUS "CoreML framework found")
set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DWHISPER_USE_COREML)
else()
message(FATAL_ERROR "CoreML framework not found")
endif()
if (WHISPER_COREML_ALLOW_FALLBACK)
set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DWHISPER_COREML_ALLOW_FALLBACK)
endif()
endif()
endif()
if (WHISPER_OPENBLAS)
set(WHISPER_BLAS_VENDOR "OpenBLAS")
set(WHISPER_BLAS ON)
endif()
if (WHISPER_BLAS)
if (WIN32)
if(DEFINED ENV{OPENBLAS_PATH})
set(BLAS_LIBRARIES $ENV{OPENBLAS_PATH}/lib/libopenblas.dll.a)
message(STATUS "Libraries ${BLAS_LIBRARIES}")
set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DGGML_USE_OPENBLAS)
include_directories($ENV{OPENBLAS_PATH}/include)
set(WHISPER_EXTRA_LIBS ${WHISPER_EXTRA_LIBS} ${BLAS_LIBRARIES})
else ()
message(FATAL_ERROR "BLAS library was not found. Environment variable OPENBLAS_PATH not defined.")
endif ()
else ()
set(BLA_STATIC 1)
set(BLA_VENDOR ${WHISPER_BLAS_VENDOR})
set(BLA_SIZEOF_INTEGER 8)
set(BLA_PREFER_PKGCONFIG 1)
find_package(BLAS)
if(BLAS_FOUND)
message(STATUS "BLAS compatible library found")
message(STATUS "Libraries ${BLAS_LIBRARIES}")
find_path(BLAS_INCLUDE_DIRS cblas.h /usr/include/openblas /usr/local/include/openblas $ENV{BLAS_HOME}/include)
set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DGGML_USE_OPENBLAS)
include_directories(${BLAS_INCLUDE_DIRS})
set(WHISPER_EXTRA_LIBS ${WHISPER_EXTRA_LIBS} ${BLAS_LIBRARIES})
else()
message(FATAL_ERROR "BLAS library was not found")
endif()
endif ()
endif ()
if (WHISPER_CUBLAS)
cmake_minimum_required(VERSION 3.17)
find_package(CUDAToolkit)
if (CUDAToolkit_FOUND)
message(STATUS "cuBLAS found")
enable_language(CUDA)
set(GGML_SOURCES_CUDA ggml-cuda.cu ggml-cuda.h)
add_compile_definitions(GGML_USE_CUBLAS)
if (WHISPER_STATIC)
set(WHISPER_EXTRA_LIBS ${WHISPER_EXTRA_LIBS} CUDA::cudart_static CUDA::cublas_static CUDA::cublasLt_static)
else()
set(WHISPER_EXTRA_LIBS ${WHISPER_EXTRA_LIBS} CUDA::cudart CUDA::cublas CUDA::cublasLt)
endif()
else()
message(FATAL_ERROR "cuBLAS not found")
endif()
endif()
if (WHISPER_HIPBLAS)
list(APPEND CMAKE_PREFIX_PATH /opt/rocm)
if (NOT ${CMAKE_C_COMPILER_ID} MATCHES "Clang")
message(WARNING "Only LLVM is supported for HIP, hint: CC=/opt/rocm/llvm/bin/clang")
endif()
if (NOT ${CMAKE_CXX_COMPILER_ID} MATCHES "Clang")
message(WARNING "Only LLVM is supported for HIP, hint: CXX=/opt/rocm/llvm/bin/clang++")
endif()
find_package(hip)
find_package(hipblas)
find_package(rocblas)
if (${hipblas_FOUND} AND ${hip_FOUND})
message(STATUS "HIP and hipBLAS found")
add_compile_definitions(GGML_USE_HIPBLAS GGML_USE_CUBLAS)
add_library(ggml-rocm OBJECT ggml-cuda.cu ggml-cuda.h)
set_property(TARGET ggml-rocm PROPERTY POSITION_INDEPENDENT_CODE ON)
set_source_files_properties(ggml-cuda.cu PROPERTIES LANGUAGE CXX)
target_link_libraries(ggml-rocm PRIVATE hip::device PUBLIC hip::host roc::rocblas roc::hipblas)
if (WHISPER_STATIC)
message(FATAL_ERROR "Static linking not supported for HIP/ROCm")
endif()
set(WHISPER_EXTRA_LIBS ${WHISPER_EXTRA_LIBS} ggml-rocm)
else()
message(FATAL_ERROR "hipBLAS or HIP not found. Try setting CMAKE_PREFIX_PATH=/opt/rocm")
endif()
endif()
if (WHISPER_CLBLAST)
find_package(CLBlast)
if (CLBlast_FOUND)
message(STATUS "CLBlast found")
set(GGML_SOURCES_OPENCL ggml-opencl.cpp ggml-opencl.h)
add_compile_definitions(GGML_USE_CLBLAST)
set(WHISPER_EXTRA_LIBS ${WHISPER_EXTRA_LIBS} clblast)
else()
message(FATAL_ERROR "CLBlast not found")
endif()
endif()
if( WHISPER_OPENVINO )
find_package(OpenVINO REQUIRED COMPONENTS Runtime)
endif()
# compiler flags
if (NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "RelWithDebInfo")
endif ()
if (WHISPER_ALL_WARNINGS)
if (NOT MSVC)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} \
-Wall \
-Wextra \
-Wpedantic \
-Wshadow \
-Wcast-qual \
-Wstrict-prototypes \
-Wpointer-arith \
-Wno-unused-function \
")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} \
-Wall \
-Wextra \
-Wpedantic \
-Wcast-qual \
")
else()
# todo : msvc
endif()
endif()
if (NOT MSVC)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Werror=vla")
#set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fno-math-errno -ffinite-math-only -funsafe-math-optimizations")
endif()
message(STATUS "CMAKE_SYSTEM_PROCESSOR: ${CMAKE_SYSTEM_PROCESSOR}")
if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm" OR ${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64")
message(STATUS "ARM detected")
elseif(${CMAKE_SYSTEM_PROCESSOR} MATCHES "ppc64le")
message(STATUS "PowerPC detected")
else()
message(STATUS "x86 detected")
if (MSVC)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /utf-8")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /utf-8")
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} /utf-8")
if(NOT WHISPER_NO_AVX2)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX2")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /arch:AVX2")
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} /arch:AVX2")
else()
if(NOT WHISPER_NO_AVX)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /arch:AVX")
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} /arch:AVX")
endif()
endif()
else()
if (EMSCRIPTEN)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -pthread")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread")
else()
if(NOT WHISPER_NO_AVX)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mavx")
endif()
if(NOT WHISPER_NO_AVX2)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mavx2")
endif()
if(NOT WHISPER_NO_FMA)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mfma")
endif()
if(NOT WHISPER_NO_F16C)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mf16c")
endif()
endif()
endif()
endif()
#
# POSIX conformance
#
# clock_gettime came in POSIX.1b (1993)
# CLOCK_MONOTONIC came in POSIX.1-2001 / SUSv3 as optional
# posix_memalign came in POSIX.1-2001 / SUSv3
# M_PI is an XSI extension since POSIX.1-2001 / SUSv3, came in XPG1 (1985)
add_compile_definitions(_XOPEN_SOURCE=600)
# Somehow in OpenBSD whenever POSIX conformance is specified
# some string functions rely on locale_t availability,
# which was introduced in POSIX.1-2008, forcing us to go higher
if (CMAKE_SYSTEM_NAME MATCHES "OpenBSD")
remove_definitions(-D_XOPEN_SOURCE=600)
add_compile_definitions(_XOPEN_SOURCE=700)
endif()
# Data types, macros and functions related to controlling CPU affinity
# are available on Linux through GNU extensions in libc
if (CMAKE_SYSTEM_NAME MATCHES "Linux")
add_compile_definitions(_GNU_SOURCE)
option(WHISPER_FFMPEG "whisper: support building and linking with ffmpeg libs (avcodec, swresample, ...)" OFF)
endif()
# RLIMIT_MEMLOCK came in BSD, is not specified in POSIX.1,
# and on macOS its availability depends on enabling Darwin extensions
# similarly on DragonFly, enabling BSD extensions is necessary
if (CMAKE_SYSTEM_NAME MATCHES "Darwin")
add_compile_definitions(_DARWIN_C_SOURCE)
endif()
if (CMAKE_SYSTEM_NAME MATCHES "DragonFly")
add_compile_definitions(_DARWIN_C_SOURCE)
endif()
option(WHISPER_COREML "whisper: enable Core ML framework" OFF)
option(WHISPER_COREML_ALLOW_FALLBACK "whisper: allow non-CoreML fallback" OFF)
option(WHISPER_OPENVINO "whisper: support for OpenVINO" OFF)
# alloca is a non-standard interface that is not visible on BSDs when
# POSIX conformance is specified, but not all of them provide a clean way
# to enable it in such cases
if (CMAKE_SYSTEM_NAME MATCHES "FreeBSD")
add_compile_definitions(__BSD_VISIBLE)
endif()
if (CMAKE_SYSTEM_NAME MATCHES "NetBSD")
add_compile_definitions(_NETBSD_SOURCE)
endif()
if (CMAKE_SYSTEM_NAME MATCHES "OpenBSD")
add_compile_definitions(_BSD_SOURCE)
endif()
# Required for relocatable CMake package
include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/build-info.cmake)
if (WHISPER_PERF)
set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DGGML_PERF)
endif()
# override ggml options
set(GGML_CCACHE ${WHISPER_CCACHE})
set(GGML_SANITIZE_THREAD ${WHISPER_SANITIZE_THREAD})
set(GGML_SANITIZE_ADDRESS ${WHISPER_SANITIZE_ADDRESS})
set(GGML_SANITIZE_UNDEFINED ${WHISPER_SANITIZE_UNDEFINED})
set(GGML_ALL_WARNINGS ${WHISPER_ALL_WARNINGS})
set(GGML_FATAL_WARNINGS ${WHISPER_FATAL_WARNINGS})
#
# whisper.coreml - Core ML support
#
if (WHISPER_COREML)
set(TARGET whisper.coreml)
add_library(${TARGET}
coreml/whisper-encoder.h
coreml/whisper-encoder.mm
coreml/whisper-encoder-impl.h
coreml/whisper-encoder-impl.m
)
include(DefaultTargetOptions)
target_include_directories(${TARGET} PUBLIC
.
)
target_link_libraries(${TARGET} PRIVATE ${FOUNDATION_FRAMEWORK} ${COREML_FRAMEWORK})
set_target_properties(${TARGET} PROPERTIES
COMPILE_FLAGS "-fobjc-arc"
)
endif()
if (WHISPER_OPENVINO)
set(TARGET whisper.openvino)
add_library(${TARGET} OBJECT
openvino/whisper-openvino-encoder.h
openvino/whisper-openvino-encoder.cpp
)
target_include_directories(${TARGET} PUBLIC
.
)
set_property(TARGET ${TARGET} PROPERTY POSITION_INDEPENDENT_CODE ON)
set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DWHISPER_USE_OPENVINO)
target_link_libraries(${TARGET} PRIVATE openvino::runtime)
endif()
#
# whisper - this is the main library of the project
#
set(TARGET whisper)
add_library(${TARGET}
ggml.h
ggml.c
ggml-alloc.h
ggml-alloc.c
ggml-backend.h
ggml-backend.c
ggml-quants.h
ggml-quants.c
${GGML_SOURCES_METAL}
${GGML_SOURCES_CUDA}
${GGML_SOURCES_OPENCL}
whisper.h
whisper.cpp
)
include(DefaultTargetOptions)
target_include_directories(${TARGET} PUBLIC
.
)
if (WHISPER_COREML)
target_link_libraries(${TARGET} PRIVATE whisper.coreml)
endif()
if (WHISPER_OPENVINO)
target_link_libraries(${TARGET} PRIVATE whisper.openvino)
endif()
if (MSVC)
target_link_libraries(${TARGET} PRIVATE ${WHISPER_EXTRA_LIBS} ${CMAKE_THREAD_LIBS_INIT})
set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -D_CRT_SECURE_NO_WARNINGS)
else()
target_link_libraries(${TARGET} PRIVATE m ${WHISPER_EXTRA_LIBS} ${CMAKE_THREAD_LIBS_INIT})
endif()
if (BUILD_SHARED_LIBS)
target_link_libraries(${TARGET} PUBLIC
${CMAKE_DL_LIBS}
)
target_compile_definitions(${TARGET} PUBLIC
WHISPER_SHARED
GGML_SHARED
)
target_compile_definitions(${TARGET} PRIVATE
WHISPER_BUILD
GGML_BUILD
)
if (WHISPER_METAL)
# TODO: I think this should make ggml-metal.m "see" the ggml-metal.metal file from the "bin" directory
# but for some reason it does not work here like it does in llama.cpp
set_target_properties(${TARGET} PROPERTIES RESOURCE "${CMAKE_CURRENT_SOURCE_DIR}/ggml-metal.metal")
# transition helpers
function (whisper_option_depr TYPE OLD NEW)
if (${OLD})
message(${TYPE} "${OLD} is deprecated and will be removed in the future.\nUse ${NEW} instead\n")
set(${NEW} ON)
endif()
endfunction()
whisper_option_depr(FATAL_ERROR WHISPER_CUBLAS GGML_CUDA)
whisper_option_depr(WARNING WHISPER_CUDA GGML_CUDA)
whisper_option_depr(WARNING WHISPER_KOMPUTE GGML_KOMPUTE)
whisper_option_depr(WARNING WHISPER_METAL GGML_METAL)
whisper_option_depr(WARNING WHISPER_METAL_EMBED_LIBRARY GGML_METAL_EMBED_LIBRARY)
whisper_option_depr(WARNING WHISPER_NATIVE GGML_NATIVE)
whisper_option_depr(WARNING WHISPER_OPENMP GGML_OPENMP)
whisper_option_depr(WARNING WHISPER_RPC GGML_RPC)
whisper_option_depr(WARNING WHISPER_SYCL GGML_SYCL)
whisper_option_depr(WARNING WHISPER_SYCL_F16 GGML_SYCL_F16)
#
# build the library
#
if (NOT TARGET ggml)
add_subdirectory(ggml)
# ... otherwise assume ggml is added by a parent CMakeLists.txt
endif()
add_subdirectory(src)
if (GGML_SOURCES_CUDA)
message(STATUS "GGML CUDA sources found, configuring CUDA architecture")
set_property(TARGET whisper PROPERTY CUDA_ARCHITECTURES OFF)
set_property(TARGET whisper PROPERTY CUDA_SELECT_NVCC_ARCH_FLAGS "Auto")
endif()
if (EMSCRIPTEN)
set_target_properties(${TARGET} PROPERTIES COMPILE_FLAGS "-msimd128")
endif()
target_compile_definitions(${TARGET} PUBLIC
${WHISPER_EXTRA_FLAGS}
)
set_target_properties(${TARGET} PROPERTIES PUBLIC_HEADER "whisper.h")
#
# install
#
include(GNUInstallDirs)
include(CMakePackageConfigHelpers)
install(TARGETS ${TARGET}
LIBRARY DESTINATION lib
ARCHIVE DESTINATION lib/static
RUNTIME DESTINATION bin
RESOURCE DESTINATION bin
PUBLIC_HEADER DESTINATION include
)
set(WHISPER_BUILD_NUMBER ${BUILD_NUMBER})
set(WHISPER_BUILD_COMMIT ${BUILD_COMMIT})
set(WHISPER_INSTALL_VERSION ${CMAKE_PROJECT_VERSION})
#
# bindings
#
set(WHISPER_INCLUDE_INSTALL_DIR ${CMAKE_INSTALL_INCLUDEDIR} CACHE PATH "Location of header files")
set(WHISPER_LIB_INSTALL_DIR ${CMAKE_INSTALL_LIBDIR} CACHE PATH "Location of library files")
set(WHISPER_BIN_INSTALL_DIR ${CMAKE_INSTALL_BINDIR} CACHE PATH "Location of binary files")
add_subdirectory(bindings)
get_directory_property(WHISPER_TRANSIENT_DEFINES COMPILE_DEFINITIONS)
set_target_properties(whisper PROPERTIES PUBLIC_HEADER ${CMAKE_CURRENT_SOURCE_DIR}/include/whisper.h)
install(TARGETS whisper LIBRARY PUBLIC_HEADER)
configure_package_config_file(
${CMAKE_CURRENT_SOURCE_DIR}/cmake/whisper-config.cmake.in
${CMAKE_CURRENT_BINARY_DIR}/whisper-config.cmake
INSTALL_DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/whisper
PATH_VARS
WHISPER_INCLUDE_INSTALL_DIR
WHISPER_LIB_INSTALL_DIR
WHISPER_BIN_INSTALL_DIR )
write_basic_package_version_file(
${CMAKE_CURRENT_BINARY_DIR}/whisper-version.cmake
VERSION ${WHISPER_INSTALL_VERSION}
COMPATIBILITY SameMajorVersion)
install(FILES ${CMAKE_CURRENT_BINARY_DIR}/whisper-config.cmake
${CMAKE_CURRENT_BINARY_DIR}/whisper-version.cmake
DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/whisper)
configure_file(cmake/whisper.pc.in
"${CMAKE_CURRENT_BINARY_DIR}/whisper.pc"
@ONLY)
install(FILES "${CMAKE_CURRENT_BINARY_DIR}/whisper.pc"
DESTINATION lib/pkgconfig)
#
# programs, examples and tests
#
if (WHISPER_BUILD_TESTS AND NOT CMAKE_JS_VERSION)
enable_testing()
add_subdirectory(tests)
#include(CTest)
#add_subdirectory(tests)
endif ()
if (WHISPER_BUILD_EXAMPLES)

View File

@ -1,6 +1,6 @@
MIT License
Copyright (c) 2023 Georgi Gerganov
Copyright (c) 2023-2024 The ggml authors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

1230
Makefile

File diff suppressed because it is too large Load Diff

62
Package.swift Normal file
View File

@ -0,0 +1,62 @@
// swift-tools-version:5.5
import PackageDescription
let package = Package(
name: "whisper",
platforms: [
.macOS(.v12),
.iOS(.v14),
.watchOS(.v4),
.tvOS(.v14)
],
products: [
.library(name: "whisper", targets: ["whisper"]),
],
targets: [
.target(
name: "whisper",
path: ".",
exclude: [
"build",
"bindings",
"cmake",
"examples",
"scripts",
"models",
"samples",
"tests",
"CMakeLists.txt",
"Makefile",
"ggml/src/ggml-metal-embed.metal"
],
sources: [
"ggml/src/ggml.c",
"src/whisper.cpp",
"ggml/src/ggml-aarch64.c",
"ggml/src/ggml-alloc.c",
"ggml/src/ggml-backend.cpp",
"ggml/src/ggml-cpu.c",
"ggml/src/ggml-quants.c",
"ggml/src/ggml-metal.m"
],
resources: [.process("ggml/src/ggml-metal.metal")],
publicHeadersPath: "spm-headers",
cSettings: [
.unsafeFlags(["-Wno-shorten-64-to-32", "-O3", "-DNDEBUG"]),
.define("GGML_USE_ACCELERATE"),
.unsafeFlags(["-fno-objc-arc"]),
.define("GGML_USE_METAL")
// NOTE: NEW_LAPACK will required iOS version 16.4+
// We should consider add this in the future when we drop support for iOS 14
// (ref: ref: https://developer.apple.com/documentation/accelerate/1513264-cblas_sgemm?language=objc)
// .define("ACCELERATE_NEW_LAPACK"),
// .define("ACCELERATE_LAPACK_ILP64")
],
linkerSettings: [
.linkedFramework("Accelerate")
]
)
],
cxxLanguageStandard: .cxx11
)

359
README.md
View File

@ -4,26 +4,26 @@
[![Actions Status](https://github.com/ggerganov/whisper.cpp/workflows/CI/badge.svg)](https://github.com/ggerganov/whisper.cpp/actions)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Conan Center](https://shields.io/conan/v/whisper-cpp)](https://conan.io/center/whisper-cpp)
[![npm](https://img.shields.io/npm/v/whisper.cpp.svg)](https://www.npmjs.com/package/whisper.cpp/)
Beta: [v1.4.2](https://github.com/ggerganov/whisper.cpp/releases/tag/v1.4.2) / Stable: [v1.2.1](https://github.com/ggerganov/whisper.cpp/releases/tag/v1.2.1) / [Roadmap | F.A.Q.](https://github.com/ggerganov/whisper.cpp/discussions/126)
Stable: [v1.7.2](https://github.com/ggerganov/whisper.cpp/releases/tag/v1.7.2) / [Roadmap | F.A.Q.](https://github.com/ggerganov/whisper.cpp/discussions/126)
High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisper) automatic speech recognition (ASR) model:
- Plain C/C++ implementation without dependencies
- Apple Silicon first-class citizen - optimized via ARM NEON, Accelerate framework, Metal and [Core ML](https://github.com/ggerganov/whisper.cpp#core-ml-support)
- Apple Silicon first-class citizen - optimized via ARM NEON, Accelerate framework, Metal and [Core ML](#core-ml-support)
- AVX intrinsics support for x86 architectures
- VSX intrinsics support for POWER architectures
- Mixed F16 / F32 precision
- [4-bit and 5-bit integer quantization support](https://github.com/ggerganov/whisper.cpp#quantization)
- Low memory usage (Flash Attention)
- [Integer quantization support](#quantization)
- Zero memory allocations at runtime
- [Vulkan support](#vulkan-gpu-support)
- Support for CPU-only inference
- [Partial GPU support for NVIDIA via cuBLAS](https://github.com/ggerganov/whisper.cpp#nvidia-gpu-support-via-cublas)
- [Partial OpenCL GPU support via CLBlast](https://github.com/ggerganov/whisper.cpp#opencl-gpu-support-via-clblast)
- [BLAS CPU support via OpenBLAS](https://github.com/ggerganov/whisper.cpp#blas-cpu-support-via-openblas)
- [OpenVINO Support](https://github.com/ggerganov/whisper.cpp#openvino-support)
- [C-style API](https://github.com/ggerganov/whisper.cpp/blob/master/whisper.h)
- [Efficient GPU support for NVIDIA](#nvidia-gpu-support)
- [OpenVINO Support](#openvino-support)
- [Ascend NPU Support](#ascend-npu-support)
- [C-style API](https://github.com/ggerganov/whisper.cpp/blob/master/include/whisper.h)
Supported platforms:
@ -35,11 +35,10 @@ Supported platforms:
- [x] [WebAssembly](examples/whisper.wasm)
- [x] Windows ([MSVC](https://github.com/ggerganov/whisper.cpp/blob/master/.github/workflows/build.yml#L117-L144) and [MinGW](https://github.com/ggerganov/whisper.cpp/issues/168)]
- [x] [Raspberry Pi](https://github.com/ggerganov/whisper.cpp/discussions/166)
- [x] [Docker](https://github.com/ggerganov/whisper.cpp/pkgs/container/whisper.cpp)
The entire implementation of the model is contained in 2 source files:
- Tensor operations: [ggml.h](ggml.h) / [ggml.c](ggml.c)
- Transformer inference: [whisper.h](whisper.h) / [whisper.cpp](whisper.cpp)
The entire high-level implementation of the model is contained in [whisper.h](include/whisper.h) and [whisper.cpp](src/whisper.cpp).
The rest of the code is part of the [`ggml`](https://github.com/ggerganov/ggml) machine learning library.
Having such a lightweight implementation of the model allows to easily integrate it in different platforms and applications.
As an example, here is a video of running the model on an iPhone 13 device - fully offline, on-device: [whisper.objc](examples/whisper.objc)
@ -58,33 +57,39 @@ Or you can even run it straight in the browser: [talk.wasm](examples/talk.wasm)
## Implementation details
- The core tensor operations are implemented in C ([ggml.h](ggml.h) / [ggml.c](ggml.c))
- The transformer model and the high-level C-style API are implemented in C++ ([whisper.h](whisper.h) / [whisper.cpp](whisper.cpp))
- The core tensor operations are implemented in C ([ggml.h](ggml/include/ggml.h) / [ggml.c](ggml/src/ggml.c))
- The transformer model and the high-level C-style API are implemented in C++ ([whisper.h](include/whisper.h) / [whisper.cpp](src/whisper.cpp))
- Sample usage is demonstrated in [main.cpp](examples/main)
- Sample real-time audio transcription from the microphone is demonstrated in [stream.cpp](examples/stream)
- Various other examples are available in the [examples](examples) folder
The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, Arm Neon SIMD
intrinsics or CBLAS Accelerate framework routines are used. The latter are especially effective for bigger sizes since
the Accelerate framework utilizes the special-purpose AMX coprocessor available in modern Apple products.
The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, Arm Neon SIMD intrinsics or CBLAS Accelerate framework routines are used. The latter are especially effective for bigger sizes since the Accelerate framework utilizes the special-purpose AMX coprocessor available in modern Apple products.
## Quick start
First clone the repository.
Then, download one of the Whisper models converted in [ggml format](models). For example:
First clone the repository:
```bash
bash ./models/download-ggml-model.sh base.en
git clone https://github.com/ggerganov/whisper.cpp.git
```
If you wish to convert the Whisper models to ggml format yourself, instructions are in [models/README.md](models/README.md).
Navigate into the directory:
```
cd whisper.cpp
```
Then, download one of the Whisper [models](models/README.md) converted in [`ggml` format](#ggml-format). For example:
```bash
sh ./models/download-ggml-model.sh base.en
```
Now build the [main](examples/main) example and transcribe an audio file like this:
```bash
# build the main example
make
make -j
# transcribe an audio file
./main -f samples/jfk.wav
@ -94,8 +99,8 @@ make
For a quick demo, simply run `make base.en`:
```java
$ make base.en
```text
$ make -j base.en
cc -I. -O3 -std=c11 -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
@ -114,8 +119,8 @@ options:
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-sow, --split-on-word [false ] split on word rather than on token
-bo N, --best-of N [2 ] number of best candidates to keep
-bs N, --beam-size N [-1 ] beam size for beam search
-bo N, --best-of N [5 ] number of best candidates to keep
-bs N, --beam-size N [5 ] beam size for beam search
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
@ -132,6 +137,7 @@ options:
-fp, --font-path [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
-ocsv, --output-csv [false ] output result in a CSV file
-oj, --output-json [false ] output result in a JSON file
-ojf, --output-json-full [false ] include more information in the JSON file
-of FNAME, --output-file FNAME [ ] output file path (without file extension)
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
@ -143,10 +149,11 @@ options:
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input WAV file path
-oved D, --ov-e-device DNAME [CPU ] the OpenVINO device used for encode inference
-ls, --log-score [false ] log best decoder scores of token
-ls, --log-score [false ] log best decoder scores of tokens
-ng, --no-gpu [false ] disable GPU
bash ./models/download-ggml-model.sh base.en
sh ./models/download-ggml-model.sh base.en
Downloading ggml model base.en ...
ggml-base.en.bin 100%[========================>] 141.11M 6.34MB/s in 24s
Done! Model 'base.en' saved in 'models/ggml-base.en.bin'
@ -208,7 +215,7 @@ For detailed usage instructions, run: `./main -h`
Note that the [main](examples/main) example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool.
For example, you can use `ffmpeg` like this:
```java
```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```
@ -217,7 +224,7 @@ ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
If you want some extra audio samples to play with, simply run:
```
make samples
make -j samples
```
This will download a few more audio files from Wikipedia and convert them to 16-bit WAV format via `ffmpeg`.
@ -225,28 +232,29 @@ This will download a few more audio files from Wikipedia and convert them to 16-
You can download and run the other models as follows:
```
make tiny.en
make tiny
make base.en
make base
make small.en
make small
make medium.en
make medium
make large-v1
make large-v2
make large
make -j tiny.en
make -j tiny
make -j base.en
make -j base
make -j small.en
make -j small
make -j medium.en
make -j medium
make -j large-v1
make -j large-v2
make -j large-v3
make -j large-v3-turbo
```
## Memory usage
| Model | Disk | Mem | SHA |
| --- | --- | --- | --- |
| tiny | 75 MB | ~125 MB | `bd577a113a864445d4c299885e0cb97d4ba92b5f` |
| base | 142 MB | ~210 MB | `465707469ff3a37a2b9b8d8f89f2f99de7299dac` |
| small | 466 MB | ~600 MB | `55356645c2b361a969dfd0ef2c5a50d530afd8d5` |
| medium | 1.5 GB | ~1.7 GB | `fd9727b6e1217c2f614f9b698455c4ffd82463b4` |
| large | 2.9 GB | ~3.3 GB | `ad82bf6a9043ceed055076d0fd39f5f186ff8062` |
| Model | Disk | Mem |
| ------ | ------- | ------- |
| tiny | 75 MiB | ~273 MB |
| base | 142 MiB | ~388 MB |
| small | 466 MiB | ~852 MB |
| medium | 1.5 GiB | ~2.1 GB |
| large | 2.9 GiB | ~3.9 GB |
## Quantization
@ -257,7 +265,7 @@ Here are the steps for creating and using a quantized model:
```bash
# quantize a model with Q5_0 method
make quantize
make -j quantize
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0
# run the examples as usual, specifying the quantized model file
@ -279,7 +287,8 @@ speed-up - more than x3 faster compared with CPU-only execution. Here are the in
- To ensure `coremltools` operates correctly, please confirm that [Xcode](https://developer.apple.com/xcode/) is installed and execute `xcode-select --install` to install the command-line tools.
- Python 3.10 is recommended.
- [OPTIONAL] It is recommended to utilize a Python version management system, such as [Miniconda](https://docs.conda.io/en/latest/miniconda.html) for this step:
- MacOS Sonoma (version 14) or newer is recommended, as older versions of MacOS might experience issues with transcription hallucination.
- [OPTIONAL] It is recommended to utilize a Python version management system, such as [Miniconda](https://docs.conda.io/en/latest/miniconda.html) for this step:
- To create an environment, use: `conda create -n py310-whisper python=3.10 -y`
- To activate the environment, use: `conda activate py310-whisper`
@ -305,8 +314,8 @@ speed-up - more than x3 faster compared with CPU-only execution. Here are the in
- Run the examples as usual. For example:
```bash
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```text
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
...
@ -334,21 +343,23 @@ This can result in significant speedup in encoder performance. Here are the inst
- First, setup python virtual env. and install python dependencies. Python 3.10 is recommended.
Windows:
```
```powershell
cd models
python -m venv openvino_conv_env
openvino_conv_env\Scripts\activate
python -m pip install --upgrade pip
pip install -r openvino-conversion-requirements.txt
pip install -r requirements-openvino.txt
```
Linux and macOS:
```
```bash
cd models
python3 -m venv openvino_conv_env
source openvino_conv_env/bin/activate
python -m pip install --upgrade pip
pip install -r openvino-conversion-requirements.txt
pip install -r requirements-openvino.txt
```
- Generate an OpenVINO encoder model. For example, to generate a `base.en` model, use:
@ -357,7 +368,7 @@ This can result in significant speedup in encoder performance. Here are the inst
python convert-whisper-to-openvino.py --model base.en
```
This will produce ggml-base.en-encoder-openvino.xml/.bin IR model files. It's recommended to relocate these to the same folder as ggml models, as that
This will produce ggml-base.en-encoder-openvino.xml/.bin IR model files. It's recommended to relocate these to the same folder as `ggml` models, as that
is the default location that the OpenVINO extension will search at runtime.
- Build `whisper.cpp` with OpenVINO support:
@ -367,24 +378,28 @@ This can result in significant speedup in encoder performance. Here are the inst
After downloading & extracting package onto your development system, set up required environment by sourcing setupvars script. For example:
Linux:
```bash
source /path/to/l_openvino_toolkit_ubuntu22_2023.0.0.10926.b4452d56304_x86_64/setupvars.sh
```
Windows (cmd):
```
```powershell
C:\Path\To\w_openvino_toolkit_windows_2023.0.0.10926.b4452d56304_x86_64\setupvars.bat
```
And then build the project using cmake:
```bash
cmake -B build -DWHISPER_OPENVINO=1
cmake --build build -j --config Release
```
- Run the examples as usual. For example:
```bash
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```text
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
...
@ -400,44 +415,31 @@ This can result in significant speedup in encoder performance. Here are the inst
The first time run on an OpenVINO device is slow, since the OpenVINO framework will compile the IR (Intermediate Representation) model to a device-specific 'blob'. This device-specific blob will get
cached for the next run.
For more information about the Core ML implementation please refer to PR [#1037](https://github.com/ggerganov/whisper.cpp/pull/1037).
## NVIDIA GPU support via cuBLAS
## NVIDIA GPU support
With NVIDIA cards the Encoder processing can to a large extent be offloaded to the GPU through cuBLAS.
With NVIDIA cards the processing of the models is done efficiently on the GPU via cuBLAS and custom CUDA kernels.
First, make sure you have installed `cuda`: https://developer.nvidia.com/cuda-downloads
Now build `whisper.cpp` with cuBLAS support:
Now build `whisper.cpp` with CUDA support:
```
make clean
WHISPER_CUBLAS=1 make -j
GGML_CUDA=1 make -j
```
## OpenCL GPU support via CLBlast
For cards and integrated GPUs that support OpenCL, the Encoder processing can be largely offloaded to the GPU through CLBlast. This is especially useful for users with AMD APUs or low end devices for up to ~2x speedup.
First, make sure you have installed `CLBlast` for your OS or Distribution: https://github.com/CNugteren/CLBlast
Now build `whisper.cpp` with CLBlast support:
## Vulkan GPU support
Cross-vendor solution which allows you to accelerate workload on your GPU.
First, make sure your graphics card driver provides support for Vulkan API.
Now build `whisper.cpp` with Vulkan support:
```
Makefile:
cd whisper.cpp
make clean
WHISPER_CLBLAST=1 make -j
CMake:
cd whisper.cpp
cmake -B build -DWHISPER_CLBLAST=ON
cmake --build build -j --config Release
make GGML_VULKAN=1 -j
```
Run all the examples as usual.
## BLAS CPU support via OpenBLAS
Encoder processing can be accelerated on the CPU via OpenBLAS.
@ -447,9 +449,99 @@ Now build `whisper.cpp` with OpenBLAS support:
```
make clean
WHISPER_OPENBLAS=1 make -j
GGML_OPENBLAS=1 make -j
```
## BLAS CPU support via Intel MKL
Encoder processing can be accelerated on the CPU via the BLAS compatible interface of Intel's Math Kernel Library.
First, make sure you have installed Intel's MKL runtime and development packages: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-download.html
Now build `whisper.cpp` with Intel MKL BLAS support:
```
source /opt/intel/oneapi/setvars.sh
mkdir build
cd build
cmake -DWHISPER_MKL=ON ..
WHISPER_MKL=1 make -j
```
## Ascend NPU support
Ascend NPU provides inference acceleration via [`CANN`](https://www.hiascend.com/en/software/cann) and AI cores.
First, check if your Ascend NPU device is supported:
**Verified devices**
| Ascend NPU | Status |
|:-----------------------------:|:-------:|
| Atlas 300T A2 | Support |
Then, make sure you have installed [`CANN toolkit`](https://www.hiascend.com/en/software/cann/community) . The lasted version of CANN is recommanded.
Now build `whisper.cpp` with CANN support:
```
mkdir build
cd build
cmake .. -D GGML_CANN=on
make -j
```
Run the inference examples as usual, for example:
```
./build/bin/main -f samples/jfk.wav -m models/ggml-base.en.bin -t 8
```
*Notes:*
- If you have trouble with Ascend NPU device, please create a issue with **[CANN]** prefix/tag.
- If you run successfully with your Ascend NPU device, please help update the table `Verified devices`.
## Docker
### Prerequisites
- Docker must be installed and running on your system.
- Create a folder to store big models & intermediate files (ex. /whisper/models)
### Images
We have two Docker images available for this project:
1. `ghcr.io/ggerganov/whisper.cpp:main`: This image includes the main executable file as well as `curl` and `ffmpeg`. (platforms: `linux/amd64`, `linux/arm64`)
2. `ghcr.io/ggerganov/whisper.cpp:main-cuda`: Same as `main` but compiled with CUDA support. (platforms: `linux/amd64`)
### Usage
```shell
# download model and persist it in a local folder
docker run -it --rm \
-v path/to/models:/models \
whisper.cpp:main "./models/download-ggml-model.sh base /models"
# transcribe an audio file
docker run -it --rm \
-v path/to/models:/models \
-v path/to/audios:/audios \
whisper.cpp:main "./main -m /models/ggml-base.bin -f /audios/jfk.wav"
# transcribe an audio file in samples folder
docker run -it --rm \
-v path/to/models:/models \
whisper.cpp:main "./main -m /models/ggml-base.bin -f ./samples/jfk.wav"
```
## Installing with Conan
You can install pre-built binaries for whisper.cpp or build it from source using [Conan](https://conan.io/). Use the following command:
```
conan install --requires="whisper-cpp/[*]" --build=missing
```
For detailed instructions on how to use Conan, please refer to the [Conan documentation](https://docs.conan.io/2/).
## Limitations
- Inference only
@ -462,7 +554,7 @@ in about half a minute on a MacBook M1 Pro, using `medium.en` model:
<details>
<summary>Expand to see the result</summary>
```java
```text
$ ./main -m models/ggml-medium.en.bin -f samples/gb1.wav -t 8
whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
@ -534,6 +626,7 @@ whisper_print_timings: encode time = 18665.10 ms / 9 runs ( 2073.90 ms per
whisper_print_timings: decode time = 13090.93 ms / 549 runs ( 23.85 ms per run)
whisper_print_timings: total time = 32733.52 ms
```
</details>
## Real-time audio input example
@ -542,8 +635,8 @@ This is a naive example of performing real-time inference on audio from your mic
The [stream](examples/stream) tool samples the audio every half a second and runs the transcription continuously.
More info is available in [issue #10](https://github.com/ggerganov/whisper.cpp/issues/10).
```java
make stream
```bash
make stream -j
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
```
@ -554,7 +647,7 @@ https://user-images.githubusercontent.com/1991296/194935793-76afede7-cfa8-48d8-a
Adding the `--print-colors` argument will print the transcribed text using an experimental color coding strategy
to highlight words with high or low confidence:
```java
```bash
./main -m models/ggml-base.en.bin -f samples/gb0.wav --print-colors
```
@ -564,8 +657,8 @@ to highlight words with high or low confidence:
For example, to limit the line length to a maximum of 16 characters, simply add `-ml 16`:
```java
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16
```text
$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16
whisper_model_load: loading model from './models/ggml-base.en.bin'
...
@ -588,8 +681,8 @@ main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 pr
The `--max-len` argument can be used to obtain word-level timestamps. Simply use `-ml 1`:
```java
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
```text
$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
whisper_model_load: loading model from './models/ggml-base.en.bin'
...
@ -657,9 +750,9 @@ The [main](examples/main) example provides support for output of karaoke-style m
currently pronounced word is highlighted. Use the `-wts` argument and run the generated bash script.
This requires to have `ffmpeg` installed.
Here are a few *"typical"* examples:
Here are a few _"typical"_ examples:
```java
```bash
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
source ./samples/jfk.wav.wts
ffplay ./samples/jfk.wav.mp4
@ -669,7 +762,7 @@ https://user-images.githubusercontent.com/1991296/199337465-dbee4b5e-9aeb-48a3-b
---
```java
```bash
./main -m ./models/ggml-base.en.bin -f ./samples/mm0.wav -owts
source ./samples/mm0.wav.wts
ffplay ./samples/mm0.wav.mp4
@ -679,7 +772,7 @@ https://user-images.githubusercontent.com/1991296/199337504-cc8fd233-0cb7-4920-9
---
```java
```bash
./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav -owts
source ./samples/gb0.wav.wts
ffplay ./samples/gb0.wav.mp4
@ -691,10 +784,10 @@ https://user-images.githubusercontent.com/1991296/199337538-b7b0c7a3-2753-4a88-a
## Video comparison of different models
Use the [extra/bench-wts.sh](https://github.com/ggerganov/whisper.cpp/blob/master/extra/bench-wts.sh) script to generate a video in the following format:
Use the [scripts/bench-wts.sh](https://github.com/ggerganov/whisper.cpp/blob/master/scripts/bench-wts.sh) script to generate a video in the following format:
```java
./extra/bench-wts.sh samples/jfk.wav
```bash
./scripts/bench-wts.sh samples/jfk.wav
ffplay ./samples/jfk.wav.all.mp4
```
@ -710,20 +803,19 @@ took to execute it. The results are summarized in the following Github issue:
[Benchmark results](https://github.com/ggerganov/whisper.cpp/issues/89)
Additionally a script to run whisper.cpp with different models and audio files is provided [bench.py](bench.py).
Additionally a script to run whisper.cpp with different models and audio files is provided [bench.py](scripts/bench.py).
You can run it with the following command, by default it will run against any standard model in the models folder.
```bash
python3 extra/bench.py -f samples/jfk.wav -t 2,4,8 -p 1,2
python3 scripts/bench.py -f samples/jfk.wav -t 2,4,8 -p 1,2
```
It is written in python with the intention of being easy to modify and extend for your benchmarking use case.
It outputs a csv file with the results of the benchmarking.
## ggml format
## `ggml` format
The original models are converted to a custom binary format. This allows to pack everything needed into a single file:
@ -738,49 +830,52 @@ or manually from here:
- https://huggingface.co/ggerganov/whisper.cpp
- https://ggml.ggerganov.com
For more details, see the conversion script [models/convert-pt-to-ggml.py](models/convert-pt-to-ggml.py) or the README
in [models](models).
For more details, see the conversion script [models/convert-pt-to-ggml.py](models/convert-pt-to-ggml.py) or [models/README.md](models/README.md).
## [Bindings](https://github.com/ggerganov/whisper.cpp/discussions/categories/bindings)
- [X] Rust: [tazz4843/whisper-rs](https://github.com/tazz4843/whisper-rs) | [#310](https://github.com/ggerganov/whisper.cpp/discussions/310)
- [X] JavaScript: [bindings/javascript](bindings/javascript) | [#309](https://github.com/ggerganov/whisper.cpp/discussions/309)
- [x] Rust: [tazz4843/whisper-rs](https://github.com/tazz4843/whisper-rs) | [#310](https://github.com/ggerganov/whisper.cpp/discussions/310)
- [x] JavaScript: [bindings/javascript](bindings/javascript) | [#309](https://github.com/ggerganov/whisper.cpp/discussions/309)
- React Native (iOS / Android): [whisper.rn](https://github.com/mybigday/whisper.rn)
- [X] Go: [bindings/go](bindings/go) | [#312](https://github.com/ggerganov/whisper.cpp/discussions/312)
- [X] Java:
- [x] Go: [bindings/go](bindings/go) | [#312](https://github.com/ggerganov/whisper.cpp/discussions/312)
- [x] Java:
- [GiviMAD/whisper-jni](https://github.com/GiviMAD/whisper-jni)
- [X] Ruby: [bindings/ruby](bindings/ruby) | [#507](https://github.com/ggerganov/whisper.cpp/discussions/507)
- [X] Objective-C / Swift: [ggerganov/whisper.spm](https://github.com/ggerganov/whisper.spm) | [#313](https://github.com/ggerganov/whisper.cpp/discussions/313)
- [x] Ruby: [bindings/ruby](bindings/ruby) | [#507](https://github.com/ggerganov/whisper.cpp/discussions/507)
- [x] Objective-C / Swift: [ggerganov/whisper.spm](https://github.com/ggerganov/whisper.spm) | [#313](https://github.com/ggerganov/whisper.cpp/discussions/313)
- [exPHAT/SwiftWhisper](https://github.com/exPHAT/SwiftWhisper)
- [X] .NET: | [#422](https://github.com/ggerganov/whisper.cpp/discussions/422)
- [x] .NET: | [#422](https://github.com/ggerganov/whisper.cpp/discussions/422)
- [sandrohanea/whisper.net](https://github.com/sandrohanea/whisper.net)
- [NickDarvey/whisper](https://github.com/NickDarvey/whisper)
- [X] Python: | [#9](https://github.com/ggerganov/whisper.cpp/issues/9)
- [x] Python: | [#9](https://github.com/ggerganov/whisper.cpp/issues/9)
- [stlukey/whispercpp.py](https://github.com/stlukey/whispercpp.py) (Cython)
- [AIWintermuteAI/whispercpp](https://github.com/AIWintermuteAI/whispercpp) (Updated fork of aarnphm/whispercpp)
- [aarnphm/whispercpp](https://github.com/aarnphm/whispercpp) (Pybind11)
- [X] R: [bnosac/audio.whisper](https://github.com/bnosac/audio.whisper)
- [X] Unity: [macoron/whisper.unity](https://github.com/Macoron/whisper.unity)
- [abdeladim-s/pywhispercpp](https://github.com/abdeladim-s/pywhispercpp) (Pybind11)
- [x] R: [bnosac/audio.whisper](https://github.com/bnosac/audio.whisper)
- [x] Unity: [macoron/whisper.unity](https://github.com/Macoron/whisper.unity)
## Examples
There are various examples of using the library for different projects in the [examples](examples) folder.
Some of the examples are even ported to run in the browser using WebAssembly. Check them out!
| Example | Web | Description |
| --- | --- | --- |
| [main](examples/main) | [whisper.wasm](examples/whisper.wasm) | Tool for translating and transcribing audio using Whisper |
| [bench](examples/bench) | [bench.wasm](examples/bench.wasm) | Benchmark the performance of Whisper on your machine |
| [stream](examples/stream) | [stream.wasm](examples/stream.wasm) | Real-time transcription of raw microphone capture |
| [command](examples/command) | [command.wasm](examples/command.wasm) | Basic voice assistant example for receiving voice commands from the mic |
| [talk](examples/talk) | [talk.wasm](examples/talk.wasm) | Talk with a GPT-2 bot |
| [talk-llama](examples/talk-llama) | | Talk with a LLaMA bot |
| [whisper.objc](examples/whisper.objc) | | iOS mobile application using whisper.cpp |
| [whisper.swiftui](examples/whisper.swiftui) | | SwiftUI iOS / macOS application using whisper.cpp |
| [whisper.android](examples/whisper.android) | | Android mobile application using whisper.cpp |
| [whisper.nvim](examples/whisper.nvim) | | Speech-to-text plugin for Neovim |
| [generate-karaoke.sh](examples/generate-karaoke.sh) | | Helper script to easily [generate a karaoke video](https://youtu.be/uj7hVta4blM) of raw audio capture |
| [livestream.sh](examples/livestream.sh) | | [Livestream audio transcription](https://github.com/ggerganov/whisper.cpp/issues/185) |
| [yt-wsp.sh](examples/yt-wsp.sh) | | Download + transcribe and/or translate any VOD [(original)](https://gist.github.com/DaniruKun/96f763ec1a037cc92fe1a059b643b818) |
| Example | Web | Description |
| --------------------------------------------------- | ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| [main](examples/main) | [whisper.wasm](examples/whisper.wasm) | Tool for translating and transcribing audio using Whisper |
| [bench](examples/bench) | [bench.wasm](examples/bench.wasm) | Benchmark the performance of Whisper on your machine |
| [stream](examples/stream) | [stream.wasm](examples/stream.wasm) | Real-time transcription of raw microphone capture |
| [command](examples/command) | [command.wasm](examples/command.wasm) | Basic voice assistant example for receiving voice commands from the mic |
| [wchess](examples/wchess) | [wchess.wasm](examples/wchess) | Voice-controlled chess |
| [talk](examples/talk) | [talk.wasm](examples/talk.wasm) | Talk with a GPT-2 bot |
| [talk-llama](examples/talk-llama) | | Talk with a LLaMA bot |
| [whisper.objc](examples/whisper.objc) | | iOS mobile application using whisper.cpp |
| [whisper.swiftui](examples/whisper.swiftui) | | SwiftUI iOS / macOS application using whisper.cpp |
| [whisper.android](examples/whisper.android) | | Android mobile application using whisper.cpp |
| [whisper.nvim](examples/whisper.nvim) | | Speech-to-text plugin for Neovim |
| [generate-karaoke.sh](examples/generate-karaoke.sh) | | Helper script to easily [generate a karaoke video](https://youtu.be/uj7hVta4blM) of raw audio capture |
| [livestream.sh](examples/livestream.sh) | | [Livestream audio transcription](https://github.com/ggerganov/whisper.cpp/issues/185) |
| [yt-wsp.sh](examples/yt-wsp.sh) | | Download + transcribe and/or translate any VOD [(original)](https://gist.github.com/DaniruKun/96f763ec1a037cc92fe1a059b643b818) |
| [server](examples/server) | | HTTP transcription server with OAI-like API |
## [Discussions](https://github.com/ggerganov/whisper.cpp/discussions)

249
README_sycl.md Normal file
View File

@ -0,0 +1,249 @@
# whisper.cpp for SYCL
[Background](#background)
[OS](#os)
[Intel GPU](#intel-gpu)
[Linux](#linux)
[Environment Variable](#environment-variable)
[Known Issue](#known-issue)
[Todo](#todo)
## Background
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators<72>such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.
oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.
Intel uses the SYCL as direct programming language to support CPU, GPUs and FPGAs.
To avoid re-inventing the wheel, this code refers other code paths in llama.cpp (like OpenBLAS, cuBLAS, CLBlast). We use a open-source tool [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) (Commercial release [Intel<EFBFBD> DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) migrate to SYCL.
The whisper.cpp for SYCL is used to support Intel GPUs.
For Intel CPU, recommend to use whisper.cpp for X86 (Intel MKL build).
## OS
|OS|Status|Verified|
|-|-|-|
|Linux|Support|Ubuntu 22.04|
|Windows|Ongoing| |
## Intel GPU
|Intel GPU| Status | Verified Model|
|-|-|-|
|Intel Data Center Max Series| Support| Max 1550|
|Intel Data Center Flex Series| Support| Flex 170|
|Intel Arc Series| Support| Arc 770|
|Intel built-in Arc GPU| Support| built-in Arc GPU in Meteor Lake|
|Intel iGPU| Support| iGPU in i5-1250P, i7-1165G7|
## Linux
### Setup Environment
1. Install Intel GPU driver.
a. Please install Intel GPU driver by official guide: [Install GPU Drivers](https://dgpu-docs.intel.com/driver/installation.html).
Note: for iGPU, please install the client GPU driver.
b. Add user to group: video, render.
```
sudo usermod -aG render username
sudo usermod -aG video username
```
Note: re-login to enable it.
c. Check
```
sudo apt install clinfo
sudo clinfo -l
```
Output (example):
```
Platform #0: Intel(R) OpenCL Graphics
`-- Device #0: Intel(R) Arc(TM) A770 Graphics
Platform #0: Intel(R) OpenCL HD Graphics
`-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
```
2. Install Intel<65> oneAPI Base toolkit.
a. Please follow the procedure in [Get the Intel<65> oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
Recommend to install to default folder: **/opt/intel/oneapi**.
Following guide use the default folder as example. If you use other folder, please modify the following guide info with your folder.
b. Check
```
source /opt/intel/oneapi/setvars.sh
sycl-ls
```
There should be one or more level-zero devices. Like **[ext_oneapi_level_zero:gpu:0]**.
Output (example):
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]
```
2. Build locally:
```
mkdir -p build
cd build
source /opt/intel/oneapi/setvars.sh
#for FP16
#cmake .. -DWHISPER_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DWHISPER_SYCL_F16=ON
#for FP32
cmake .. -DWHISPER_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
#build example/main only
#cmake --build . --config Release --target main
#build all binary
cmake --build . --config Release -v
```
or
```
./examples/sycl/build.sh
```
Note:
- By default, it will build for all binary files. It will take more time. To reduce the time, we recommend to build for **example/main** only.
### Run
1. Put model file to folder **models**
2. Enable oneAPI running environment
```
source /opt/intel/oneapi/setvars.sh
```
3. List device ID
Run without parameter:
```
./build/bin/ls-sycl-device
or
./build/bin/main
```
Check the ID in startup log, like:
```
found 4 SYCL devices:
Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3,
max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136
Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2,
max compute_units 24, max work group size 67108864, max sub group size 64, global mem size 67065057280
Device 2: 13th Gen Intel(R) Core(TM) i7-13700K, compute capability 3.0,
max compute_units 24, max work group size 8192, max sub group size 64, global mem size 67065057280
Device 3: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0,
max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136
```
|Attribute|Note|
|-|-|
|compute capability 1.3|Level-zero running time, recommended |
|compute capability 3.0|OpenCL running time, slower than level-zero in most cases|
4. Set device ID and execute whisper.cpp
Set device ID = 0 by **GGML_SYCL_DEVICE=0**
```
GGML_SYCL_DEVICE=0 ./build/bin/main -m models/ggml-base.en.bin -f samples/jfk.wav
```
or run by script:
```
./examples/sycl/run_whisper.sh
```
5. Check the device ID in output
Like:
```
Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
```
## Environment Variable
#### Build
|Name|Value|Function|
|-|-|-|
|WHISPER_SYCL|ON (mandatory)|Enable build with SYCL code path. <br>For FP32/FP16, WHISPER_SYCL=ON is mandatory.|
|WHISPER_SYCL_F16|ON (optional)|Enable FP16 build with SYCL code path.For FP32, do not set it.|
|CMAKE_C_COMPILER|icx|Use icx compiler for SYCL code path|
|CMAKE_CXX_COMPILER|icpx|use icpx for SYCL code path|
#### Running
|Name|Value|Function|
|-|-|-|
|GGML_SYCL_DEVICE|0 (default) or 1|Set the device id used. Check the device ids by default running output|
|GGML_SYCL_DEBUG|0 (default) or 1|Enable log function by macro: GGML_SYCL_DEBUG|
## Known Issue
- Error: `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.
Miss to enable oneAPI running environment.
Install oneAPI base toolkit and enable it by: `source /opt/intel/oneapi/setvars.sh`.
- Hang during startup
llama.cpp use mmap as default way to read model file and copy to GPU. In some system, memcpy will be abnormal and block.
Solution: add **--no-mmap**.
## Todo
- Support to build in Windows.
- Support multiple cards.

View File

@ -1,9 +1,31 @@
ifndef UNAME_S
UNAME_S := $(shell uname -s)
endif
ifndef UNAME_P
UNAME_P := $(shell uname -p)
endif
ifndef UNAME_M
UNAME_M := $(shell uname -m)
endif
GGML_METAL_PATH_RESOURCES := $(abspath ../..)
BUILD_DIR := build
MODELS_DIR := models
EXAMPLES_DIR := $(wildcard examples/*)
INCLUDE_PATH := $(abspath ../..)
INCLUDE_PATH := $(abspath ../../include):$(abspath ../../ggml/include)
LIBRARY_PATH := $(abspath ../..)
ifeq ($(GGML_CUDA),1)
LIBRARY_PATH := $(LIBRARY_PATH):$(CUDA_PATH)/targets/$(UNAME_M)-linux/lib/
BUILD_FLAGS := -ldflags "-extldflags '-lcudart -lcuda -lcublas'"
endif
ifeq ($(UNAME_S),Darwin)
EXT_LDFLAGS := -framework Foundation -framework Metal -framework MetalKit
endif
all: clean whisper examples
whisper: mkdir
@ -11,8 +33,13 @@ whisper: mkdir
@${MAKE} -C ../.. libwhisper.a
test: model-small whisper modtidy
ifeq ($(UNAME_S),Darwin)
@C_INCLUDE_PATH=${INCLUDE_PATH} LIBRARY_PATH=${LIBRARY_PATH} GGML_METAL_PATH_RESOURCES=${GGML_METAL_PATH_RESOURCES} go test -ldflags "-extldflags '$(EXT_LDFLAGS)'" -v .
@C_INCLUDE_PATH=${INCLUDE_PATH} LIBRARY_PATH=${LIBRARY_PATH} GGML_METAL_PATH_RESOURCES=${GGML_METAL_PATH_RESOURCES} go test -ldflags "-extldflags '$(EXT_LDFLAGS)'" -v ./pkg/whisper/...
else
@C_INCLUDE_PATH=${INCLUDE_PATH} LIBRARY_PATH=${LIBRARY_PATH} go test -v .
@C_INCLUDE_PATH=${INCLUDE_PATH} LIBRARY_PATH=${LIBRARY_PATH} go test -v ./pkg/whisper/...
endif
examples: $(EXAMPLES_DIR)
@ -21,7 +48,11 @@ model-small: mkdir examples/go-model-download
$(EXAMPLES_DIR): mkdir whisper modtidy
@echo Build example $(notdir $@)
ifeq ($(UNAME_S),Darwin)
@C_INCLUDE_PATH=${INCLUDE_PATH} LIBRARY_PATH=${LIBRARY_PATH} GGML_METAL_PATH_RESOURCES=${GGML_METAL_PATH_RESOURCES} go build ${BUILD_FLAGS} -ldflags "-extldflags '$(EXT_LDFLAGS)'" -o ${BUILD_DIR}/$(notdir $@) ./$@
else
@C_INCLUDE_PATH=${INCLUDE_PATH} LIBRARY_PATH=${LIBRARY_PATH} go build ${BUILD_FLAGS} -o ${BUILD_DIR}/$(notdir $@) ./$@
endif
mkdir:
@echo Mkdir ${BUILD_DIR}

View File

@ -62,6 +62,12 @@ This will compile a static `libwhisper.a` in a `build` folder, download a model
make examples
```
To build using cuda support add `GGML_CUDA=1`:
```bash
GGML_CUDA=1 make examples
```
The examples are placed in the `build` directory. Once built, you can download all the models with the following command:
```bash

View File

@ -24,7 +24,7 @@ const (
var (
// The models which will be downloaded, if no model is specified as an argument
modelNames = []string{"ggml-tiny.en", "ggml-tiny", "ggml-base.en", "ggml-base", "ggml-small.en", "ggml-small", "ggml-medium.en", "ggml-medium", "ggml-large-v1", "ggml-large-v2", "ggml-large"}
modelNames = []string{"ggml-tiny.en", "ggml-tiny", "ggml-base.en", "ggml-base", "ggml-small.en", "ggml-small", "ggml-medium.en", "ggml-medium", "ggml-large-v1", "ggml-large-v2", "ggml-large-v3", "large-v3-turbo"}
)
var (

View File

@ -68,10 +68,6 @@ func (flags *Flags) GetOut() string {
return strings.ToLower(flags.Lookup("out").Value.String())
}
func (flags *Flags) IsSpeedup() bool {
return flags.Lookup("speedup").Value.String() == "true"
}
func (flags *Flags) IsTokens() bool {
return flags.Lookup("tokens").Value.String() == "true"
}
@ -111,10 +107,6 @@ func (flags *Flags) SetParams(context whisper.Context) error {
fmt.Fprintf(flags.Output(), "Setting duration to %v\n", duration)
context.SetDuration(duration)
}
if flags.IsSpeedup() {
fmt.Fprintf(flags.Output(), "Setting speedup to true\n")
context.SetSpeedup(true)
}
if threads := flags.GetThreads(); threads != 0 {
fmt.Fprintf(flags.Output(), "Setting threads to %d\n", threads)
context.SetThreads(threads)
@ -146,7 +138,6 @@ func registerFlags(flag *Flags) {
flag.Duration("offset", 0, "Time offset")
flag.Duration("duration", 0, "Duration of audio to process")
flag.Uint("threads", 0, "Number of threads to use")
flag.Bool("speedup", false, "Enable speedup")
flag.Uint("max-len", 0, "Maximum segment length in characters")
flag.Uint("max-tokens", 0, "Maximum tokens per segment")
flag.Float64("word-thold", 0, "Maximum segment score")

View File

@ -1,10 +1,10 @@
module github.com/ggerganov/whisper.cpp/bindings/go
go 1.19
go 1.23
require (
github.com/go-audio/wav v1.1.0
github.com/stretchr/testify v1.8.1
github.com/stretchr/testify v1.9.0
)
require (

View File

@ -1,4 +1,3 @@
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/go-audio/audio v1.0.0 h1:zS9vebldgbQqktK4H0lUqWrG8P0NxCJVqcj7ZpNnwd4=
@ -9,15 +8,9 @@ github.com/go-audio/wav v1.1.0 h1:jQgLtbqBzY7G+BM8fXF7AHUk1uHUviWS4X39d5rsL2g=
github.com/go-audio/wav v1.1.0/go.mod h1:mpe9qfwbScEbkd8uybLuIpTgHyrISw/OTuvjUW2iGtE=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
github.com/stretchr/testify v1.8.1 h1:w7B6lhMri9wdJUVmEZPGGhZzrYTPvgJArz7wNPgYKsk=
github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
github.com/stretchr/testify v1.9.0 h1:HtqpIVDClZ4nwg75+f6Lvsy/wHu+3BoSGCbBAcpTsTg=
github.com/stretchr/testify v1.9.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=

View File

@ -47,10 +47,6 @@ func (p *Params) SetPrintTimestamps(v bool) {
p.print_timestamps = toBool(v)
}
func (p *Params) SetSpeedup(v bool) {
p.speed_up = toBool(v)
}
// Set language id
func (p *Params) SetLanguage(lang int) error {
if lang == -1 {
@ -123,6 +119,33 @@ func (p *Params) SetAudioCtx(n int) {
p.audio_ctx = C.int(n)
}
func (p *Params) SetMaxContext(n int) {
p.n_max_text_ctx = C.int(n)
}
func (p *Params) SetBeamSize(n int) {
p.beam_search.beam_size = C.int(n)
}
func (p *Params) SetEntropyThold(t float32) {
p.entropy_thold = C.float(t)
}
func (p *Params) SetTemperature(t float32) {
p.temperature = C.float(t)
}
// Sets the fallback temperature incrementation
// Pass -1.0 to disable this feature
func (p *Params) SetTemperatureFallback(t float32) {
p.temperature_inc = C.float(t)
}
// Set initial prompt
func (p *Params) SetInitialPrompt(prompt string) {
p.initial_prompt = C.CString(prompt)
}
///////////////////////////////////////////////////////////////////////////////
// PRIVATE METHODS
@ -147,6 +170,11 @@ func (p *Params) String() string {
str += fmt.Sprintf(" offset_ms=%d", p.offset_ms)
str += fmt.Sprintf(" duration_ms=%d", p.duration_ms)
str += fmt.Sprintf(" audio_ctx=%d", p.audio_ctx)
str += fmt.Sprintf(" initial_prompt=%s", C.GoString(p.initial_prompt))
str += fmt.Sprintf(" entropy_thold=%f", p.entropy_thold)
str += fmt.Sprintf(" temperature=%f", p.temperature)
str += fmt.Sprintf(" temperature_inc=%f", p.temperature_inc)
str += fmt.Sprintf(" beam_size=%d", p.beam_search.beam_size)
if p.translate {
str += " translate"
}
@ -171,9 +199,6 @@ func (p *Params) String() string {
if p.token_timestamps {
str += " token_timestamps"
}
if p.speed_up {
str += " speed_up"
}
return str + ">"
}

View File

@ -76,11 +76,6 @@ func (context *context) SetTranslate(v bool) {
context.params.SetTranslate(v)
}
// Set speedup flag
func (context *context) SetSpeedup(v bool) {
context.params.SetSpeedup(v)
}
func (context *context) SetSplitOnWord(v bool) {
context.params.SetSplitOnWord(v)
}
@ -130,6 +125,37 @@ func (context *context) SetAudioCtx(n uint) {
context.params.SetAudioCtx(int(n))
}
// Set maximum number of text context tokens to store
func (context *context) SetMaxContext(n int) {
context.params.SetMaxContext(n)
}
// Set Beam Size
func (context *context) SetBeamSize(n int) {
context.params.SetBeamSize(n)
}
// Set Entropy threshold
func (context *context) SetEntropyThold(t float32) {
context.params.SetEntropyThold(t)
}
// Set Temperature
func (context *context) SetTemperature(t float32) {
context.params.SetTemperature(t)
}
// Set the fallback temperature incrementation
// Pass -1.0 to disable this feature
func (context *context) SetTemperatureFallback(t float32) {
context.params.SetTemperatureFallback(t)
}
// Set initial prompt
func (context *context) SetInitialPrompt(prompt string) {
context.params.SetInitialPrompt(prompt)
}
// ResetTimings resets the mode timings. Should be called before processing
func (context *context) ResetTimings() {
context.model.ctx.Whisper_reset_timings()

View File

@ -4,52 +4,90 @@ import (
"os"
"testing"
// Packages
whisper "github.com/ggerganov/whisper.cpp/bindings/go/pkg/whisper"
"github.com/ggerganov/whisper.cpp/bindings/go/pkg/whisper"
"github.com/go-audio/wav"
assert "github.com/stretchr/testify/assert"
)
const (
ModelPath = "../../models/ggml-tiny.bin"
SamplePath = "../../samples/jfk.wav"
)
func Test_Whisper_000(t *testing.T) {
func TestSetLanguage(t *testing.T) {
assert := assert.New(t)
if _, err := os.Stat(ModelPath); os.IsNotExist(err) {
t.Skip("Skipping test, model not found:", ModelPath)
}
if _, err := os.Stat(SamplePath); os.IsNotExist(err) {
t.Skip("Skipping test, sample not found:", SamplePath)
}
// Load model
model, err := whisper.New(ModelPath)
assert.NoError(err)
assert.NotNil(model)
assert.NoError(model.Close())
t.Log("languages=", model.Languages())
}
func Test_Whisper_001(t *testing.T) {
assert := assert.New(t)
if _, err := os.Stat(ModelPath); os.IsNotExist(err) {
t.Skip("Skipping test, model not found:", ModelPath)
}
if _, err := os.Stat(SamplePath); os.IsNotExist(err) {
t.Skip("Skipping test, sample not found:", SamplePath)
}
// Load model
model, err := whisper.New(ModelPath)
assert.NoError(err)
assert.NotNil(model)
defer model.Close()
// Get context for decoding
ctx, err := model.NewContext()
context, err := model.NewContext()
assert.NoError(err)
assert.NotNil(ctx)
// This returns an error since
// the model 'models/ggml-small.en.bin'
// that is loaded is not multilingual
err = context.SetLanguage("en")
assert.Error(err)
}
func TestContextModelIsMultilingual(t *testing.T) {
assert := assert.New(t)
model, err := whisper.New(ModelPath)
assert.NoError(err)
assert.NotNil(model)
defer model.Close()
context, err := model.NewContext()
assert.NoError(err)
isMultilingual := context.IsMultilingual()
// This returns false since
// the model 'models/ggml-small.en.bin'
// that is loaded is not multilingual
assert.False(isMultilingual)
}
func TestLanguage(t *testing.T) {
assert := assert.New(t)
model, err := whisper.New(ModelPath)
assert.NoError(err)
assert.NotNil(model)
defer model.Close()
context, err := model.NewContext()
assert.NoError(err)
// This always returns en since
// the model 'models/ggml-small.en.bin'
// that is loaded is not multilingual
expectedLanguage := "en"
actualLanguage := context.Language()
assert.Equal(expectedLanguage, actualLanguage)
}
func TestProcess(t *testing.T) {
assert := assert.New(t)
fh, err := os.Open(SamplePath)
assert.NoError(err)
defer fh.Close()
// Decode the WAV file - load the full buffer
dec := wav.NewDecoder(fh)
buf, err := dec.FullPCMBuffer()
assert.NoError(err)
assert.Equal(uint16(1), dec.NumChans)
data := buf.AsFloat32Buffer().Data
model, err := whisper.New(ModelPath)
assert.NoError(err)
assert.NotNil(model)
defer model.Close()
context, err := model.NewContext()
assert.NoError(err)
err = context.Process(data, nil, nil)
assert.NoError(err)
}

View File

@ -38,17 +38,22 @@ type Context interface {
IsMultilingual() bool // Return true if the model is multilingual.
Language() string // Get language
SetOffset(time.Duration) // Set offset
SetDuration(time.Duration) // Set duration
SetThreads(uint) // Set number of threads to use
SetSpeedup(bool) // Set speedup flag
SetSplitOnWord(bool) // Set split on word flag
SetTokenThreshold(float32) // Set timestamp token probability threshold
SetTokenSumThreshold(float32) // Set timestamp token sum probability threshold
SetMaxSegmentLength(uint) // Set max segment length in characters
SetTokenTimestamps(bool) // Set token timestamps flag
SetMaxTokensPerSegment(uint) // Set max tokens per segment (0 = no limit)
SetAudioCtx(uint) // Set audio encoder context
SetOffset(time.Duration) // Set offset
SetDuration(time.Duration) // Set duration
SetThreads(uint) // Set number of threads to use
SetSplitOnWord(bool) // Set split on word flag
SetTokenThreshold(float32) // Set timestamp token probability threshold
SetTokenSumThreshold(float32) // Set timestamp token sum probability threshold
SetMaxSegmentLength(uint) // Set max segment length in characters
SetTokenTimestamps(bool) // Set token timestamps flag
SetMaxTokensPerSegment(uint) // Set max tokens per segment (0 = no limit)
SetAudioCtx(uint) // Set audio encoder context
SetMaxContext(n int) // Set maximum number of text context tokens to store
SetBeamSize(n int) // Set Beam Size
SetEntropyThold(t float32) // Set Entropy threshold
SetInitialPrompt(prompt string) // Set initial prompt
SetTemperature(t float32) // Set temperature
SetTemperatureFallback(t float32) // Set temperature incrementation
// Process mono audio data and return any errors.
// If defined, newly generated segments are passed to the

View File

@ -0,0 +1,91 @@
package whisper_test
import (
"testing"
"github.com/ggerganov/whisper.cpp/bindings/go/pkg/whisper"
assert "github.com/stretchr/testify/assert"
)
func TestNew(t *testing.T) {
assert := assert.New(t)
t.Run("valid model path", func(t *testing.T) {
model, err := whisper.New(ModelPath)
assert.NoError(err)
assert.NotNil(model)
defer model.Close()
})
t.Run("invalid model path", func(t *testing.T) {
invalidModelPath := "invalid-model-path.bin"
model, err := whisper.New(invalidModelPath)
assert.Error(err)
assert.Nil(model)
})
}
func TestClose(t *testing.T) {
assert := assert.New(t)
model, err := whisper.New(ModelPath)
assert.NoError(err)
assert.NotNil(model)
err = model.Close()
assert.NoError(err)
}
func TestNewContext(t *testing.T) {
assert := assert.New(t)
model, err := whisper.New(ModelPath)
assert.NoError(err)
assert.NotNil(model)
defer model.Close()
context, err := model.NewContext()
assert.NoError(err)
assert.NotNil(context)
}
func TestIsMultilingual(t *testing.T) {
assert := assert.New(t)
model, err := whisper.New(ModelPath)
assert.NoError(err)
assert.NotNil(model)
defer model.Close()
isMultilingual := model.IsMultilingual()
// This returns false since
// the model 'models/ggml-small.en.bin'
// that is loaded is not multilingual
assert.False(isMultilingual)
}
func TestLanguages(t *testing.T) {
assert := assert.New(t)
model, err := whisper.New(ModelPath)
assert.NoError(err)
assert.NotNil(model)
defer model.Close()
expectedLanguages := []string{
"en", "zh", "de", "es", "ru", "ko", "fr", "ja", "pt", "tr", "pl",
"ca", "nl", "ar", "sv", "it", "id", "hi", "fi", "vi", "he", "uk",
"el", "ms", "cs", "ro", "da", "hu", "ta", "no", "th", "ur", "hr",
"bg", "lt", "la", "mi", "ml", "cy", "sk", "te", "fa", "lv", "bn",
"sr", "az", "sl", "kn", "et", "mk", "br", "eu", "is", "hy", "ne",
"mn", "bs", "kk", "sq", "sw", "gl", "mr", "pa", "si", "km", "sn",
"yo", "so", "af", "oc", "ka", "be", "tg", "sd", "gu", "am", "yi",
"lo", "uz", "fo", "ht", "ps", "tk", "nn", "mt", "sa", "lb", "my",
"bo", "tl", "mg", "as", "tt", "haw", "ln", "ha", "ba", "jw", "su",
}
actualLanguages := model.Languages()
assert.Equal(expectedLanguages, actualLanguages)
}

View File

@ -0,0 +1,6 @@
package whisper_test
const (
ModelPath = "../../models/ggml-small.en.bin"
SamplePath = "../../samples/jfk.wav"
)

View File

@ -9,8 +9,8 @@ import (
// CGO
/*
#cgo LDFLAGS: -lwhisper -lm -lstdc++
#cgo darwin LDFLAGS: -framework Accelerate
#cgo LDFLAGS: -lwhisper -lm -lstdc++ -fopenmp
#cgo darwin LDFLAGS: -framework Accelerate -framework Metal -framework Foundation -framework CoreGraphics
#include <whisper.h>
#include <stdlib.h>

Submodule bindings/ios deleted from 44b39fd4ec

View File

@ -9,6 +9,7 @@ archivesBaseName = 'whispercpp'
group = 'io.github.ggerganov'
version = '1.4.0'
sourceCompatibility = 1.8
targetCompatibility = 1.8

View File

@ -2,6 +2,7 @@ package io.github.ggerganov.whispercpp;
import com.sun.jna.Native;
import com.sun.jna.Pointer;
import io.github.ggerganov.whispercpp.bean.WhisperSegment;
import io.github.ggerganov.whispercpp.params.WhisperContextParams;
import io.github.ggerganov.whispercpp.params.WhisperFullParams;
import io.github.ggerganov.whispercpp.params.WhisperSamplingStrategy;
@ -9,6 +10,8 @@ import io.github.ggerganov.whispercpp.params.WhisperSamplingStrategy;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
/**
* Before calling most methods, you must call `initContext(modelPath)` to initialise the `ctx` Pointer.
@ -160,6 +163,28 @@ public class WhisperCpp implements AutoCloseable {
return str.toString().trim();
}
public List<WhisperSegment> fullTranscribeWithTime(WhisperFullParams whisperParams, float[] audioData) throws IOException {
if (ctx == null) {
throw new IllegalStateException("Model not initialised");
}
if (lib.whisper_full(ctx, whisperParams, audioData, audioData.length) != 0) {
throw new IOException("Failed to process audio");
}
int nSegments = lib.whisper_full_n_segments(ctx);
List<WhisperSegment> segments= new ArrayList<>(nSegments);
for (int i = 0; i < nSegments; i++) {
long t0 = lib.whisper_full_get_segment_t0(ctx, i);
String text = lib.whisper_full_get_segment_text(ctx, i);
long t1 = lib.whisper_full_get_segment_t1(ctx, i);
segments.add(new WhisperSegment(t0,t1,text));
}
return segments;
}
// public int getTextSegmentCount(Pointer ctx) {
// return lib.whisper_full_n_segments(ctx);

View File

@ -20,7 +20,7 @@ public interface WhisperCppJnaLibrary extends Library {
* @return Whisper context on success, null on failure
*/
Pointer whisper_init_from_file(String path_model);
/**
* Provides default params which can be used with `whisper_init_from_file_with_params()` etc.
* Because this function allocates memory for the params, the caller must call either:
@ -304,14 +304,6 @@ public interface WhisperCppJnaLibrary extends Library {
/** Language id associated with the provided state */
int whisper_full_lang_id_from_state(Pointer state);
/**
* Convert RAW PCM audio to log mel spectrogram but applies a Phase Vocoder to speed up the audio x2.
* The resulting spectrogram is stored inside the default state of the provided whisper context.
* @return 0 on success
*/
int whisper_pcm_to_mel_phase_vocoder(Pointer ctx, final float[] samples, int n_samples, int n_threads);
int whisper_pcm_to_mel_phase_vocoder_with_state(Pointer ctx, Pointer state, final float[] samples, int n_samples, int n_threads);
/** Get the start time of the specified segment. */
long whisper_full_get_segment_t0(Pointer ctx, int i_segment);

View File

@ -0,0 +1,47 @@
package io.github.ggerganov.whispercpp.bean;
/**
* Created by litonglinux@qq.com on 10/21/2023_7:48 AM
*/
public class WhisperSegment {
private long start, end;
private String sentence;
public WhisperSegment() {
}
public WhisperSegment(long start, long end, String sentence) {
this.start = start;
this.end = end;
this.sentence = sentence;
}
public long getStart() {
return start;
}
public long getEnd() {
return end;
}
public String getSentence() {
return sentence;
}
public void setStart(long start) {
this.start = start;
}
public void setEnd(long end) {
this.end = end;
}
public void setSentence(String sentence) {
this.sentence = sentence;
}
@Override
public String toString() {
return "[" + start + " --> " + end + "]:" + sentence;
}
}

View File

@ -58,6 +58,9 @@ public class WhisperFullParams extends Structure {
no_context = enable ? CBool.FALSE : CBool.TRUE;
}
/** Generate timestamps or not? */
public CBool no_timestamps;
/** Flag to force single segment output (useful for streaming). (default = false) */
public CBool single_segment;
@ -126,14 +129,6 @@ public class WhisperFullParams extends Structure {
/** Maximum tokens per segment (0, default = no limit) */
public int max_tokens;
/** Flag to speed up the audio by 2x using Phase Vocoder. (default = false) */
public CBool speed_up;
/** Flag to speed up the audio by 2x using Phase Vocoder. (default = false) */
public void speedUp(boolean enable) {
speed_up = enable ? CBool.TRUE : CBool.FALSE;
}
/** Overwrite the audio context size (0 = use default). */
public int audio_ctx;
@ -145,6 +140,9 @@ public class WhisperFullParams extends Structure {
tdrz_enable = enable ? CBool.TRUE : CBool.FALSE;
}
/** Regular expression matching tokens to suppress. */
public String suppress_regex;
/** Tokens to provide to the whisper decoder as an initial prompt.
* These are prepended to any existing text context from a previous call. */
public String initial_prompt;
@ -304,18 +302,25 @@ public class WhisperFullParams extends Structure {
logits_filter_callback = CallbackReference.getFunctionPointer(callback);
}
/** Grammar stuff */
public Pointer grammar_rules;
public long n_grammar_rules;
public long i_start_rule;
public float grammar_penalty;
@Override
protected List<String> getFieldOrder() {
return Arrays.asList("strategy", "n_threads", "n_max_text_ctx", "offset_ms", "duration_ms", "translate",
"no_context", "single_segment",
"no_context", "single_segment", "no_timestamps",
"print_special", "print_progress", "print_realtime", "print_timestamps", "token_timestamps",
"thold_pt", "thold_ptsum", "max_len", "split_on_word", "max_tokens", "speed_up", "audio_ctx",
"tdrz_enable", "initial_prompt", "prompt_tokens", "prompt_n_tokens", "language", "detect_language",
"thold_pt", "thold_ptsum", "max_len", "split_on_word", "max_tokens", "audio_ctx",
"tdrz_enable", "suppress_regex", "initial_prompt", "prompt_tokens", "prompt_n_tokens", "language", "detect_language",
"suppress_blank", "suppress_non_speech_tokens", "temperature", "max_initial_ts", "length_penalty",
"temperature_inc", "entropy_thold", "logprob_thold", "no_speech_thold", "greedy", "beam_search",
"new_segment_callback", "new_segment_callback_user_data",
"progress_callback", "progress_callback_user_data",
"encoder_begin_callback", "encoder_begin_callback_user_data",
"logits_filter_callback", "logits_filter_callback_user_data");
"logits_filter_callback", "logits_filter_callback_user_data",
"grammar_rules", "n_grammar_rules", "i_start_rule", "grammar_penalty");
}
}

View File

@ -2,6 +2,7 @@ package io.github.ggerganov.whispercpp;
import static org.junit.jupiter.api.Assertions.*;
import io.github.ggerganov.whispercpp.bean.WhisperSegment;
import io.github.ggerganov.whispercpp.params.CBool;
import io.github.ggerganov.whispercpp.params.WhisperFullParams;
import io.github.ggerganov.whispercpp.params.WhisperSamplingStrategy;
@ -11,6 +12,7 @@ import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.List;
class WhisperCppTest {
private static WhisperCpp whisper = new WhisperCpp();
@ -20,11 +22,12 @@ class WhisperCppTest {
static void init() throws FileNotFoundException {
// By default, models are loaded from ~/.cache/whisper/ and are usually named "ggml-${name}.bin"
// or you can provide the absolute path to the model file.
//String modelName = "../../models/ggml-tiny.bin";
String modelName = "../../models/ggml-tiny.en.bin";
try {
whisper.initContext(modelName);
// whisper.getFullDefaultParams(WhisperSamplingStrategy.WHISPER_SAMPLING_GREEDY);
// whisper.getJavaDefaultParams(WhisperSamplingStrategy.WHISPER_SAMPLING_BEAM_SEARCH);
//whisper.getFullDefaultParams(WhisperSamplingStrategy.WHISPER_SAMPLING_GREEDY);
//whisper.getJavaDefaultParams(WhisperSamplingStrategy.WHISPER_SAMPLING_BEAM_SEARCH);
modelInitialised = true;
} catch (FileNotFoundException ex) {
System.out.println("Model " + modelName + " not found");
@ -42,7 +45,7 @@ class WhisperCppTest {
assertEquals(16384, params.n_max_text_ctx);
assertFalse(params.translate);
assertEquals(0.01f, params.thold_pt);
assertEquals(2, params.beam_search.beam_size);
assertEquals(5, params.beam_search.beam_size);
assertEquals(-1.0f, params.beam_search.patience);
}
@ -55,7 +58,7 @@ class WhisperCppTest {
assertEquals(WhisperSamplingStrategy.WHISPER_SAMPLING_GREEDY.ordinal(), params.strategy);
assertNotEquals(0, params.n_threads);
assertEquals(16384, params.n_max_text_ctx);
assertEquals(2, params.greedy.best_of);
assertEquals(5, params.greedy.best_of);
}
@Test
@ -72,11 +75,11 @@ class WhisperCppTest {
byte[] b = new byte[audioInputStream.available()];
float[] floats = new float[b.length / 2];
// WhisperFullParams params = whisper.getFullDefaultParams(WhisperSamplingStrategy.WHISPER_SAMPLING_GREEDY);
//WhisperFullParams params = whisper.getFullDefaultParams(WhisperSamplingStrategy.WHISPER_SAMPLING_GREEDY);
WhisperFullParams params = whisper.getFullDefaultParams(WhisperSamplingStrategy.WHISPER_SAMPLING_BEAM_SEARCH);
params.setProgressCallback((ctx, state, progress, user_data) -> System.out.println("progress: " + progress));
params.print_progress = CBool.FALSE;
// params.initial_prompt = "and so my fellow Americans um, like";
//params.initial_prompt = "and so my fellow Americans um, like";
try {
@ -99,4 +102,43 @@ class WhisperCppTest {
audioInputStream.close();
}
}
@Test
void testFullTranscribeWithTime() throws Exception {
if (!modelInitialised) {
System.out.println("Model not initialised, skipping test");
return;
}
// Given
File file = new File(System.getProperty("user.dir"), "../../samples/jfk.wav");
AudioInputStream audioInputStream = AudioSystem.getAudioInputStream(file);
byte[] b = new byte[audioInputStream.available()];
float[] floats = new float[b.length / 2];
//WhisperFullParams params = whisper.getFullDefaultParams(WhisperSamplingStrategy.WHISPER_SAMPLING_GREEDY);
WhisperFullParams params = whisper.getFullDefaultParams(WhisperSamplingStrategy.WHISPER_SAMPLING_BEAM_SEARCH);
params.setProgressCallback((ctx, state, progress, user_data) -> System.out.println("progress: " + progress));
params.print_progress = CBool.FALSE;
//params.initial_prompt = "and so my fellow Americans um, like";
try {
audioInputStream.read(b);
for (int i = 0, j = 0; i < b.length; i += 2, j++) {
int intSample = (int) (b[i + 1]) << 8 | (int) (b[i]) & 0xFF;
floats[j] = intSample / 32767.0f;
}
List<WhisperSegment> segments = whisper.fullTranscribeWithTime(params, floats);
assertTrue(segments.size() > 0, "The size of segments should be greater than 0");
for (WhisperSegment segment : segments) {
System.out.println(segment);
}
} finally {
audioInputStream.close();
}
}
}

View File

@ -41,7 +41,7 @@ make publish-npm
## Sample run
```java
```text
$ node --experimental-wasm-threads --experimental-wasm-simd ../tests/test-whisper.js
whisper_model_load: loading model from 'whisper.bin'
@ -63,7 +63,7 @@ whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 1 | BLAS = 0 |
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 1 | BLAS = 0 |
operator(): processing 176000 samples, 11.0 sec, 8 threads, 1 processors, lang = en, task = transcribe ...

View File

@ -1,6 +1,6 @@
{
"name": "whisper.cpp",
"version": "1.4.2",
"version": "1.7.2",
"description": "Whisper speech recognition",
"main": "whisper.js",
"scripts": {

File diff suppressed because one or more lines are too long

3
bindings/ruby/.gitignore vendored Normal file
View File

@ -0,0 +1,3 @@
LICENSE
pkg/
lib/whisper.*

169
bindings/ruby/README.md Normal file
View File

@ -0,0 +1,169 @@
whispercpp
==========
![whisper.cpp](https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg)
Ruby bindings for [whisper.cpp][], an interface of automatic speech recognition model.
Installation
------------
Install the gem and add to the application's Gemfile by executing:
$ bundle add whispercpp
If bundler is not being used to manage dependencies, install the gem by executing:
$ gem install whispercpp
Usage
-----
```ruby
require "whisper"
whisper = Whisper::Context.new("path/to/model.bin")
params = Whisper::Params.new
params.language = "en"
params.offset = 10_000
params.duration = 60_000
params.max_text_tokens = 300
params.translate = true
params.print_timestamps = false
params.initial_prompt = "Initial prompt here."
whisper.transcribe("path/to/audio.wav", params) do |whole_text|
puts whole_text
end
```
### Preparing model ###
Use script to download model file(s):
```bash
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
sh ./models/download-ggml-model.sh base.en
```
There are some types of models. See [models][] page for details.
### Preparing audio file ###
Currently, whisper.cpp accepts only 16-bit WAV files.
### API ###
Once `Whisper::Context#transcribe` called, you can retrieve segments by `#each_segment`:
```ruby
def format_time(time_ms)
sec, decimal_part = time_ms.divmod(1000)
min, sec = sec.divmod(60)
hour, min = min.divmod(60)
"%02d:%02d:%02d.%03d" % [hour, min, sec, decimal_part]
end
whisper.transcribe("path/to/audio.wav", params)
whisper.each_segment.with_index do |segment, index|
line = "[%{nth}: %{st} --> %{ed}] %{text}" % {
nth: index + 1,
st: format_time(segment.start_time),
ed: format_time(segment.end_time),
text: segment.text
}
line << " (speaker turned)" if segment.speaker_next_turn?
puts line
end
```
You can also add hook to params called on new segment:
```ruby
def format_time(time_ms)
sec, decimal_part = time_ms.divmod(1000)
min, sec = sec.divmod(60)
hour, min = min.divmod(60)
"%02d:%02d:%02d.%03d" % [hour, min, sec, decimal_part]
end
# Add hook before calling #transcribe
params.on_new_segment do |segment|
line = "[%{st} --> %{ed}] %{text}" % {
st: format_time(segment.start_time),
ed: format_time(segment.end_time),
text: segment.text
}
line << " (speaker turned)" if segment.speaker_next_turn?
puts line
end
whisper.transcribe("path/to/audio.wav", params)
```
You can see model information:
```ruby
whisper = Whisper::Context.new("path/to/model.bin")
model = whisper.model
model.n_vocab # => 51864
model.n_audio_ctx # => 1500
model.n_audio_state # => 512
model.n_audio_head # => 8
model.n_audio_layer # => 6
model.n_text_ctx # => 448
model.n_text_state # => 512
model.n_text_head # => 8
model.n_text_layer # => 6
model.n_mels # => 80
model.ftype # => 1
model.type # => "base"
```
You can set log callback:
```ruby
prefix = "[MyApp] "
log_callback = ->(level, buffer, user_data) {
case level
when Whisper::LOG_LEVEL_NONE
puts "#{user_data}none: #{buffer}"
when Whisper::LOG_LEVEL_INFO
puts "#{user_data}info: #{buffer}"
when Whisper::LOG_LEVEL_WARN
puts "#{user_data}warn: #{buffer}"
when Whisper::LOG_LEVEL_ERROR
puts "#{user_data}error: #{buffer}"
when Whisper::LOG_LEVEL_DEBUG
puts "#{user_data}debug: #{buffer}"
when Whisper::LOG_LEVEL_CONT
puts "#{user_data}same to previous: #{buffer}"
end
}
Whisper.log_set log_callback, prefix
```
Using this feature, you are also able to suppress log:
```ruby
Whisper.log_set ->(level, buffer, user_data) {
# do nothing
}, nil
Whisper::Context.new(MODEL)
```
License
-------
The same to [whisper.cpp][].
[whisper.cpp]: https://github.com/ggerganov/whisper.cpp
[models]: https://github.com/ggerganov/whisper.cpp/tree/master/models

68
bindings/ruby/Rakefile Normal file
View File

@ -0,0 +1,68 @@
require 'rake/clean'
require "bundler/gem_tasks"
require "pathname"
require "yaml"
require "rake/testtask"
extsources = YAML.load_file("extsources.yaml")
SOURCES = FileList[]
extsources.each do |src|
basename = src.pathmap("%f")
dest = basename == "LICENSE" ? basename : basename.pathmap("ext/%f")
file src
file dest => src do |t|
cp t.source, t.name
end
SOURCES.include dest
end
CLEAN.include SOURCES
CLEAN.include FileList[
"ext/*.o",
"ext/*.metal",
"ext/whisper.{so,bundle,dll}",
"ext/depend"
]
task build: FileList[
"ext/Makefile",
"ext/ruby_whisper.h",
"ext/ruby_whisper.cpp",
"whispercpp.gemspec",
]
directory "pkg"
CLOBBER.include "pkg"
TEST_MODEL = "../../models/ggml-base.en.bin"
LIB_NAME = "whisper".ext(RbConfig::CONFIG["DLEXT"])
SO_FILE = File.join("ext", LIB_NAME)
LIB_FILE = File.join("lib", LIB_NAME)
file "ext/Makefile" => ["ext/extconf.rb", "ext/ruby_whisper.h", "ext/ruby_whisper.cpp"] + SOURCES do |t|
Dir.chdir "ext" do
ruby "extconf.rb"
end
end
file SO_FILE => "ext/Makefile" do |t|
Dir.chdir "ext" do
sh "make"
end
end
CLEAN.include LIB_FILE
directory "lib"
file LIB_FILE => [SO_FILE, "lib"] do |t|
copy t.source, t.name
end
Rake::TestTask.new do |t|
t.test_files = FileList["tests/test_*.rb"]
end
task test: [TEST_MODEL, LIB_FILE]
file TEST_MODEL do
Dir.chdir "../.." do
sh "./models/download-ggml-model.sh base.en"
end
end

View File

@ -3,7 +3,33 @@ ggml.c
ggml.h
ggml-alloc.c
ggml-alloc.h
whisper.bundle
ggml-aarch64.c
ggml-aarch64.h
ggml-backend.cpp
ggml-backend-impl.h
ggml-backend.c
ggml-backend.h
ggml-common.h
ggml-cpu-impl.h
ggml-metal.m
ggml-metal.metal
ggml-metal-embed.metal
ggml-blas.cpp
ggml-cuda.h
ggml-impl.h
ggml-kompute.h
ggml-metal.h
ggml-opencl.h
ggml-quants.c
ggml-quants.h
ggml-sycl.h
ggml-vulkan.h
ggml-blas.h
get-flags.mk
whisper.cpp
whisper.h
dr_wav.h
depend
whisper.bundle
whisper.so
whisper.dll

View File

@ -1,21 +1,10 @@
require 'mkmf'
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','whisper.cpp')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','whisper.h')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','ggml.h')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','ggml.c')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','ggml-impl.h')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','ggml-alloc.h')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','ggml-alloc.c')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','ggml-backend-impl.h')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','ggml-backend.h')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','ggml-backend.c')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','ggml-quants.h')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','ggml-quants.c')} .")
system("cp #{File.join(File.dirname(__FILE__),'..','..','..','examples','dr_wav.h')} .")
# need to use c++ compiler flags
$CXXFLAGS << ' -std=c++11'
$LDFLAGS << ' -lstdc++'
# Set to true when building binary gems
if enable_config('static-stdlib', false)
$LDFLAGS << ' -static-libgcc -static-libstdc++'
@ -26,4 +15,180 @@ if enable_config('march-tune-native', false)
$CXXFLAGS << ' -march=native -mtune=native'
end
if ENV['WHISPER_METAL']
$GGML_METAL ||= true
$DEPRECATE_WARNING ||= true
end
$UNAME_S = `uname -s`.chomp
$UNAME_P = `uname -p`.chomp
$UNAME_M = `uname -m`.chomp
if $UNAME_S == 'Darwin'
unless ENV['GGML_NO_METAL']
$GGML_METAL ||= true
end
$GGML_NO_OPENMP ||= true
end
if $GGML_METAL
$GGML_METAL_EMBED_LIBRARY = true
end
$MK_CPPFLAGS = ''
$MK_CFLAGS = '-std=c11 -fPIC'
$MK_CXXFLAGS = '-std=c++11 -fPIC'
$MK_NVCCFLAGS = '-std=c++11'
$MK_LDFLAGS = ''
$OBJ_GGML = []
$OBJ_WHISPER = []
$OBJ_COMMON = []
$OBJ_SDL = []
$MK_CPPFLAGS << ' -D_XOPEN_SOURCE=600'
if $UNAME_S == 'Linux'
$MK_CPPFLAGS << ' -D_GNU_SOURCE'
end
if $UNAME_S == 'Darwin'
$MK_CPPFLAGS << ' -D_DARWIN_C_SOURCE'
end
if ENV['WHISPER_DEBUG']
$MK_CFLAGS << ' -O0 -g'
$MK_CXXFLAGS << ' -O0 -g'
$MK_LDFLAGS << ' -g'
$MK_NVCCFLAGS << ' -O0 -g'
else
$MK_CPPFLAGS << ' -DNDEBUG'
$MK_CFLAGS << ' -O3'
$MK_CXXFLAGS << ' -O3'
$MK_NVCCFLAGS << ' -O3'
end
$WARN_FLAGS =
' -Wall' <<
' -Wextra' <<
' -Wpedantic' <<
' -Wcast-qual' <<
' -Wno-unused-function'
$MK_CFLAGS <<
$WARN_FLAGS <<
' -Wshadow' <<
' -Wstrict-prototypes' <<
' -Wpointer-arith' <<
' -Wmissing-prototypes' <<
' -Werror=implicit-int' <<
' -Werror=implicit-function-declaration'
$MK_CXXFLAGS <<
$WARN_FLAGS <<
' -Wmissing-declarations' <<
' -Wmissing-noreturn'
unless `#{cc_command} #{$LDFLAGS} -Wl,-v 2>&1`.chomp.include? 'dyld-1015.7'
$MK_CPPFLAGS << ' -DHAVE_BUGGY_APPLE_LINKER'
end
if %w[Linux Darwin FreeBSD NetBSD OpenBSD Haiku].include? $UNAME_S
$MK_CFLAGS << ' -pthread'
$MK_CXXFLAGS << ' -pthread'
end
unless $_WIN32
$DSO_EXT = '.so'
else
$DSO_EXT = '.dll'
end
unless ENV['RISCV']
if %w[x86_64 i686 amd64].include? $UNAME_M
$HOST_CXXFLAGS ||= ''
$MK_CFLAGS << ' -march=native -mtune=native'
$HOST_CXXFLAGS << ' -march=native -mtune=native'
end
if $UNAME_M.match? /aarch64.*/
$MK_CFLAGS << ' -mcpu=native'
$MK_CXXFLAGS << ' -mcpu=native'
end
else
$MK_CFLAGS << ' -march=rv64gcv -mabi=lp64d'
$MK_CXXFLAGS << ' -march=rv64gcv -mabi=lp64d'
end
unless ENV['GGML_NO_ACCELERATE']
if $UNAME_S == 'Darwin'
$MK_CPPFLAGS << ' -DGGML_USE_ACCELERATE -DGGML_USE_BLAS'
$MK_CPPFLAGS << ' -DACCELERATE_NEW_LAPACK'
$MK_CPPFLAGS << ' -DACCELERATE_LAPACK_ILP64'
$MK_LDFLAGS << ' -framework Accelerate'
$OBJ_GGML << 'ggml-blas.o'
end
end
if ENV['GGML_OPENBLAS']
$MK_CPPFLAGS << " -DGGML_USE_BLAS #{`pkg-config --cflags-only-I openblas`.chomp}"
$MK_CFLAGS << " #{`pkg-config --cflags-only-other openblas)`.chomp}"
$MK_LDFLAGS << " #{`pkg-config --libs openblas`}"
$OBJ_GGML << 'ggml-blas.o'
end
if ENV['GGML_OPENBLAS64']
$MK_CPPFLAGS << " -DGGML_USE_BLAS #{`pkg-config --cflags-only-I openblas64`.chomp}"
$MK_CFLAGS << " #{`pkg-config --cflags-only-other openblas64)`.chomp}"
$MK_LDFLAGS << " #{`pkg-config --libs openblas64`}"
$OBJ_GGML << 'ggml-blas.o'
end
if $GGML_METAL
$MK_CPPFLAGS << ' -DGGML_USE_METAL'
$MK_LDFLAGS << ' -framework Foundation -framework Metal -framework MetalKit'
$OBJ_GGML << 'ggml-metal.o'
if ENV['GGML_METAL_NDEBUG']
$MK_CPPFLAGS << ' -DGGML_METAL_NDEBUG'
end
if $GGML_METAL_EMBED_LIBRARY
$MK_CPPFLAGS << ' -DGGML_METAL_EMBED_LIBRARY'
$OBJ_GGML << 'ggml-metal-embed.o'
end
end
$OBJ_GGML <<
'ggml.o' <<
'ggml-cpu.o' <<
'ggml-alloc.o' <<
'ggml-backend.o' <<
'ggml-quants.o' <<
'ggml-aarch64.o'
$OBJ_WHISPER <<
'whisper.o'
$objs = $OBJ_GGML + $OBJ_WHISPER + $OBJ_COMMON + $OBJ_SDL
$objs << "ruby_whisper.o"
$CPPFLAGS = "#{$MK_CPPFLAGS} #{$CPPFLAGS}"
$CFLAGS = "#{$CPPFLAGS} #{$MK_CFLAGS} #{$GF_CFLAGS} #{$CFLAGS}"
$BASE_CXXFLAGS = "#{$MK_CXXFLAGS} #{$CXXFLAGS}"
$CXXFLAGS = "#{$BASE_CXXFLAGS} #{$HOST_CXXFLAGS} #{$GF_CXXFLAGS} #{$CPPFLAGS}"
$NVCCFLAGS = "#{$MK_NVCCFLAGS} #{$NVCCFLAGS}"
$LDFLAGS = "#{$MK_LDFLAGS} #{$LDFLAGS}"
create_makefile('whisper')
File.open 'Makefile', 'a' do |file|
file.puts 'include get-flags.mk'
if $GGML_METAL
if $GGML_METAL_EMBED_LIBRARY
file.puts 'include metal-embed.mk'
end
end
end

View File

@ -1,87 +0,0 @@
#pragma once
// ggml-backend internal header
#include "ggml-backend.h"
#ifdef __cplusplus
extern "C" {
#endif
//
// Backend buffer
//
typedef void * ggml_backend_buffer_context_t;
struct ggml_backend_buffer_i {
void (*free_buffer) (ggml_backend_buffer_t buffer);
void * (*get_base) (ggml_backend_buffer_t buffer); // get base pointer
size_t (*get_alloc_size)(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor); // pre-allocation callback
void (*init_tensor) (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor); // post-allocation callback
void (*free_tensor) (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor); // pre-free callback
};
struct ggml_backend_buffer {
struct ggml_backend_buffer_i iface;
ggml_backend_t backend;
ggml_backend_buffer_context_t context;
size_t size;
};
GGML_API ggml_backend_buffer_t ggml_backend_buffer_init(
struct ggml_backend * backend,
struct ggml_backend_buffer_i iface,
ggml_backend_buffer_context_t context,
size_t size);
//
// Backend
//
typedef void * ggml_backend_context_t;
struct ggml_backend_i {
const char * (*get_name)(ggml_backend_t backend);
void (*free)(ggml_backend_t backend);
// buffer allocation
ggml_backend_buffer_t (*alloc_buffer)(ggml_backend_t backend, size_t size);
// get buffer alignment
size_t (*get_alignment)(ggml_backend_t backend);
// tensor data access
// these functions can be asynchronous, helper functions are provided for synchronous access that automatically call synchronize
void (*set_tensor_async)(ggml_backend_t backend, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
void (*get_tensor_async)(ggml_backend_t backend, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);
void (*synchronize) (ggml_backend_t backend);
// (optional) copy tensor between different backends, allow for single-copy tranfers
void (*cpy_tensor_from)(ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst);
void (*cpy_tensor_to) (ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst);
// compute graph with a plan
ggml_backend_graph_plan_t (*graph_plan_create) (ggml_backend_t backend, struct ggml_cgraph * cgraph);
void (*graph_plan_free) (ggml_backend_t backend, ggml_backend_graph_plan_t plan);
void (*graph_plan_compute)(ggml_backend_t backend, ggml_backend_graph_plan_t plan);
// compute graph without a plan
void (*graph_compute)(ggml_backend_t backend, struct ggml_cgraph * cgraph);
// check if the backend supports an operation
bool (*supports_op)(ggml_backend_t backend, const struct ggml_tensor * op);
};
struct ggml_backend {
struct ggml_backend_i iface;
ggml_backend_context_t context;
};
#ifdef __cplusplus
}
#endif

View File

@ -1,950 +0,0 @@
#include "ggml-backend-impl.h"
#include "ggml-alloc.h"
#include "ggml-impl.h"
#include <assert.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define UNUSED GGML_UNUSED
#define MAX(a, b) ((a) > (b) ? (a) : (b))
// backend buffer
ggml_backend_buffer_t ggml_backend_buffer_init(
struct ggml_backend * backend,
struct ggml_backend_buffer_i iface,
ggml_backend_buffer_context_t context,
size_t size) {
ggml_backend_buffer_t buffer = malloc(sizeof(struct ggml_backend_buffer));
GGML_ASSERT(iface.get_base != NULL);
(*buffer) = (struct ggml_backend_buffer) {
/* .interface = */ iface,
/* .backend = */ backend,
/* .context = */ context,
/* .size = */ size,
};
return buffer;
}
void ggml_backend_buffer_free(ggml_backend_buffer_t buffer) {
if (buffer == NULL) {
return;
}
if (buffer->iface.free_buffer != NULL) {
buffer->iface.free_buffer(buffer);
}
free(buffer);
}
size_t ggml_backend_buffer_get_alignment(ggml_backend_buffer_t buffer) {
return ggml_backend_get_alignment(buffer->backend);
}
size_t ggml_backend_buffer_get_size(ggml_backend_buffer_t buffer) {
return buffer->size;
}
void * ggml_backend_buffer_get_base(ggml_backend_buffer_t buffer) {
void * base = buffer->iface.get_base(buffer);
GGML_ASSERT(base != NULL && "backend buffer base cannot be NULL");
return base;
}
size_t ggml_backend_buffer_get_alloc_size(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
// get_alloc_size is optional, defaults to ggml_nbytes
if (buffer->iface.get_alloc_size) {
return buffer->iface.get_alloc_size(buffer, tensor);
}
return ggml_nbytes(tensor);
}
void ggml_backend_buffer_init_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
// init_tensor is optional
if (buffer->iface.init_tensor) {
buffer->iface.init_tensor(buffer, tensor);
}
}
void ggml_backend_buffer_free_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
// free_tensor is optional
if (buffer->iface.free_tensor) {
buffer->iface.free_tensor(buffer, tensor);
}
}
// backend
ggml_backend_t ggml_get_backend(const struct ggml_tensor * tensor) {
return tensor->buffer ? tensor->buffer->backend : NULL;
}
const char * ggml_backend_name(ggml_backend_t backend) {
if (backend == NULL) {
return "NULL";
}
return backend->iface.get_name(backend);
}
void ggml_backend_free(ggml_backend_t backend) {
if (backend == NULL) {
return;
}
backend->iface.free(backend);
}
ggml_backend_buffer_t ggml_backend_alloc_buffer(ggml_backend_t backend, size_t size) {
return backend->iface.alloc_buffer(backend, size);
}
size_t ggml_backend_get_alignment(ggml_backend_t backend) {
return backend->iface.get_alignment(backend);
}
void ggml_backend_tensor_set_async(struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
ggml_get_backend(tensor)->iface.set_tensor_async(ggml_get_backend(tensor), tensor, data, offset, size);
}
void ggml_backend_tensor_get_async(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
ggml_get_backend(tensor)->iface.get_tensor_async(ggml_get_backend(tensor), tensor, data, offset, size);
}
void ggml_backend_tensor_set(struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
ggml_backend_t backend = ggml_get_backend(tensor);
GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
GGML_ASSERT(backend != NULL && "tensor backend not set");
backend->iface.set_tensor_async(backend, tensor, data, offset, size);
backend->iface.synchronize(backend);
}
void ggml_backend_tensor_get(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
ggml_backend_t backend = ggml_get_backend(tensor);
GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
GGML_ASSERT(backend != NULL && "tensor backend not set");
backend->iface.get_tensor_async(backend, tensor, data, offset, size);
backend->iface.synchronize(backend);
}
void ggml_backend_synchronize(ggml_backend_t backend) {
backend->iface.synchronize(backend);
}
ggml_backend_graph_plan_t ggml_backend_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
return backend->iface.graph_plan_create(backend, cgraph);
}
void ggml_backend_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
backend->iface.graph_plan_free(backend, plan);
}
void ggml_backend_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
backend->iface.graph_plan_compute(backend, plan);
}
void ggml_backend_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
backend->iface.graph_compute(backend, cgraph);
}
bool ggml_backend_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
return backend->iface.supports_op(backend, op);
}
// backend copy
static bool ggml_are_same_layout(const struct ggml_tensor * a, const struct ggml_tensor * b) {
if (a->type != b->type) {
return false;
}
for (int i = 0; i < GGML_MAX_DIMS; i++) {
if (a->ne[i] != b->ne[i]) {
return false;
}
if (a->nb[i] != b->nb[i]) {
return false;
}
}
return true;
}
void ggml_backend_tensor_copy(struct ggml_tensor * src, struct ggml_tensor * dst) {
//printf("src: %s ne: [%d %d %d %d] nb: [%d %d %d %d]\n", src->name, (int)src->ne[0], (int)src->ne[1], (int)src->ne[2], (int)src->ne[3], (int)src->nb[0], (int)src->nb[1], (int)src->nb[2], (int)src->nb[3]);
//printf("dst: %s ne: [%d %d %d %d] nb: [%d %d %d %d]\n", dst->name, (int)dst->ne[0], (int)dst->ne[1], (int)dst->ne[2], (int)dst->ne[3], (int)dst->nb[0], (int)dst->nb[1], (int)dst->nb[2], (int)dst->nb[3]);
GGML_ASSERT(ggml_are_same_layout(src, dst) && "cannot copy tensors with different layouts");
// fprintf(stderr, "cpy tensor %s from %s to %s (%lu bytes)\n", src->name, ggml_backend_name(src->backend), ggml_backend_name(dst->backend), ggml_nbytes(src));
if (src == dst) {
return;
}
// TODO: allow backends to support copy to/from same backend
if (ggml_get_backend(dst)->iface.cpy_tensor_from != NULL) {
ggml_get_backend(dst)->iface.cpy_tensor_from(ggml_get_backend(dst)->context, src, dst);
} else if (ggml_get_backend(src)->iface.cpy_tensor_to != NULL) {
ggml_get_backend(src)->iface.cpy_tensor_to(ggml_get_backend(src)->context, src, dst);
} else {
// shouldn't be hit when copying from/to CPU
#ifndef NDEBUG
fprintf(stderr, "ggml_backend_tensor_copy: neither cpy_tensor_from nor cpy_tensor_to are implemented for backends %s and %s, falling back to get/set\n", ggml_backend_name(src->buffer->backend), ggml_backend_name(dst->buffer->backend));
#endif
size_t nbytes = ggml_nbytes(src);
void * data = malloc(nbytes);
ggml_backend_tensor_get(src, data, 0, nbytes);
ggml_backend_tensor_set(dst, data, 0, nbytes);
free(data);
}
}
// backend CPU
struct ggml_backend_cpu_context {
int n_threads;
void * work_data;
size_t work_size;
};
static const char * ggml_backend_cpu_name(ggml_backend_t backend) {
return "CPU";
UNUSED(backend);
}
static void ggml_backend_cpu_free(ggml_backend_t backend) {
struct ggml_backend_cpu_context * cpu_ctx = (struct ggml_backend_cpu_context *)backend->context;
free(cpu_ctx->work_data);
free(cpu_ctx);
free(backend);
}
static void * ggml_backend_cpu_buffer_get_base(ggml_backend_buffer_t buffer) {
return (void *)buffer->context;
}
static void ggml_backend_cpu_buffer_free_buffer(ggml_backend_buffer_t buffer) {
free(buffer->context);
UNUSED(buffer);
}
static struct ggml_backend_buffer_i cpu_backend_buffer_i = {
/* .free_buffer = */ ggml_backend_cpu_buffer_free_buffer,
/* .get_base = */ ggml_backend_cpu_buffer_get_base,
/* .get_alloc_size = */ NULL, // defaults to ggml_nbytes
/* .init_tensor = */ NULL, // no initialization required
/* .free_tensor = */ NULL, // no cleanup required
};
// for buffers from ptr, free is not called
static struct ggml_backend_buffer_i cpu_backend_buffer_i_from_ptr = {
/* .free_buffer = */ NULL, // ptr is not owned by the buffer, so it does not need to be freed
/* .get_base = */ ggml_backend_cpu_buffer_get_base,
/* .get_alloc_size = */ NULL, // defaults to ggml_nbytes
/* .init_tensor = */ NULL,
/* .free_tensor = */ NULL,
};
static const size_t TENSOR_ALIGNMENT = 64; // should be enough for AVX 512
static ggml_backend_buffer_t ggml_backend_cpu_alloc_buffer(ggml_backend_t backend, size_t size) {
size += TENSOR_ALIGNMENT; // malloc may return an address that is not aligned
void * data = malloc(size); // TODO: maybe use GGML_ALIGNED_MALLOC?
GGML_ASSERT(data != NULL && "failed to allocate buffer");
return ggml_backend_buffer_init(backend, cpu_backend_buffer_i, data, size);
}
static size_t ggml_backend_cpu_get_alignment(ggml_backend_t backend) {
return TENSOR_ALIGNMENT;
UNUSED(backend);
}
static void ggml_backend_cpu_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor write out of bounds");
GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
memcpy((char *)tensor->data + offset, data, size);
UNUSED(backend);
}
static void ggml_backend_cpu_get_tensor_async(ggml_backend_t backend, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor read out of bounds");
GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
memcpy(data, (const char *)tensor->data + offset, size);
UNUSED(backend);
}
static void ggml_backend_cpu_synchronize(ggml_backend_t backend) {
UNUSED(backend);
}
static void ggml_backend_cpu_cpy_tensor_from(ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst) {
ggml_backend_tensor_get(src, dst->data, 0, ggml_nbytes(src));
UNUSED(backend);
}
static void ggml_backend_cpu_cpy_tensor_to(ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst) {
ggml_backend_tensor_set(dst, src->data, 0, ggml_nbytes(src));
UNUSED(backend);
}
struct ggml_backend_plan_cpu {
struct ggml_cplan cplan;
struct ggml_cgraph cgraph;
};
static ggml_backend_graph_plan_t ggml_backend_cpu_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
struct ggml_backend_cpu_context * cpu_ctx = (struct ggml_backend_cpu_context *)backend->context;
struct ggml_backend_plan_cpu * cpu_plan = malloc(sizeof(struct ggml_backend_plan_cpu));
cpu_plan->cplan = ggml_graph_plan(cgraph, cpu_ctx->n_threads);
cpu_plan->cgraph = *cgraph;
if (cpu_plan->cplan.work_size > 0) {
cpu_plan->cplan.work_data = malloc(cpu_plan->cplan.work_size);
}
return cpu_plan;
}
static void ggml_backend_cpu_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
struct ggml_backend_plan_cpu * cpu_plan = (struct ggml_backend_plan_cpu *)plan;
free(cpu_plan->cplan.work_data);
free(cpu_plan);
UNUSED(backend);
}
static void ggml_backend_cpu_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
struct ggml_backend_plan_cpu * cpu_plan = (struct ggml_backend_plan_cpu *)plan;
ggml_graph_compute(&cpu_plan->cgraph, &cpu_plan->cplan);
UNUSED(backend);
}
static void ggml_backend_cpu_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
struct ggml_backend_cpu_context * cpu_ctx = (struct ggml_backend_cpu_context *)backend->context;
struct ggml_cplan cplan = ggml_graph_plan(cgraph, cpu_ctx->n_threads);
if (cpu_ctx->work_size < cplan.work_size) {
// TODO: may be faster to free and use malloc to avoid the copy
cpu_ctx->work_data = realloc(cpu_ctx->work_data, cplan.work_size);
cpu_ctx->work_size = cplan.work_size;
}
cplan.work_data = cpu_ctx->work_data;
ggml_graph_compute(cgraph, &cplan);
}
static bool ggml_backend_cpu_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
return true;
UNUSED(backend);
UNUSED(op);
}
static struct ggml_backend_i cpu_backend_i = {
/* .get_name = */ ggml_backend_cpu_name,
/* .free = */ ggml_backend_cpu_free,
/* .alloc_buffer = */ ggml_backend_cpu_alloc_buffer,
/* .get_alignment = */ ggml_backend_cpu_get_alignment,
/* .set_tensor_async = */ ggml_backend_cpu_set_tensor_async,
/* .get_tensor_async = */ ggml_backend_cpu_get_tensor_async,
/* .synchronize = */ ggml_backend_cpu_synchronize,
/* .cpy_tensor_from = */ ggml_backend_cpu_cpy_tensor_from,
/* .cpy_tensor_to = */ ggml_backend_cpu_cpy_tensor_to,
/* .graph_plan_create = */ ggml_backend_cpu_graph_plan_create,
/* .graph_plan_free = */ ggml_backend_cpu_graph_plan_free,
/* .graph_plan_compute = */ ggml_backend_cpu_graph_plan_compute,
/* .graph_compute = */ ggml_backend_cpu_graph_compute,
/* .supports_op = */ ggml_backend_cpu_supports_op,
};
ggml_backend_t ggml_backend_cpu_init(void) {
struct ggml_backend_cpu_context * ctx = malloc(sizeof(struct ggml_backend_cpu_context));
ctx->n_threads = GGML_DEFAULT_N_THREADS;
ctx->work_data = NULL;
ctx->work_size = 0;
ggml_backend_t cpu_backend = malloc(sizeof(struct ggml_backend));
*cpu_backend = (struct ggml_backend) {
/* .interface = */ cpu_backend_i,
/* .context = */ ctx
};
return cpu_backend;
}
bool ggml_backend_is_cpu(ggml_backend_t backend) {
return backend->iface.get_name == ggml_backend_cpu_name;
}
void ggml_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads) {
GGML_ASSERT(ggml_backend_is_cpu(backend_cpu));
struct ggml_backend_cpu_context * ctx = (struct ggml_backend_cpu_context *)backend_cpu->context;
ctx->n_threads = n_threads;
}
ggml_backend_buffer_t ggml_backend_cpu_buffer_from_ptr(ggml_backend_t backend_cpu, void * ptr, size_t size) {
return ggml_backend_buffer_init(backend_cpu, cpu_backend_buffer_i_from_ptr, ptr, size);
}
// scheduler
#define GGML_MAX_BACKENDS 4
#define GGML_MAX_SPLITS 256
#define GGML_MAX_SPLIT_INPUTS 16
struct ggml_backend_sched_split {
ggml_tallocr_t tallocr;
int i_start;
int i_end;
struct ggml_tensor * inputs[GGML_MAX_SPLIT_INPUTS];
int n_inputs;
struct ggml_cgraph * graph;
};
struct ggml_backend_sched {
int n_backends;
ggml_backend_t backends[GGML_MAX_BACKENDS];
ggml_tallocr_t tallocs[GGML_MAX_BACKENDS];
ggml_gallocr_t galloc;
struct ggml_hash_set hash_set;
ggml_tallocr_t * node_talloc; // [hash_set.size]
struct ggml_tensor * (* node_copies)[GGML_MAX_BACKENDS]; // [hash_set.size][GGML_MAX_BACKENDS]
struct ggml_cgraph * graph;
struct ggml_backend_sched_split splits[GGML_MAX_SPLITS];
int n_splits;
struct ggml_context * ctx;
// align context_buffer to GGML_MEM_ALIGN
#ifdef _MSC_VER
__declspec(align(GGML_MEM_ALIGN))
#else
__attribute__((aligned(GGML_MEM_ALIGN)))
#endif
char context_buffer[GGML_MAX_SPLITS*GGML_MAX_SPLIT_INPUTS*sizeof(struct ggml_tensor) + GGML_MAX_SPLITS*sizeof(struct ggml_cgraph)];
};
#define hash_id(node) ggml_hash_find_or_insert(sched->hash_set, node)
#define node_allocr(node) sched->node_talloc[hash_id(node)]
static bool ggml_is_view_op(enum ggml_op op) {
return op == GGML_OP_VIEW || op == GGML_OP_RESHAPE || op == GGML_OP_PERMUTE || op == GGML_OP_TRANSPOSE;
}
// returns the priority of the backend, lower is better
static int sched_backend_prio(ggml_backend_sched_t sched, ggml_backend_t backend) {
for (int i = 0; i < sched->n_backends; i++) {
if (sched->backends[i] == backend) {
return i;
}
}
return INT_MAX;
}
static int sched_allocr_prio(ggml_backend_sched_t sched, ggml_tallocr_t allocr) {
for (int i = 0; i < sched->n_backends; i++) {
if (sched->tallocs[i] == allocr) {
return i;
}
}
return INT_MAX;
}
// returns the backend that should be used for the node based on the current locations
char causes[GGML_DEFAULT_GRAPH_SIZE*4 + GGML_MAX_SPLITS*GGML_MAX_SPLIT_INPUTS][128]; // debug, remove
static ggml_backend_t sched_backend_from_cur(ggml_backend_sched_t sched, struct ggml_tensor * node) {
// if the dst tensor is already allocated in a buffer, we must assume that it is critical to keep it there
// ie. kv cache updates
// note that this doesn't allow fallback to CPU. need to add output tensors to the splits to copy the data back to the original backend.
// dst
ggml_backend_t cur_backend = ggml_get_backend(node);
if (cur_backend != NULL) {
sprintf(causes[hash_id(node)], "1.dst");
return cur_backend;
}
// view_src
if (node->view_src != NULL && ggml_get_backend(node->view_src) != NULL) {
sprintf(causes[hash_id(node)], "1.vsrc");
return ggml_get_backend(node->view_src);
}
// src
int cur_prio = INT_MAX;
size_t cur_size = 0;
for (int i = 0; i < GGML_MAX_SRC; i++) {
const struct ggml_tensor * src = node->src[i];
if (src == NULL) {
break;
}
ggml_backend_t src_backend = ggml_get_backend(src);
if (src_backend != NULL) {
int src_prio = sched_backend_prio(sched, src_backend);
size_t src_size = ggml_nbytes(src);
if (src_prio < cur_prio && src_size >= cur_size) {
cur_prio = src_prio;
cur_size = src_size;
cur_backend = src_backend;
sprintf(causes[hash_id(node)], "1.src%d", i);
}
}
}
return cur_backend;
}
static char * fmt_size(size_t size) {
static char buffer[128];
if (size >= 1024*1024) {
sprintf(buffer, "%zuM", size/1024/1024);
} else {
sprintf(buffer, "%zuK", size/1024);
}
return buffer;
}
static void sched_print_assignments(ggml_backend_sched_t sched, struct ggml_cgraph * graph) {
int cur_split = 0;
for (int i = 0; i < graph->n_nodes; i++) {
if (cur_split < sched->n_splits && i == sched->splits[cur_split].i_start) {
ggml_backend_t split_backend = ggml_tallocr_get_buffer(sched->splits[cur_split].tallocr)->backend;
fprintf(stderr, "\n## SPLIT #%d: %s # %d inputs: ", cur_split, ggml_backend_name(split_backend), sched->splits[cur_split].n_inputs);
for (int j = 0; j < sched->splits[cur_split].n_inputs; j++) {
fprintf(stderr, "[%s (%5.5s)] ", sched->splits[cur_split].inputs[j]->name, fmt_size(ggml_nbytes(sched->splits[cur_split].inputs[j])));
}
fprintf(stderr, "\n");
cur_split++;
}
struct ggml_tensor * node = graph->nodes[i];
if (ggml_is_view_op(node->op)) {
continue;
}
ggml_tallocr_t node_allocr = node_allocr(node);
ggml_backend_t node_backend = node_allocr ? ggml_tallocr_get_buffer(node_allocr)->backend : NULL;
fprintf(stderr, "node #%3d (%10.10s): %20.20s (%4.4s) [%4.4s %8.8s]:", i, ggml_op_name(node->op), node->name, fmt_size(ggml_nbytes(node)), node_allocr ? ggml_backend_name(node_backend) : "NULL", causes[hash_id(node)]);
for (int j = 0; j < GGML_MAX_SRC; j++) {
struct ggml_tensor * src = node->src[j];
if (src == NULL) {
break;
}
ggml_tallocr_t src_allocr = node_allocr(src);
ggml_backend_t src_backend = src_allocr ? ggml_tallocr_get_buffer(src_allocr)->backend : NULL;
fprintf(stderr, " %20.20s (%4.4s) [%4.4s %8.8s]", src->name, fmt_size(ggml_nbytes(src)), src_backend ? ggml_backend_name(src_backend) : "NULL", causes[hash_id(src)]);
}
fprintf(stderr, "\n");
}
}
// creates a copy of the tensor with the same memory layout
static struct ggml_tensor * ggml_dup_tensor_layout(struct ggml_context * ctx, const struct ggml_tensor * tensor) {
struct ggml_tensor * dup = ggml_dup_tensor(ctx, tensor);
for (int i = 0; i < GGML_MAX_DIMS; i++) {
dup->nb[i] = tensor->nb[i];
}
return dup;
}
// assigns backends to ops and splits the graph into subgraphs that can be computed on the same backend
// TODO: merge passes
static void sched_split_graph(ggml_backend_sched_t sched, struct ggml_cgraph * graph) {
// reset state
size_t hash_size = sched->hash_set.size;
memset(sched->hash_set.keys, 0, sizeof(sched->hash_set.keys[0]) * hash_size);
memset(sched->node_talloc, 0, sizeof(sched->node_talloc[0]) * hash_size);
memset(sched->node_copies, 0, sizeof(sched->node_copies[0]) * hash_size);
sched->n_splits = 0;
struct ggml_init_params params = {
/*.mem_size = */ sizeof(sched->context_buffer),
/*.mem_buffer = */ sched->context_buffer,
/*.no_alloc = */ true
};
if (sched->ctx != NULL) {
ggml_free(sched->ctx);
}
sched->ctx = ggml_init(params);
// pass 1: assign backends to ops with allocated inputs
for (int i = 0; i < graph->n_leafs; i++) {
struct ggml_tensor * leaf = graph->leafs[i];
if (node_allocr(leaf) != NULL) {
// do not overwrite user assignments
continue;
}
ggml_backend_t leaf_backend = ggml_get_backend(leaf);
if (leaf_backend == NULL && leaf->view_src != NULL) {
leaf_backend = ggml_get_backend(leaf->view_src);
}
if (leaf_backend != NULL) {
node_allocr(leaf) = ggml_backend_sched_get_tallocr(sched, leaf_backend);
}
}
for (int i = 0; i < graph->n_nodes; i++) {
struct ggml_tensor * node = graph->nodes[i];
if (node_allocr(node) != NULL) {
// do not overwrite user assignments
continue;
}
ggml_backend_t node_backend = sched_backend_from_cur(sched, node);
if (node_backend != NULL) {
node_allocr(node) = ggml_backend_sched_get_tallocr(sched, node_backend);
}
}
//printf("PASS 1 ASSIGNMENTS\n"); sched_print_assignments(sched, graph);
// pass 2: assign backends to ops from current assignments
// TODO:
// - reuse sched_backend_from_cur
for (int i = 0; i < graph->n_nodes; i++) {
struct ggml_tensor * node = graph->nodes[i];
ggml_tallocr_t node_allocr = node_allocr(node);
if (node_allocr == NULL) {
int cur_prio = INT_MAX;
size_t cur_size = 0;
for (int j = 0; j < GGML_MAX_SRC; j++) {
struct ggml_tensor * src = node->src[j];
if (src == NULL) {
break;
}
ggml_tallocr_t src_allocr = node_allocr(src);
if (src_allocr != NULL) {
int src_prio = sched_allocr_prio(sched, src_allocr);
size_t src_size = ggml_nbytes(src);
if (src_prio < cur_prio && src_size >= cur_size) {
cur_prio = src_prio;
cur_size = src_size;
node_allocr = src_allocr;
sprintf(causes[hash_id(node)], "2.src%d", j);
}
}
}
if (node_allocr != NULL) {
node_allocr(node) = node_allocr;
}
}
}
//printf("PASS 2 ASSIGNMENTS\n"); sched_print_assignments(sched, graph);
// pass 3: assign backends to remaining src from dst (should only be leafs)
for (int i = 0; i < graph->n_nodes; i++) {
struct ggml_tensor * node = graph->nodes[i];
ggml_tallocr_t node_allocr = node_allocr(node);
for (int j = 0; j < GGML_MAX_SRC; j++) {
struct ggml_tensor * src = node->src[j];
if (src == NULL) {
break;
}
ggml_tallocr_t src_allocr = node_allocr(src);
if (src_allocr == NULL) {
node_allocr(src) = node_allocr;
}
}
}
//printf("PASS 3 ASSIGNMENTS\n"); sched_print_assignments(sched, graph);
// pass 4: split graph, find tensors that need to be copied
// TODO:
// - when switching from a less preferred backend to a more preferred backend, check if it is possible to move the switch to an earlier point for the same cost
// find first backend
int cur_split = 0;
for (int i = 0; i < graph->n_nodes; i++) {
struct ggml_tensor * node = graph->nodes[i];
if (node->view_src == NULL) {
sched->splits[0].tallocr = node_allocr(node);
break;
}
}
sched->splits[0].i_start = 0;
sched->splits[0].n_inputs = 0;
memset(sched->splits[0].inputs, 0, sizeof(sched->splits[0].inputs)); //HACK
ggml_tallocr_t cur_allocr = sched->splits[0].tallocr;
size_t cur_backend_id = sched_allocr_prio(sched, cur_allocr);
for (int i = 0; i < graph->n_nodes; i++) {
struct ggml_tensor * node = graph->nodes[i];
if (ggml_is_view_op(node->op)) {
continue;
}
ggml_tallocr_t node_allocr = node_allocr(node);
if (node_allocr != cur_allocr) {
sched->splits[cur_split].i_end = i;
cur_split++;
GGML_ASSERT(cur_split < GGML_MAX_SPLITS);
sched->splits[cur_split].tallocr = node_allocr;
sched->splits[cur_split].i_start = i;
sched->splits[cur_split].n_inputs = 0;
memset(sched->splits[cur_split].inputs, 0, sizeof(sched->splits[cur_split].inputs)); //HACK
cur_allocr = node_allocr;
cur_backend_id = sched_allocr_prio(sched, cur_allocr);
}
// find inputs that are not on the same backend
for (int j = 0; j < GGML_MAX_SRC; j++) {
struct ggml_tensor * src = node->src[j];
if (src == NULL) {
break;
}
ggml_tallocr_t src_allocr = node_allocr(src);
if (src_allocr != node_allocr) {
int n_inputs = sched->splits[cur_split].n_inputs++;
GGML_ASSERT(n_inputs < GGML_MAX_SPLIT_INPUTS);
sched->splits[cur_split].inputs[n_inputs] = (struct ggml_tensor *)src;
// create copies
size_t id = hash_id(src);
if (sched->node_copies[id][cur_backend_id] == NULL) {
struct ggml_tensor * tensor_copy = ggml_dup_tensor_layout(sched->ctx, src);
sched->node_copies[id][cur_backend_id] = tensor_copy;
node_allocr(tensor_copy) = cur_allocr;
ggml_backend_t backend = ggml_tallocr_get_buffer(cur_allocr)->backend;
ggml_format_name(tensor_copy, "%s#%s", ggml_backend_name(backend), src->name);
}
node->src[j] = sched->node_copies[id][cur_backend_id];
}
}
}
sched->splits[cur_split].i_end = graph->n_nodes;
sched->n_splits = cur_split + 1;
//fprintf(stderr, "PASS 4 ASSIGNMENTS\n"); sched_print_assignments(sched, graph); fflush(stdout);
#if 1
// sanity check: all sources should have the same backend as the node
for (int i = 0; i < graph->n_nodes; i++) {
struct ggml_tensor * node = graph->nodes[i];
ggml_tallocr_t node_allocr = node_allocr(node);
if (node_allocr == NULL) {
fprintf(stderr, "!!!!!!! %s has no backend\n", node->name);
}
for (int j = 0; j < GGML_MAX_SRC; j++) {
struct ggml_tensor * src = node->src[j];
if (src == NULL) {
break;
}
ggml_tallocr_t src_allocr = node_allocr(src);
if (src_allocr != node_allocr /* && src_backend != NULL */) { // ignore nulls for now
fprintf(stderr, "!!!! %s has backend %s, src %d (%s) has backend %s\n",
node->name, node_allocr ? ggml_backend_name(ggml_tallocr_get_buffer(node_allocr)->backend) : "NULL",
j, src->name, src_allocr ? ggml_backend_name(ggml_tallocr_get_buffer(src_allocr)->backend) : "NULL");
}
}
}
#endif
// create copies of the graph for each split
// FIXME: avoid this copy, pass split inputs to ggml_gallocr_alloc_graph_n in some other way
struct ggml_cgraph * graph_copy = ggml_new_graph_custom(sched->ctx, graph->n_nodes + sched->n_splits*GGML_MAX_SPLIT_INPUTS, false);
for (int i = 0; i < sched->n_splits; i++) {
struct ggml_backend_sched_split * split = &sched->splits[i];
split->graph = ggml_graph_view(sched->ctx, graph, split->i_start, split->i_end);
// add inputs to the graph copy so that they are allocated by ggml-alloc at the start of the split
for (int j = 0; j < split->n_inputs; j++) {
struct ggml_tensor * input = split->inputs[j];
struct ggml_tensor * input_cpy = sched->node_copies[hash_id(input)][sched_allocr_prio(sched, split->tallocr)];
input_cpy->src[0] = input;
graph_copy->nodes[graph_copy->n_nodes++] = input_cpy;
}
for (int j = split->i_start; j < split->i_end; j++) {
graph_copy->nodes[graph_copy->n_nodes++] = graph->nodes[j];
}
}
sched->graph = graph_copy;
}
static void sched_alloc_splits(ggml_backend_sched_t sched) {
ggml_gallocr_alloc_graph_n(
sched->galloc,
sched->graph,
sched->hash_set,
sched->node_talloc);
}
static void sched_compute_splits(ggml_backend_sched_t sched) {
uint64_t copy_us[GGML_MAX_BACKENDS] = {0};
uint64_t compute_us[GGML_MAX_BACKENDS] = {0};
struct ggml_backend_sched_split * splits = sched->splits;
for (int i = 0; i < sched->n_splits; i++) {
struct ggml_backend_sched_split * split = &splits[i];
ggml_backend_t split_backend = ggml_tallocr_get_buffer(split->tallocr)->backend;
int split_backend_id = sched_backend_prio(sched, split_backend);
// copy the input tensors to the split backend
uint64_t copy_start_us = ggml_time_us();
for (int j = 0; j < split->n_inputs; j++) {
struct ggml_tensor * input_cpy = sched->node_copies[hash_id(split->inputs[j])][sched_backend_prio(sched, split_backend)];
if (split->inputs[j]->buffer == NULL) {
if (split->inputs[j]->view_src == NULL) {
fprintf(stderr, "input %s has no buffer and no view_src\n", split->inputs[j]->name);
exit(1);
}
struct ggml_tensor * view = split->inputs[j];
view->backend = view->view_src->backend;
view->buffer = view->view_src->buffer;
view->data = (char *)view->view_src->data + view->view_offs;
ggml_backend_buffer_init_tensor(ggml_backend_sched_get_buffer(sched, view->buffer->backend), view);
}
if (input_cpy->buffer == NULL) {
fprintf(stderr, "input_cpy %s has no buffer\n", input_cpy->name);
exit(1);
}
GGML_ASSERT(split->inputs[j]->buffer->backend != input_cpy->buffer->backend);
GGML_ASSERT(input_cpy->buffer->backend == split_backend);
ggml_backend_tensor_copy(split->inputs[j], input_cpy);
}
// ggml_backend_synchronize(split_backend);
int64_t copy_end_us = ggml_time_us();
copy_us[split_backend_id] += copy_end_us - copy_start_us;
#if 0
char split_filename[GGML_MAX_NAME];
snprintf(split_filename, GGML_MAX_NAME, "split_%i_%s.dot", i, ggml_backend_name(split_backend));
ggml_graph_dump_dot(split->graph, NULL, split_filename);
#endif
uint64_t compute_start_us = ggml_time_us();
ggml_backend_graph_compute(split_backend, split->graph);
// ggml_backend_synchronize(split_backend);
uint64_t compute_end_us = ggml_time_us();
compute_us[split_backend_id] += compute_end_us - compute_start_us;
}
#if 0
// per-backend timings
fprintf(stderr, "sched_compute_splits times (%d splits):\n", sched->n_splits);
for (int i = 0; i < sched->n_backends; i++) {
if (copy_us[i] > 0 || compute_us[i] > 0) {
fprintf(stderr, "\t%5.5s: %lu us copy, %lu us compute\n", ggml_backend_name(sched->backends[i]), copy_us[i], compute_us[i]);
}
}
#endif
}
static void sched_reset(ggml_backend_sched_t sched) {
for (int i = 0; i < sched->n_backends; i++) {
ggml_tallocr_reset(sched->tallocs[i]);
}
}
ggml_backend_sched_t ggml_backend_sched_new(ggml_backend_t * backends, int n_backends) {
GGML_ASSERT(n_backends <= GGML_MAX_BACKENDS);
struct ggml_backend_sched * sched = malloc(sizeof(struct ggml_backend_sched));
memset(sched, 0, sizeof(struct ggml_backend_sched));
fprintf(stderr, "ggml_backend_sched size: %lu KB\n", sizeof(struct ggml_backend_sched)/1024);
sched->n_backends = n_backends;
for (int i = 0; i < n_backends; i++) {
sched->backends[i] = backends[i];
}
sched->galloc = ggml_gallocr_new();
// init measure allocs for each backend
for (int i = 0; i < n_backends; i++) {
sched->tallocs[i] = ggml_tallocr_new_measure_from_backend(backends[i]);
}
return sched;
}
void ggml_backend_sched_free(ggml_backend_sched_t sched) {
if (sched == NULL) {
return;
}
for (int i = 0; i < sched->n_backends; i++) {
ggml_tallocr_free(sched->tallocs[i]);
}
ggml_gallocr_free(sched->galloc);
free(sched->hash_set.keys);
free(sched->node_talloc);
free(sched->node_copies);
free(sched);
}
void ggml_backend_sched_init_measure(ggml_backend_sched_t sched, struct ggml_cgraph * measure_graph) {
// initialize hash tables
size_t hash_size = measure_graph->visited_hash_table.size + GGML_MAX_SPLITS*GGML_MAX_SPLIT_INPUTS;
sched->hash_set.size = hash_size;
sched->hash_set.keys = malloc(sizeof(sched->hash_set.keys[0]) * hash_size);
sched->node_talloc = malloc(sizeof(sched->node_talloc[0]) * hash_size);
sched->node_copies = malloc(sizeof(sched->node_copies[0]) * hash_size);
sched_split_graph(sched, measure_graph);
sched_alloc_splits(sched);
// allocate buffers and reset allocators
for (int i = 0; i < sched->n_backends; i++) {
size_t size = ggml_tallocr_max_size(sched->tallocs[i]);
ggml_tallocr_free(sched->tallocs[i]);
sched->tallocs[i] = ggml_tallocr_new_from_backend(sched->backends[i], size);
}
sched_reset(sched);
}
void ggml_backend_sched_graph_compute(ggml_backend_sched_t sched, struct ggml_cgraph * graph) {
GGML_ASSERT(sched->hash_set.size >= graph->visited_hash_table.size + GGML_MAX_SPLITS*GGML_MAX_SPLIT_INPUTS);
sched_split_graph(sched, graph);
sched_alloc_splits(sched);
sched_compute_splits(sched);
sched_reset(sched);
}
ggml_tallocr_t ggml_backend_sched_get_tallocr(ggml_backend_sched_t sched, ggml_backend_t backend) {
int backend_index = sched_backend_prio(sched, backend);
return sched->tallocs[backend_index];
}
ggml_backend_buffer_t ggml_backend_sched_get_buffer(ggml_backend_sched_t sched, ggml_backend_t backend) {
int backend_index = sched_backend_prio(sched, backend);
return ggml_tallocr_get_buffer(sched->tallocs[backend_index]);
}
void ggml_backend_sched_set_node_backend(ggml_backend_sched_t sched, struct ggml_tensor * node, ggml_backend_t backend) {
int backend_index = sched_backend_prio(sched, backend);
GGML_ASSERT(backend_index >= 0 && backend_index < sched->n_backends);
node_allocr(node) = sched->tallocs[backend_index];
}

View File

@ -1,136 +0,0 @@
#pragma once
#include "ggml.h"
#include "ggml-alloc.h"
#ifdef __cplusplus
extern "C" {
#endif
//
// Backend buffer
//
struct ggml_backend_buffer;
typedef struct ggml_backend_buffer * ggml_backend_buffer_t;
// backend buffer functions
GGML_API void ggml_backend_buffer_free (ggml_backend_buffer_t buffer);
GGML_API size_t ggml_backend_buffer_get_alignment (ggml_backend_buffer_t buffer);
GGML_API void * ggml_backend_buffer_get_base (ggml_backend_buffer_t buffer);
GGML_API size_t ggml_backend_buffer_get_size (ggml_backend_buffer_t buffer);
GGML_API size_t ggml_backend_buffer_get_alloc_size(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
GGML_API void ggml_backend_buffer_init_tensor (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
GGML_API void ggml_backend_buffer_free_tensor (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
//
// Backend
//
struct ggml_backend;
typedef struct ggml_backend * ggml_backend_t;
typedef void * ggml_backend_graph_plan_t;
GGML_API ggml_backend_t ggml_get_backend(const struct ggml_tensor * tensor);
GGML_API const char * ggml_backend_name(ggml_backend_t backend);
GGML_API void ggml_backend_free(ggml_backend_t backend);
GGML_API ggml_backend_buffer_t ggml_backend_alloc_buffer(ggml_backend_t backend, size_t size);
GGML_API size_t ggml_backend_get_alignment(ggml_backend_t backend);
GGML_API void ggml_backend_tensor_set_async( struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
GGML_API void ggml_backend_tensor_get_async(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);
GGML_API void ggml_backend_tensor_set( struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
GGML_API void ggml_backend_tensor_get(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);
GGML_API void ggml_backend_synchronize(ggml_backend_t backend);
GGML_API ggml_backend_graph_plan_t ggml_backend_graph_plan_create (ggml_backend_t backend, struct ggml_cgraph * cgraph);
GGML_API void ggml_backend_graph_plan_free (ggml_backend_t backend, ggml_backend_graph_plan_t plan);
GGML_API void ggml_backend_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan);
GGML_API void ggml_backend_graph_compute (ggml_backend_t backend, struct ggml_cgraph * cgraph);
GGML_API bool ggml_backend_supports_op (ggml_backend_t backend, const struct ggml_tensor * op);
// tensor copy between different backends
GGML_API void ggml_backend_tensor_copy(struct ggml_tensor * src, struct ggml_tensor * dst);
//
// CPU backend
//
GGML_API ggml_backend_t ggml_backend_cpu_init(void);
GGML_API bool ggml_backend_is_cpu(ggml_backend_t backend);
GGML_API void ggml_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);
// Create a backend buffer from an existing pointer
GGML_API ggml_backend_buffer_t ggml_backend_cpu_buffer_from_ptr(ggml_backend_t backend_cpu, void * ptr, size_t size);
//
// Backend scheduler
//
// The backend scheduler allows for multiple backends to be used together
// Handles compute buffer allocation, assignment of tensors to backends, and copying of tensors between backends
// The backends are selected based on:
// - the backend that supports the operation
// - the location of the pre-allocated tensors (e.g. the weights)
/*
Example usage:
sched = ggml_backend_sched_new({backend_gpu, backend_gpu2, backend_cpu}, num_backends);
// sched is initialized with measure allocators and cannot be used until allocated with a measure graph
// initialize buffers from a measure graph
measure_graph = build_graph(sched); // use the allocr to allocate inputs as needed
// in build_graph:
build_graph(...) {
// allocating tensors in a specific backend (optional, recommended: pre-allocate inputs in a different buffer)
alloc_cpu = ggml_backend_sched_get_allocr(sched, backend_cpu);
ggml_allocr_alloc(alloc_cpu, tensor);
// manually assigning nodes to a backend (optional, shouldn't be needed in most cases)
struct ggml_tensor * node = ggml_mul_mat(ctx, ...);
ggml_backend_sched_set_node_backend(sched, node, backend_gpu);
}
// allocate backend buffers from measure graph
ggml_backend_sched_init_measure(sched, measure_graph);
// the scheduler is now ready to compute graphs
// compute
graph = build_graph(sched);
ggml_backend_sched_graph_compute(sched, graph);
*/
struct ggml_backend_sched;
typedef struct ggml_backend_sched * ggml_backend_sched_t;
// Initialize a backend scheduler
GGML_API ggml_backend_sched_t ggml_backend_sched_new(ggml_backend_t * backends, int n_backends);
GGML_API void ggml_backend_sched_free(ggml_backend_sched_t sched);
// Initialize backend buffers from a measure graph
GGML_API void ggml_backend_sched_init_measure(ggml_backend_sched_t sched, struct ggml_cgraph * measure_graph);
GGML_API ggml_tallocr_t ggml_backend_sched_get_tallocr(ggml_backend_sched_t sched, ggml_backend_t backend);
GGML_API ggml_backend_buffer_t ggml_backend_sched_get_buffer (ggml_backend_sched_t sched, ggml_backend_t backend);
GGML_API void ggml_backend_sched_set_node_backend(ggml_backend_sched_t sched, struct ggml_tensor * node, ggml_backend_t backend);
// Allocate a graph on the backend scheduler
GGML_API void ggml_backend_sched_graph_compute(
ggml_backend_sched_t sched,
struct ggml_cgraph * graph);
#ifdef __cplusplus
}
#endif

View File

@ -1,249 +0,0 @@
#pragma once
#include "ggml.h"
// GGML internal header
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>
#include <string.h> // memcpy
#include <math.h> // fabsf
#ifdef __cplusplus
extern "C" {
#endif
// static_assert should be a #define, but if it's not,
// fall back to the _Static_assert C11 keyword.
// if C99 - static_assert is noop
// ref: https://stackoverflow.com/a/53923785/4039976
#ifndef static_assert
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201100L)
#define static_assert(cond, msg) _Static_assert(cond, msg)
#else
#define static_assert(cond, msg) struct global_scope_noop_trick
#endif
#endif
// __FMA__ and __F16C__ are not defined in MSVC, however they are implied with AVX2/AVX512
#if defined(_MSC_VER) && (defined(__AVX2__) || defined(__AVX512F__))
#ifndef __FMA__
#define __FMA__
#endif
#ifndef __F16C__
#define __F16C__
#endif
#ifndef __SSE3__
#define __SSE3__
#endif
#endif
#undef MIN
#undef MAX
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))
// 16-bit float
// on Arm, we use __fp16
// on x86, we use uint16_t
#if defined(__ARM_NEON) && !defined(_MSC_VER)
// if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:
//
// $ ln -sfn /Library/Developer/CommandLineTools/usr/lib/clang/13.1.6/include/arm_neon.h ./src/
//
#include <arm_neon.h>
#define GGML_COMPUTE_FP16_TO_FP32(x) ((float) (x))
#define GGML_COMPUTE_FP32_TO_FP16(x) (x)
#define GGML_FP16_TO_FP32(x) ((float) (x))
#define GGML_FP32_TO_FP16(x) (x)
#else
#ifdef __wasm_simd128__
#include <wasm_simd128.h>
#else
#ifdef __POWER9_VECTOR__
#include <altivec.h>
#undef bool
#define bool _Bool
#else
#if defined(_MSC_VER) || defined(__MINGW32__)
#include <intrin.h>
#else
#if defined(__AVX__) || defined(__AVX2__) || defined(__AVX512F__) || defined(__SSSE3__) || defined(__SSE3__)
#if !defined(__riscv)
#include <immintrin.h>
#endif
#endif
#endif
#endif
#endif
#ifdef __riscv_v_intrinsic
#include <riscv_vector.h>
#endif
#ifdef __F16C__
#ifdef _MSC_VER
#define GGML_COMPUTE_FP16_TO_FP32(x) _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(x)))
#define GGML_COMPUTE_FP32_TO_FP16(x) _mm_extract_epi16(_mm_cvtps_ph(_mm_set_ss(x), 0), 0)
#else
#define GGML_COMPUTE_FP16_TO_FP32(x) _cvtsh_ss(x)
#define GGML_COMPUTE_FP32_TO_FP16(x) _cvtss_sh(x, 0)
#endif
#elif defined(__POWER9_VECTOR__)
#define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
#define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
/* the inline asm below is about 12% faster than the lookup method */
#define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
#define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
register float f;
register double d;
__asm__(
"mtfprd %0,%2\n"
"xscvhpdp %0,%0\n"
"frsp %1,%0\n" :
/* temp */ "=d"(d),
/* out */ "=f"(f):
/* in */ "r"(h));
return f;
}
static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
register double d;
register ggml_fp16_t r;
__asm__( /* xscvdphp can work on double or single precision */
"xscvdphp %0,%2\n"
"mffprd %1,%0\n" :
/* temp */ "=d"(d),
/* out */ "=r"(r):
/* in */ "f"(f));
return r;
}
#else
// FP16 <-> FP32
// ref: https://github.com/Maratyszcza/FP16
static inline float fp32_from_bits(uint32_t w) {
union {
uint32_t as_bits;
float as_value;
} fp32;
fp32.as_bits = w;
return fp32.as_value;
}
static inline uint32_t fp32_to_bits(float f) {
union {
float as_value;
uint32_t as_bits;
} fp32;
fp32.as_value = f;
return fp32.as_bits;
}
static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t two_w = w + w;
const uint32_t exp_offset = UINT32_C(0xE0) << 23;
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)
const float exp_scale = 0x1.0p-112f;
#else
const float exp_scale = fp32_from_bits(UINT32_C(0x7800000));
#endif
const float normalized_value = fp32_from_bits((two_w >> 4) + exp_offset) * exp_scale;
const uint32_t magic_mask = UINT32_C(126) << 23;
const float magic_bias = 0.5f;
const float denormalized_value = fp32_from_bits((two_w >> 17) | magic_mask) - magic_bias;
const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
const uint32_t result = sign |
(two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value) : fp32_to_bits(normalized_value));
return fp32_from_bits(result);
}
static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)
const float scale_to_inf = 0x1.0p+112f;
const float scale_to_zero = 0x1.0p-110f;
#else
const float scale_to_inf = fp32_from_bits(UINT32_C(0x77800000));
const float scale_to_zero = fp32_from_bits(UINT32_C(0x08800000));
#endif
float base = (fabsf(f) * scale_to_inf) * scale_to_zero;
const uint32_t w = fp32_to_bits(f);
const uint32_t shl1_w = w + w;
const uint32_t sign = w & UINT32_C(0x80000000);
uint32_t bias = shl1_w & UINT32_C(0xFF000000);
if (bias < UINT32_C(0x71000000)) {
bias = UINT32_C(0x71000000);
}
base = fp32_from_bits((bias >> 1) + UINT32_C(0x07800000)) + base;
const uint32_t bits = fp32_to_bits(base);
const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF);
const uint32_t nonsign = exp_bits + mantissa_bits;
return (sign >> 16) | (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00) : nonsign);
}
#define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
#define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
#endif // __F16C__
#endif // __ARM_NEON
// precomputed f32 table for f16 (256 KB)
// defined in ggml.c, initialized in ggml_init()
extern float ggml_table_f32_f16[1 << 16];
// On ARM NEON, it's quicker to directly convert x -> x instead of calling into ggml_lookup_fp16_to_fp32,
// so we define GGML_FP16_TO_FP32 and GGML_FP32_TO_FP16 elsewhere for NEON.
// This is also true for POWER9.
#if !defined(GGML_FP16_TO_FP32) || !defined(GGML_FP32_TO_FP16)
inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
uint16_t s;
memcpy(&s, &f, sizeof(uint16_t));
return ggml_table_f32_f16[s];
}
#define GGML_FP16_TO_FP32(x) ggml_lookup_fp16_to_fp32(x)
#define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
#endif
#define GGML_HASHTABLE_FULL ((size_t)-1)
#define GGML_HASHTABLE_ALREADY_EXISTS ((size_t)-2)
bool ggml_hash_contains (const struct ggml_hash_set hash_set, struct ggml_tensor * key);
// returns GGML_HASHTABLE_FULL if table is full, otherwise the current index of the key or where it should be inserted
size_t ggml_hash_find (const struct ggml_hash_set hash_set, struct ggml_tensor * key);
// returns GGML_HAHSHTABLE_ALREADY_EXISTS if key already exists, index otherwise, asserts if table is full
size_t ggml_hash_insert ( struct ggml_hash_set hash_set, struct ggml_tensor * key);
// return index, asserts if table is full
size_t ggml_hash_find_or_insert( struct ggml_hash_set hash_set, struct ggml_tensor * key);
#ifdef __cplusplus
}
#endif

File diff suppressed because it is too large Load Diff

View File

@ -1,224 +0,0 @@
#pragma once
#include "ggml-impl.h"
// GGML internal header
#include <stdint.h>
#include <stddef.h>
#define QK4_0 32
typedef struct {
ggml_fp16_t d; // delta
uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;
static_assert(sizeof(block_q4_0) == sizeof(ggml_fp16_t) + QK4_0 / 2, "wrong q4_0 block size/padding");
#define QK4_1 32
typedef struct {
ggml_fp16_t d; // delta
ggml_fp16_t m; // min
uint8_t qs[QK4_1 / 2]; // nibbles / quants
} block_q4_1;
static_assert(sizeof(block_q4_1) == 2 * sizeof(ggml_fp16_t) + QK4_1 / 2, "wrong q4_1 block size/padding");
#define QK5_0 32
typedef struct {
ggml_fp16_t d; // delta
uint8_t qh[4]; // 5-th bit of quants
uint8_t qs[QK5_0 / 2]; // nibbles / quants
} block_q5_0;
static_assert(sizeof(block_q5_0) == sizeof(ggml_fp16_t) + sizeof(uint32_t) + QK5_0 / 2, "wrong q5_0 block size/padding");
#define QK5_1 32
typedef struct {
ggml_fp16_t d; // delta
ggml_fp16_t m; // min
uint8_t qh[4]; // 5-th bit of quants
uint8_t qs[QK5_1 / 2]; // nibbles / quants
} block_q5_1;
static_assert(sizeof(block_q5_1) == 2 * sizeof(ggml_fp16_t) + sizeof(uint32_t) + QK5_1 / 2, "wrong q5_1 block size/padding");
#define QK8_0 32
typedef struct {
ggml_fp16_t d; // delta
int8_t qs[QK8_0]; // quants
} block_q8_0;
static_assert(sizeof(block_q8_0) == sizeof(ggml_fp16_t) + QK8_0, "wrong q8_0 block size/padding");
#define QK8_1 32
typedef struct {
float d; // delta
float s; // d * sum(qs[i])
int8_t qs[QK8_1]; // quants
} block_q8_1;
static_assert(sizeof(block_q8_1) == 2*sizeof(float) + QK8_1, "wrong q8_1 block size/padding");
//
// Super-block quantization structures
//
// Super-block size
#ifdef GGML_QKK_64
#define QK_K 64
#define K_SCALE_SIZE 4
#else
#define QK_K 256
#define K_SCALE_SIZE 12
#endif
// 2-bit quantization
// weight is represented as x = a * q + b
// 16 blocks of 16 elements each
// Effectively 2.5625 bits per weight
typedef struct {
uint8_t scales[QK_K/16]; // scales and mins, quantized with 4 bits
uint8_t qs[QK_K/4]; // quants
ggml_fp16_t d; // super-block scale for quantized scales
ggml_fp16_t dmin; // super-block scale for quantized mins
} block_q2_K;
static_assert(sizeof(block_q2_K) == 2*sizeof(ggml_fp16_t) + QK_K/16 + QK_K/4, "wrong q2_K block size/padding");
// 3-bit quantization
// weight is represented as x = a * q
// 16 blocks of 16 elements each
// Effectively 3.4375 bits per weight
#ifdef GGML_QKK_64
typedef struct {
uint8_t hmask[QK_K/8]; // quants - high bit
uint8_t qs[QK_K/4]; // quants - low 2 bits
uint8_t scales[2];
ggml_fp16_t d; // super-block scale
} block_q3_K;
static_assert(sizeof(block_q3_K) == sizeof(ggml_fp16_t) + QK_K / 4 + QK_K / 8 + 2, "wrong q3_K block size/padding");
#else
typedef struct {
uint8_t hmask[QK_K/8]; // quants - high bit
uint8_t qs[QK_K/4]; // quants - low 2 bits
uint8_t scales[12]; // scales, quantized with 6 bits
ggml_fp16_t d; // super-block scale
} block_q3_K;
static_assert(sizeof(block_q3_K) == sizeof(ggml_fp16_t) + QK_K / 4 + QK_K / 8 + 12, "wrong q3_K block size/padding");
#endif
// 4-bit quantization
// 8 blocks of 32 elements each
// weight is represented as x = a * q + b
// Effectively 4.5 bits per weight
#ifdef GGML_QKK_64
typedef struct {
ggml_fp16_t d[2]; // super-block scales/mins
uint8_t scales[2]; // 4-bit block scales/mins
uint8_t qs[QK_K/2]; // 4--bit quants
} block_q4_K;
static_assert(sizeof(block_q4_K) == 2*sizeof(ggml_fp16_t) + QK_K/2 + 2, "wrong q4_K block size/padding");
#else
typedef struct {
ggml_fp16_t d; // super-block scale for quantized scales
ggml_fp16_t dmin; // super-block scale for quantized mins
uint8_t scales[K_SCALE_SIZE]; // scales and mins, quantized with 6 bits
uint8_t qs[QK_K/2]; // 4--bit quants
} block_q4_K;
static_assert(sizeof(block_q4_K) == 2*sizeof(ggml_fp16_t) + K_SCALE_SIZE + QK_K/2, "wrong q4_K block size/padding");
#endif
// 5-bit quantization
// 8 blocks of 32 elements each
// weight is represented as x = a * q + b
// Effectively 5.5 bits per weight
#ifdef GGML_QKK_64
typedef struct {
ggml_fp16_t d; // super-block scale
int8_t scales[QK_K/16]; // 8-bit block scales
uint8_t qh[QK_K/8]; // quants, high bit
uint8_t qs[QK_K/2]; // quants, low 4 bits
} block_q5_K;
static_assert(sizeof(block_q5_K) == sizeof(ggml_fp16_t) + QK_K/2 + QK_K/8 + QK_K/16, "wrong q5_K block size/padding");
#else
typedef struct {
ggml_fp16_t d; // super-block scale for quantized scales
ggml_fp16_t dmin; // super-block scale for quantized mins
uint8_t scales[K_SCALE_SIZE]; // scales and mins, quantized with 6 bits
uint8_t qh[QK_K/8]; // quants, high bit
uint8_t qs[QK_K/2]; // quants, low 4 bits
} block_q5_K;
static_assert(sizeof(block_q5_K) == 2*sizeof(ggml_fp16_t) + K_SCALE_SIZE + QK_K/2 + QK_K/8, "wrong q5_K block size/padding");
#endif
// 6-bit quantization
// weight is represented as x = a * q
// 16 blocks of 16 elements each
// Effectively 6.5625 bits per weight
typedef struct {
uint8_t ql[QK_K/2]; // quants, lower 4 bits
uint8_t qh[QK_K/4]; // quants, upper 2 bits
int8_t scales[QK_K/16]; // scales, quantized with 8 bits
ggml_fp16_t d; // super-block scale
} block_q6_K;
static_assert(sizeof(block_q6_K) == sizeof(ggml_fp16_t) + QK_K / 16 + 3*QK_K/4, "wrong q6_K block size/padding");
// This is only used for intermediate quantization and dot products
typedef struct {
float d; // delta
int8_t qs[QK_K]; // quants
int16_t bsums[QK_K/16]; // sum of quants in groups of 16
} block_q8_K;
static_assert(sizeof(block_q8_K) == sizeof(float) + QK_K + QK_K/16*sizeof(int16_t), "wrong q8_K block size/padding");
// Quantization
void quantize_row_q4_0_reference(const float * restrict x, block_q4_0 * restrict y, int k);
void quantize_row_q4_1_reference(const float * restrict x, block_q4_1 * restrict y, int k);
void quantize_row_q5_0_reference(const float * restrict x, block_q5_0 * restrict y, int k);
void quantize_row_q5_1_reference(const float * restrict x, block_q5_1 * restrict y, int k);
void quantize_row_q8_0_reference(const float * restrict x, block_q8_0 * restrict y, int k);
void quantize_row_q8_1_reference(const float * restrict x, block_q8_1 * restrict y, int k);
void quantize_row_q2_K_reference(const float * restrict x, block_q2_K * restrict y, int k);
void quantize_row_q3_K_reference(const float * restrict x, block_q3_K * restrict y, int k);
void quantize_row_q4_K_reference(const float * restrict x, block_q4_K * restrict y, int k);
void quantize_row_q5_K_reference(const float * restrict x, block_q5_K * restrict y, int k);
void quantize_row_q6_K_reference(const float * restrict x, block_q6_K * restrict y, int k);
void quantize_row_q8_K_reference(const float * restrict x, block_q8_K * restrict y, int k);
void quantize_row_q4_0(const float * restrict x, void * restrict y, int k);
void quantize_row_q4_1(const float * restrict x, void * restrict y, int k);
void quantize_row_q5_0(const float * restrict x, void * restrict y, int k);
void quantize_row_q5_1(const float * restrict x, void * restrict y, int k);
void quantize_row_q8_0(const float * restrict x, void * restrict y, int k);
void quantize_row_q8_1(const float * restrict x, void * restrict y, int k);
void quantize_row_q2_K(const float * restrict x, void * restrict y, int k);
void quantize_row_q3_K(const float * restrict x, void * restrict y, int k);
void quantize_row_q4_K(const float * restrict x, void * restrict y, int k);
void quantize_row_q5_K(const float * restrict x, void * restrict y, int k);
void quantize_row_q6_K(const float * restrict x, void * restrict y, int k);
void quantize_row_q8_K(const float * restrict x, void * restrict y, int k);
// Dequantization
void dequantize_row_q4_0(const block_q4_0 * restrict x, float * restrict y, int k);
void dequantize_row_q4_1(const block_q4_1 * restrict x, float * restrict y, int k);
void dequantize_row_q5_0(const block_q5_0 * restrict x, float * restrict y, int k);
void dequantize_row_q5_1(const block_q5_1 * restrict x, float * restrict y, int k);
void dequantize_row_q8_0(const block_q8_0 * restrict x, float * restrict y, int k);
//void dequantize_row_q8_1(const block_q8_1 * restrict x, float * restrict y, int k);
void dequantize_row_q2_K(const block_q2_K * restrict x, float * restrict y, int k);
void dequantize_row_q3_K(const block_q3_K * restrict x, float * restrict y, int k);
void dequantize_row_q4_K(const block_q4_K * restrict x, float * restrict y, int k);
void dequantize_row_q5_K(const block_q5_K * restrict x, float * restrict y, int k);
void dequantize_row_q6_K(const block_q6_K * restrict x, float * restrict y, int k);
void dequantize_row_q8_K(const block_q8_K * restrict x, float * restrict y, int k);
// Dot product
void ggml_vec_dot_q4_0_q8_0(int n, float * restrict s, const void * restrict vx, const void * restrict vy);
void ggml_vec_dot_q4_1_q8_1(int n, float * restrict s, const void * restrict vx, const void * restrict vy);
void ggml_vec_dot_q5_0_q8_0(int n, float * restrict s, const void * restrict vx, const void * restrict vy);
void ggml_vec_dot_q5_1_q8_1(int n, float * restrict s, const void * restrict vx, const void * restrict vy);
void ggml_vec_dot_q8_0_q8_0(int n, float * restrict s, const void * restrict vx, const void * restrict vy);
void ggml_vec_dot_q2_K_q8_K(int n, float * restrict s, const void * restrict vx, const void * restrict vy);
void ggml_vec_dot_q3_K_q8_K(int n, float * restrict s, const void * restrict vx, const void * restrict vy);
void ggml_vec_dot_q4_K_q8_K(int n, float * restrict s, const void * restrict vx, const void * restrict vy);
void ggml_vec_dot_q5_K_q8_K(int n, float * restrict s, const void * restrict vx, const void * restrict vy);
void ggml_vec_dot_q6_K_q8_K(int n, float * restrict s, const void * restrict vx, const void * restrict vy);

View File

@ -0,0 +1,14 @@
ggml-metal-embed.o: \
ggml-metal.metal \
ggml-common.h
@echo "Embedding Metal library"
@sed -e '/#include "ggml-common.h"/r ggml-common.h' -e '/#include "ggml-common.h"/d' < ggml-metal.metal > ggml-metal-embed.metal
$(eval TEMP_ASSEMBLY=$(shell mktemp))
@echo ".section __DATA, __ggml_metallib" > $(TEMP_ASSEMBLY)
@echo ".globl _ggml_metallib_start" >> $(TEMP_ASSEMBLY)
@echo "_ggml_metallib_start:" >> $(TEMP_ASSEMBLY)
@echo ".incbin \"ggml-metal-embed.metal\"" >> $(TEMP_ASSEMBLY)
@echo ".globl _ggml_metallib_end" >> $(TEMP_ASSEMBLY)
@echo "_ggml_metallib_end:" >> $(TEMP_ASSEMBLY)
@$(AS) $(TEMP_ASSEMBLY) -o $@
@rm -f ${TEMP_ASSEMBLY}

File diff suppressed because it is too large Load Diff

View File

@ -3,6 +3,13 @@
#include "whisper.h"
typedef struct {
VALUE *context;
VALUE user_data;
VALUE callback;
VALUE callbacks;
} ruby_whisper_callback_container;
typedef struct {
struct whisper_context *context;
} ruby_whisper;
@ -10,6 +17,9 @@ typedef struct {
typedef struct {
struct whisper_full_params params;
bool diarize;
ruby_whisper_callback_container *new_segment_callback_container;
ruby_whisper_callback_container *progress_callback_container;
ruby_whisper_callback_container *abort_callback_container;
} ruby_whisper_params;
#endif

View File

@ -0,0 +1,31 @@
---
- ../../src/whisper.cpp
- ../../include/whisper.h
- ../../ggml/src/ggml.c
- ../../ggml/src/ggml-cpu.c
- ../../ggml/src/ggml-impl.h
- ../../ggml/src/ggml-aarch64.h
- ../../ggml/src/ggml-aarch64.c
- ../../ggml/src/ggml-alloc.c
- ../../ggml/src/ggml-backend-impl.h
- ../../ggml/src/ggml-backend.cpp
- ../../ggml/src/ggml-common.h
- ../../ggml/src/ggml-quants.h
- ../../ggml/src/ggml-quants.c
- ../../ggml/src/ggml-cpu-impl.h
- ../../ggml/src/ggml-metal.m
- ../../ggml/src/ggml-metal.metal
- ../../ggml/src/ggml-blas.cpp
- ../../ggml/include/ggml.h
- ../../ggml/include/ggml-alloc.h
- ../../ggml/include/ggml-backend.h
- ../../ggml/include/ggml-cpu.h
- ../../ggml/include/ggml-cuda.h
- ../../ggml/include/ggml-kompute.h
- ../../ggml/include/ggml-metal.h
- ../../ggml/include/ggml-sycl.h
- ../../ggml/include/ggml-vulkan.h
- ../../ggml/include/ggml-blas.h
- ../../scripts/get-flags.mk
- ../../examples/dr_wav.h
- ../../LICENSE

View File

@ -0,0 +1,7 @@
require "test/unit"
require "whisper"
class TestBase < Test::Unit::TestCase
MODEL = File.join(__dir__, "..", "..", "..", "models", "ggml-base.en.bin")
AUDIO = File.join(__dir__, "..", "..", "..", "samples", "jfk.wav")
end

View File

@ -0,0 +1,163 @@
require "test/unit"
require "whisper"
class TestCallback < Test::Unit::TestCase
TOPDIR = File.expand_path(File.join(File.dirname(__FILE__), '..'))
def setup
GC.start
@params = Whisper::Params.new
@whisper = Whisper::Context.new(File.join(TOPDIR, '..', '..', 'models', 'ggml-base.en.bin'))
@audio = File.join(TOPDIR, '..', '..', 'samples', 'jfk.wav')
end
def test_new_segment_callback
@params.new_segment_callback = ->(context, state, n_new, user_data) {
assert_kind_of Integer, n_new
assert n_new > 0
assert_same @whisper, context
n_segments = context.full_n_segments
n_new.times do |i|
i_segment = n_segments - 1 + i
start_time = context.full_get_segment_t0(i_segment) * 10
end_time = context.full_get_segment_t1(i_segment) * 10
text = context.full_get_segment_text(i_segment)
assert_kind_of Integer, start_time
assert start_time >= 0
assert_kind_of Integer, end_time
assert end_time > 0
assert_match /ask not what your country can do for you, ask what you can do for your country/, text if i_segment == 0
end
}
@whisper.transcribe(@audio, @params)
end
def test_new_segment_callback_closure
search_word = "what"
@params.new_segment_callback = ->(context, state, n_new, user_data) {
n_segments = context.full_n_segments
n_new.times do |i|
i_segment = n_segments - 1 + i
text = context.full_get_segment_text(i_segment)
if text.include?(search_word)
t0 = context.full_get_segment_t0(i_segment)
t1 = context.full_get_segment_t1(i_segment)
raise "search word '#{search_word}' found at between #{t0} and #{t1}"
end
end
}
assert_raise RuntimeError do
@whisper.transcribe(@audio, @params)
end
end
def test_new_segment_callback_user_data
udata = Object.new
@params.new_segment_callback_user_data = udata
@params.new_segment_callback = ->(context, state, n_new, user_data) {
assert_same udata, user_data
}
@whisper.transcribe(@audio, @params)
end
def test_new_segment_callback_user_data_gc
@params.new_segment_callback_user_data = "My user data"
@params.new_segment_callback = ->(context, state, n_new, user_data) {
assert_equal "My user data", user_data
}
GC.start
assert_same @whisper, @whisper.transcribe(@audio, @params)
end
def test_progress_callback
first = nil
last = nil
@params.progress_callback = ->(context, state, progress, user_data) {
assert_kind_of Integer, progress
assert 0 <= progress && progress <= 100
assert_same @whisper, context
first = progress if first.nil?
last = progress
}
@whisper.transcribe(@audio, @params)
assert_equal 0, first
assert_equal 100, last
end
def test_progress_callback_user_data
udata = Object.new
@params.progress_callback_user_data = udata
@params.progress_callback = ->(context, state, n_new, user_data) {
assert_same udata, user_data
}
@whisper.transcribe(@audio, @params)
end
def test_on_progress
first = nil
last = nil
@params.on_progress do |progress|
assert_kind_of Integer, progress
assert 0 <= progress && progress <= 100
first = progress if first.nil?
last = progress
end
@whisper.transcribe(@audio, @params)
assert_equal 0, first
assert_equal 100, last
end
def test_abort_callback
i = 0
@params.abort_callback = ->(user_data) {
assert_nil user_data
i += 1
return false
}
@whisper.transcribe(@audio, @params)
assert i > 0
end
def test_abort_callback_abort
i = 0
@params.abort_callback = ->(user_data) {
i += 1
return i == 3
}
@whisper.transcribe(@audio, @params)
assert_equal 3, i
end
def test_abort_callback_user_data
udata = Object.new
@params.abort_callback_user_data = udata
yielded = nil
@params.abort_callback = ->(user_data) {
yielded = user_data
}
@whisper.transcribe(@audio, @params)
assert_same udata, yielded
end
def test_abort_on
do_abort = false
aborted_from_callback = false
@params.on_new_segment do |segment|
do_abort = true if segment.text.match? /ask/
end
i = 0
@params.abort_on do
i += 1
do_abort
end
@whisper.transcribe(@audio, @params)
assert i > 0
end
end

View File

@ -0,0 +1,44 @@
require_relative "helper"
class TestModel < TestBase
def test_model
whisper = Whisper::Context.new(MODEL)
assert_instance_of Whisper::Model, whisper.model
end
def test_attributes
whisper = Whisper::Context.new(MODEL)
model = whisper.model
assert_equal 51864, model.n_vocab
assert_equal 1500, model.n_audio_ctx
assert_equal 512, model.n_audio_state
assert_equal 8, model.n_audio_head
assert_equal 6, model.n_audio_layer
assert_equal 448, model.n_text_ctx
assert_equal 512, model.n_text_state
assert_equal 8, model.n_text_head
assert_equal 6, model.n_text_layer
assert_equal 80, model.n_mels
assert_equal 1, model.ftype
assert_equal "base", model.type
end
def test_gc
model = Whisper::Context.new(MODEL).model
GC.start
assert_equal 51864, model.n_vocab
assert_equal 1500, model.n_audio_ctx
assert_equal 512, model.n_audio_state
assert_equal 8, model.n_audio_head
assert_equal 6, model.n_audio_layer
assert_equal 448, model.n_text_ctx
assert_equal 512, model.n_text_state
assert_equal 8, model.n_text_head
assert_equal 6, model.n_text_layer
assert_equal 80, model.n_mels
assert_equal 1, model.ftype
assert_equal "base", model.type
end
end

View File

@ -0,0 +1,31 @@
require_relative "helper"
require 'tempfile'
require 'tmpdir'
require 'shellwords'
class TestPackage < TestBase
def test_build
Tempfile.create do |file|
assert system("gem", "build", "whispercpp.gemspec", "--output", file.to_path.shellescape, exception: true)
assert file.size > 0
assert_path_exist file.to_path
end
end
sub_test_case "Building binary on installation" do
def setup
system "rake", "build", exception: true
end
def test_install
match_data = `rake -Tbuild`.match(/(whispercpp-(.+)\.gem)/)
filename = match_data[1]
version = match_data[2]
basename = "whisper.#{RbConfig::CONFIG["DLEXT"]}"
Dir.mktmpdir do |dir|
system "gem", "install", "--install-dir", dir.shellescape, "pkg/#{filename.shellescape}", exception: true
assert_path_exist File.join(dir, "gems/whispercpp-#{version}/lib", basename)
end
end
end
end

View File

@ -0,0 +1,154 @@
require_relative "helper"
class TestParams < TestBase
def setup
@params = Whisper::Params.new
end
def test_language
@params.language = "en"
assert_equal @params.language, "en"
@params.language = "auto"
assert_equal @params.language, "auto"
end
def test_offset
@params.offset = 10_000
assert_equal @params.offset, 10_000
@params.offset = 0
assert_equal @params.offset, 0
end
def test_duration
@params.duration = 60_000
assert_equal @params.duration, 60_000
@params.duration = 0
assert_equal @params.duration, 0
end
def test_max_text_tokens
@params.max_text_tokens = 300
assert_equal @params.max_text_tokens, 300
@params.max_text_tokens = 0
assert_equal @params.max_text_tokens, 0
end
def test_translate
@params.translate = true
assert @params.translate
@params.translate = false
assert !@params.translate
end
def test_no_context
@params.no_context = true
assert @params.no_context
@params.no_context = false
assert !@params.no_context
end
def test_single_segment
@params.single_segment = true
assert @params.single_segment
@params.single_segment = false
assert !@params.single_segment
end
def test_print_special
@params.print_special = true
assert @params.print_special
@params.print_special = false
assert !@params.print_special
end
def test_print_progress
@params.print_progress = true
assert @params.print_progress
@params.print_progress = false
assert !@params.print_progress
end
def test_print_realtime
@params.print_realtime = true
assert @params.print_realtime
@params.print_realtime = false
assert !@params.print_realtime
end
def test_print_timestamps
@params.print_timestamps = true
assert @params.print_timestamps
@params.print_timestamps = false
assert !@params.print_timestamps
end
def test_suppress_blank
@params.suppress_blank = true
assert @params.suppress_blank
@params.suppress_blank = false
assert !@params.suppress_blank
end
def test_suppress_non_speech_tokens
@params.suppress_non_speech_tokens = true
assert @params.suppress_non_speech_tokens
@params.suppress_non_speech_tokens = false
assert !@params.suppress_non_speech_tokens
end
def test_token_timestamps
@params.token_timestamps = true
assert @params.token_timestamps
@params.token_timestamps = false
assert !@params.token_timestamps
end
def test_split_on_word
@params.split_on_word = true
assert @params.split_on_word
@params.split_on_word = false
assert !@params.split_on_word
end
def test_initial_prompt
assert_nil @params.initial_prompt
@params.initial_prompt = "You are a polite person."
assert_equal "You are a polite person.", @params.initial_prompt
end
def test_temperature
assert_equal 0.0, @params.temperature
@params.temperature = 0.5
assert_equal 0.5, @params.temperature
end
def test_max_initial_ts
assert_equal 1.0, @params.max_initial_ts
@params.max_initial_ts = 600.0
assert_equal 600.0, @params.max_initial_ts
end
def test_length_penalty
assert_equal -1.0, @params.length_penalty
@params.length_penalty = 0.5
assert_equal 0.5, @params.length_penalty
end
def test_temperature_inc
assert_in_delta 0.2, @params.temperature_inc
@params.temperature_inc = 0.5
assert_in_delta 0.5, @params.temperature_inc
end
def test_entropy_thold
assert_in_delta 2.4, @params.entropy_thold
@params.entropy_thold = 3.0
assert_in_delta 3.0, @params.entropy_thold
end
def test_logprob_thold
assert_in_delta -1.0, @params.logprob_thold
@params.logprob_thold = -0.5
assert_in_delta -0.5, @params.logprob_thold
end
end

View File

@ -0,0 +1,83 @@
require_relative "helper"
class TestSegment < TestBase
class << self
attr_reader :whisper
def startup
@whisper = Whisper::Context.new(TestBase::MODEL)
params = Whisper::Params.new
params.print_timestamps = false
@whisper.transcribe(TestBase::AUDIO, params)
end
end
def test_iteration
whisper.each_segment do |segment|
assert_instance_of Whisper::Segment, segment
end
end
def test_enumerator
enum = whisper.each_segment
assert_instance_of Enumerator, enum
enum.to_a.each_with_index do |segment, index|
assert_instance_of Whisper::Segment, segment
assert_kind_of Integer, index
end
end
def test_start_time
i = 0
whisper.each_segment do |segment|
assert_equal 0, segment.start_time if i == 0
i += 1
end
end
def test_end_time
i = 0
whisper.each_segment do |segment|
assert_equal whisper.full_get_segment_t1(i) * 10, segment.end_time
i += 1
end
end
def test_on_new_segment
params = Whisper::Params.new
seg = nil
index = 0
params.on_new_segment do |segment|
assert_instance_of Whisper::Segment, segment
if index == 0
seg = segment
assert_equal 0, segment.start_time
assert_match /ask not what your country can do for you, ask what you can do for your country/, segment.text
end
index += 1
end
whisper.transcribe(AUDIO, params)
assert_equal 0, seg.start_time
assert_match /ask not what your country can do for you, ask what you can do for your country/, seg.text
end
def test_on_new_segment_twice
params = Whisper::Params.new
seg = nil
params.on_new_segment do |segment|
seg = segment
return
end
params.on_new_segment do |segment|
assert_same seg, segment
return
end
whisper.transcribe(AUDIO, params)
end
private
def whisper
self.class.whisper
end
end

View File

@ -1,138 +1,127 @@
TOPDIR = File.expand_path(File.join(File.dirname(__FILE__), '..'))
EXTDIR = File.join(TOPDIR, 'ext')
#$LIBDIR = File.join(TOPDIR, 'lib')
#$:.unshift(LIBDIR)
$:.unshift(EXTDIR)
require_relative "helper"
require "stringio"
require 'whisper'
require 'test/unit'
# Exists to detect memory-related bug
Whisper.log_set ->(level, buffer, user_data) {}, nil
class TestWhisper < Test::Unit::TestCase
class TestWhisper < TestBase
def setup
@params = Whisper::Params.new
end
def test_language
@params.language = "en"
assert_equal @params.language, "en"
@params.language = "auto"
assert_equal @params.language, "auto"
end
def test_offset
@params.offset = 10_000
assert_equal @params.offset, 10_000
@params.offset = 0
assert_equal @params.offset, 0
end
def test_duration
@params.duration = 60_000
assert_equal @params.duration, 60_000
@params.duration = 0
assert_equal @params.duration, 0
end
def test_max_text_tokens
@params.max_text_tokens = 300
assert_equal @params.max_text_tokens, 300
@params.max_text_tokens = 0
assert_equal @params.max_text_tokens, 0
end
def test_translate
@params.translate = true
assert @params.translate
@params.translate = false
assert !@params.translate
end
def test_no_context
@params.no_context = true
assert @params.no_context
@params.no_context = false
assert !@params.no_context
end
def test_single_segment
@params.single_segment = true
assert @params.single_segment
@params.single_segment = false
assert !@params.single_segment
end
def test_print_special
@params.print_special = true
assert @params.print_special
@params.print_special = false
assert !@params.print_special
end
def test_print_progress
@params.print_progress = true
assert @params.print_progress
@params.print_progress = false
assert !@params.print_progress
end
def test_print_realtime
@params.print_realtime = true
assert @params.print_realtime
@params.print_realtime = false
assert !@params.print_realtime
end
def test_print_timestamps
@params.print_timestamps = true
assert @params.print_timestamps
@params.print_timestamps = false
assert !@params.print_timestamps
end
def test_suppress_blank
@params.suppress_blank = true
assert @params.suppress_blank
@params.suppress_blank = false
assert !@params.suppress_blank
end
def test_suppress_non_speech_tokens
@params.suppress_non_speech_tokens = true
assert @params.suppress_non_speech_tokens
@params.suppress_non_speech_tokens = false
assert !@params.suppress_non_speech_tokens
end
def test_token_timestamps
@params.token_timestamps = true
assert @params.token_timestamps
@params.token_timestamps = false
assert !@params.token_timestamps
end
def test_split_on_word
@params.split_on_word = true
assert @params.split_on_word
@params.split_on_word = false
assert !@params.split_on_word
end
def test_speed_up
@params.speed_up = true
assert @params.speed_up
@params.speed_up = false
assert !@params.speed_up
end
def test_whisper
@whisper = Whisper::Context.new(File.join(TOPDIR, '..', '..', 'models', 'ggml-base.en.bin'))
@whisper = Whisper::Context.new(MODEL)
params = Whisper::Params.new
params.print_timestamps = false
jfk = File.join(TOPDIR, '..', '..', 'samples', 'jfk.wav')
@whisper.transcribe(jfk, params) {|text|
@whisper.transcribe(AUDIO, params) {|text|
assert_match /ask not what your country can do for you, ask what you can do for your country/, text
}
end
sub_test_case "After transcription" do
class << self
attr_reader :whisper
def startup
@whisper = Whisper::Context.new(TestBase::MODEL)
params = Whisper::Params.new
params.print_timestamps = false
@whisper.transcribe(TestBase::AUDIO, params)
end
end
def whisper
self.class.whisper
end
def test_full_n_segments
assert_equal 1, whisper.full_n_segments
end
def test_full_lang_id
assert_equal 0, whisper.full_lang_id
end
def test_full_get_segment_t0
assert_equal 0, whisper.full_get_segment_t0(0)
assert_raise IndexError do
whisper.full_get_segment_t0(whisper.full_n_segments)
end
assert_raise IndexError do
whisper.full_get_segment_t0(-1)
end
end
def test_full_get_segment_t1
t1 = whisper.full_get_segment_t1(0)
assert_kind_of Integer, t1
assert t1 > 0
assert_raise IndexError do
whisper.full_get_segment_t1(whisper.full_n_segments)
end
end
def test_full_get_segment_speaker_turn_next
assert_false whisper.full_get_segment_speaker_turn_next(0)
end
def test_full_get_segment_text
assert_match /ask not what your country can do for you, ask what you can do for your country/, whisper.full_get_segment_text(0)
end
end
def test_lang_max_id
assert_kind_of Integer, Whisper.lang_max_id
end
def test_lang_id
assert_equal 0, Whisper.lang_id("en")
assert_raise ArgumentError do
Whisper.lang_id("non existing language")
end
end
def test_lang_str
assert_equal "en", Whisper.lang_str(0)
assert_raise IndexError do
Whisper.lang_str(Whisper.lang_max_id + 1)
end
end
def test_lang_str_full
assert_equal "english", Whisper.lang_str_full(0)
assert_raise IndexError do
Whisper.lang_str_full(Whisper.lang_max_id + 1)
end
end
def test_log_set
user_data = Object.new
logs = []
log_callback = ->(level, buffer, udata) {
logs << [level, buffer, udata]
}
Whisper.log_set log_callback, user_data
Whisper::Context.new(MODEL)
assert logs.length > 30
logs.each do |log|
assert_equal Whisper::LOG_LEVEL_INFO, log[0]
assert_same user_data, log[2]
end
end
def test_log_suppress
stderr = $stderr
Whisper.log_set ->(level, buffer, user_data) {
# do nothing
}, nil
dev = StringIO.new("")
$stderr = dev
Whisper::Context.new(MODEL)
assert_empty dev.string
ensure
$stderr = stderr
end
end

View File

@ -0,0 +1,36 @@
require "yaml"
Gem::Specification.new do |s|
s.name = "whispercpp"
s.authors = ["Georgi Gerganov", "Todd A. Fisher"]
s.version = '1.3.0'
s.date = '2024-05-14'
s.description = %q{High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model via Ruby}
s.email = 'todd.fisher@gmail.com'
s.extra_rdoc_files = ['LICENSE', 'README.md']
s.files = `git ls-files . -z`.split("\x0") +
YAML.load_file("extsources.yaml").collect {|file|
basename = File.basename(file)
if s.extra_rdoc_files.include?(basename)
basename
else
File.join("ext", basename)
end
}
s.summary = %q{Ruby whisper.cpp bindings}
s.test_files = ["tests/test_whisper.rb"]
s.extensions << 'ext/extconf.rb'
#### Documentation and testing.
s.homepage = 'https://github.com/ggerganov/whisper.cpp'
s.rdoc_options = ['--main', '../../README.md']
s.platform = Gem::Platform::RUBY
s.licenses = ['MIT']
end

View File

@ -1,54 +0,0 @@
# Add new build types
# ReleaseGG - Release with enabled asserts
SET(CMAKE_CXX_FLAGS_RELEASEGG
"-O3"
CACHE STRING "Flags used by the c++ compiler during release builds with enabled asserts."
FORCE )
SET(CMAKE_C_FLAGS_RELEASEGG
"-O3"
CACHE STRING "Flags used by the compiler during release builds with enabled asserts."
FORCE )
SET(CMAKE_EXE_LINKER_FLAGS_RELEASEGG
""
CACHE STRING "Flags used for linking binaries during release builds with enabled asserts."
FORCE )
SET(CMAKE_SHARED_LINKER_FLAGS_RELEASEGG
""
CACHE STRING "Flags used by the shared libraries linker during release builds with enabled asserts."
FORCE )
MARK_AS_ADVANCED(
CMAKE_CXX_FLAGS_RELEASEGG
CMAKE_C_FLAGS_RELEASEGG
CMAKE_EXE_LINKER_FLAGS_RELEASEGG
CMAKE_SHARED_LINKER_FLAGS_RELEASEGG )
# RelWithDebInfoGG - RelWithDebInfo with enabled asserts
SET(CMAKE_CXX_FLAGS_RELWITHDEBINFOGG
"-O2 -g"
CACHE STRING "Flags used by the c++ compiler during release builds with debug symbols and enabled asserts."
FORCE )
SET(CMAKE_C_FLAGS_RELWITHDEBINFOGG
"-O2 -g"
CACHE STRING "Flags used by the compiler during release builds with debug symbols and enabled asserts."
FORCE )
SET(CMAKE_EXE_LINKER_FLAGS_RELWITHDEBINFOGG
""
CACHE STRING "Flags used for linking binaries during release builds with debug symbols and enabled asserts."
FORCE )
SET(CMAKE_SHARED_LINKER_FLAGS_RELWITHDEBINFOGG
""
CACHE STRING "Flags used by the shared libraries linker during release builds with debug symbols and enabled asserts."
FORCE )
MARK_AS_ADVANCED(
CMAKE_CXX_FLAGS_RELWITHDEBINFOGG
CMAKE_C_FLAGS_RELWITHDEBINFOGG
CMAKE_EXE_LINKER_FLAGS_RELWITHDEBINFOGG
CMAKE_SHARED_LINKER_FLAGS_RELWITHDEBINFOGG )
if (NOT XCODE AND NOT MSVC AND NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo" "ReleaseGG" "RelWithDebInfoGG")
endif()

View File

@ -13,5 +13,5 @@ set_target_properties(${TARGET}
PROPERTIES
EXPORT_COMPILE_COMMANDS ON
RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/bin"
INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/lib"
INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/lib"
)

163
cmake/FindFFmpeg.cmake Normal file
View File

@ -0,0 +1,163 @@
# From
# https://github.com/snikulov/cmake-modules/blob/master/FindFFmpeg.cmake
#
# vim: ts=2 sw=2
# - Try to find the required ffmpeg components(default: AVFORMAT, AVUTIL, AVCODEC)
#
# Once done this will define
# FFMPEG_FOUND - System has the all required components.
# FFMPEG_INCLUDE_DIRS - Include directory necessary for using the required components headers.
# FFMPEG_LIBRARIES - Link these to use the required ffmpeg components.
# FFMPEG_DEFINITIONS - Compiler switches required for using the required ffmpeg components.
#
# For each of the components it will additionally set.
# - AVCODEC
# - AVDEVICE
# - AVFORMAT
# - AVFILTER
# - AVUTIL
# - POSTPROC
# - SWSCALE
# the following variables will be defined
# <component>_FOUND - System has <component>
# <component>_INCLUDE_DIRS - Include directory necessary for using the <component> headers
# <component>_LIBRARIES - Link these to use <component>
# <component>_DEFINITIONS - Compiler switches required for using <component>
# <component>_VERSION - The components version
#
# Copyright (c) 2006, Matthias Kretz, <kretz@kde.org>
# Copyright (c) 2008, Alexander Neundorf, <neundorf@kde.org>
# Copyright (c) 2011, Michael Jansen, <kde@michael-jansen.biz>
#
# Redistribution and use is allowed according to the terms of the BSD license.
# For details see the accompanying COPYING-CMAKE-SCRIPTS file.
include(FindPackageHandleStandardArgs)
# The default components were taken from a survey over other FindFFMPEG.cmake files
if (NOT FFmpeg_FIND_COMPONENTS)
set(FFmpeg_FIND_COMPONENTS AVFORMAT AVCODEC AVUTIL SWRESAMPLE)
endif()
#
### Macro: set_component_found
#
# Marks the given component as found if both *_LIBRARIES AND *_INCLUDE_DIRS is present.
#
macro(set_component_found _component )
if (${_component}_LIBRARIES AND ${_component}_INCLUDE_DIRS)
message(DEBUG " - ${_component} found.")
set(${_component}_FOUND TRUE)
else ()
message(DEBUG " - ${_component} not found.")
endif ()
endmacro()
#
### Macro: find_component
#
# Checks for the given component by invoking pkgconfig and then looking up the libraries and
# include directories.
#
macro(find_component _component _pkgconfig _library _header)
if (NOT WIN32)
# use pkg-config to get the directories and then use these values
# in the FIND_PATH() and FIND_LIBRARY() calls
find_package(PkgConfig)
if (PKG_CONFIG_FOUND)
pkg_check_modules(PC_${_component} ${_pkgconfig})
message(STATUS "Pkgconfig found: ${PC_${_component}_INCLUDEDIR}")
message(STATUS "Pkgconfig found: ${PC_${_component}_INCLUDE_DIRS}")
message(STATUS "${PC_${_component}_CFLAGS}")
endif ()
endif (NOT WIN32)
find_path(${_component}_INCLUDE_DIRS ${_header}
HINTS
${PC_${_component}_INCLUDEDIR}
${PC_${_component}_INCLUDE_DIRS}
PATH_SUFFIXES
ffmpeg
)
# CMake's default is to search first for shared libraries and then for static libraries.
# Todo later: add option to prefer static libs over dynamic:
find_library(${_component}_LIBRARIES NAMES ${_library} lib${_library}.a
HINTS
${PC_${_component}_LIBDIR}
${PC_${_component}_LIBRARY_DIRS}
)
set(${_component}_DEFINITIONS ${PC_${_component}_CFLAGS_OTHER} CACHE STRING "The ${_component} CFLAGS.")
set(${_component}_VERSION ${PC_${_component}_VERSION} CACHE STRING "The ${_component} version number.")
set_component_found(${_component})
mark_as_advanced(
${_component}_INCLUDE_DIRS
${_component}_LIBRARIES
${_component}_DEFINITIONS
${_component}_VERSION)
endmacro()
# Check for cached results. If there are skip the costly part.
if (NOT FFMPEG_LIBRARIES)
# Check for all possible component.
find_component(AVCODEC libavcodec avcodec libavcodec/avcodec.h)
find_component(AVFORMAT libavformat avformat libavformat/avformat.h)
find_component(AVDEVICE libavdevice avdevice libavdevice/avdevice.h)
#find_component(AVRESAMPLE libavresample avresample libavresample/avresample.h) # old name for swresample
find_component(AVUTIL libavutil avutil libavutil/avutil.h)
find_component(AVFILTER libavfilter avfilter libavfilter/avfilter.h)
find_component(SWSCALE libswscale swscale libswscale/swscale.h)
find_component(POSTPROC libpostproc postproc libpostproc/postprocess.h)
find_component(SWRESAMPLE libswresample swresample libswresample/swresample.h)
# Check if the required components were found and add their stuff to the FFMPEG_* vars.
foreach (_component ${FFmpeg_FIND_COMPONENTS})
if (${_component}_FOUND)
# message(STATUS "Required component ${_component} present.")
set(FFMPEG_LIBRARIES ${FFMPEG_LIBRARIES} ${${_component}_LIBRARIES})
set(FFMPEG_DEFINITIONS ${FFMPEG_DEFINITIONS} ${${_component}_DEFINITIONS})
list(APPEND FFMPEG_INCLUDE_DIRS ${${_component}_INCLUDE_DIRS})
else ()
# message(STATUS "Required component ${_component} missing.")
endif ()
endforeach ()
# Build the include path with duplicates removed.
if (FFMPEG_INCLUDE_DIRS)
list(REMOVE_DUPLICATES FFMPEG_INCLUDE_DIRS)
endif ()
# cache the vars.
set(FFMPEG_INCLUDE_DIRS ${FFMPEG_INCLUDE_DIRS} CACHE STRING "The FFmpeg include directories." FORCE)
set(FFMPEG_LIBRARIES ${FFMPEG_LIBRARIES} CACHE STRING "The FFmpeg libraries." FORCE)
set(FFMPEG_DEFINITIONS ${FFMPEG_DEFINITIONS} CACHE STRING "The FFmpeg cflags." FORCE)
mark_as_advanced(FFMPEG_INCLUDE_DIRS
FFMPEG_LIBRARIES
FFMPEG_DEFINITIONS)
endif ()
# Now set the noncached _FOUND vars for the components.
# whisper.cpp does not need SWSCALE
foreach (_component AVCODEC AVDEVICE AVFORMAT AVRESAMPLE AVUTIL POSTPROCESS)
set_component_found(${_component})
endforeach ()
# Compile the list of required vars
set(_FFmpeg_REQUIRED_VARS FFMPEG_LIBRARIES FFMPEG_INCLUDE_DIRS)
foreach (_component ${FFmpeg_FIND_COMPONENTS})
list(APPEND _FFmpeg_REQUIRED_VARS ${_component}_LIBRARIES ${_component}_INCLUDE_DIRS)
endforeach ()
# Give a nice error message if some of the required vars are missing.
find_package_handle_standard_args(FFmpeg DEFAULT_MSG ${_FFmpeg_REQUIRED_VARS})

58
cmake/build-info.cmake Normal file
View File

@ -0,0 +1,58 @@
set(BUILD_NUMBER 0)
set(BUILD_COMMIT "unknown")
set(BUILD_COMPILER "unknown")
set(BUILD_TARGET "unknown")
# Look for git
find_package(Git)
if(NOT Git_FOUND)
find_program(GIT_EXECUTABLE NAMES git git.exe)
if(GIT_EXECUTABLE)
set(Git_FOUND TRUE)
message(STATUS "Found Git: ${GIT_EXECUTABLE}")
else()
message(WARNING "Git not found. Build info will not be accurate.")
endif()
endif()
# Get the commit count and hash
if(Git_FOUND)
execute_process(
COMMAND ${GIT_EXECUTABLE} rev-parse --short HEAD
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
OUTPUT_VARIABLE HEAD
OUTPUT_STRIP_TRAILING_WHITESPACE
RESULT_VARIABLE RES
)
if (RES EQUAL 0)
set(BUILD_COMMIT ${HEAD})
endif()
execute_process(
COMMAND ${GIT_EXECUTABLE} rev-list --count HEAD
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
OUTPUT_VARIABLE COUNT
OUTPUT_STRIP_TRAILING_WHITESPACE
RESULT_VARIABLE RES
)
if (RES EQUAL 0)
set(BUILD_NUMBER ${COUNT})
endif()
endif()
if(MSVC)
set(BUILD_COMPILER "${CMAKE_C_COMPILER_ID} ${CMAKE_C_COMPILER_VERSION}")
set(BUILD_TARGET ${CMAKE_VS_PLATFORM_NAME})
else()
execute_process(
COMMAND sh -c "$@ --version | head -1" _ ${CMAKE_C_COMPILER}
OUTPUT_VARIABLE OUT
OUTPUT_STRIP_TRAILING_WHITESPACE
)
set(BUILD_COMPILER ${OUT})
execute_process(
COMMAND ${CMAKE_C_COMPILER} -dumpmachine
OUTPUT_VARIABLE OUT
OUTPUT_STRIP_TRAILING_WHITESPACE
)
set(BUILD_TARGET ${OUT})
endif()

View File

@ -0,0 +1,65 @@
set(WHISPER_VERSION @WHISPER_INSTALL_VERSION@)
set(WHISPER_BUILD_COMMIT @WHISPER_BUILD_COMMIT@)
set(WHISPER_BUILD_NUMBER @WHISPER_BUILD_NUMBER@)
set(WHISPER_SHARED_LIB @BUILD_SHARED_LIBS@)
set(GGML_BLAS @GGML_BLAS@)
set(GGML_CUDA @GGML_CUDA@)
set(GGML_METAL @GGML_METAL@)
set(GGML_HIPBLAS @GGML_HIPBLAS@)
set(GGML_ACCELERATE @GGML_ACCELERATE@)
@PACKAGE_INIT@
set_and_check(WHISPER_INCLUDE_DIR "@PACKAGE_WHISPER_INCLUDE_INSTALL_DIR@")
set_and_check(WHISPER_LIB_DIR "@PACKAGE_WHISPER_LIB_INSTALL_DIR@")
set_and_check(WHISPER_BIN_DIR "@PACKAGE_WHISPER_BIN_INSTALL_DIR@")
# Ensure transient dependencies satisfied
find_package(Threads REQUIRED)
if (APPLE AND GGML_ACCELERATE)
find_library(ACCELERATE_FRAMEWORK Accelerate REQUIRED)
endif()
if (GGML_BLAS)
find_package(BLAS REQUIRED)
endif()
if (GGML_CUDA)
find_package(CUDAToolkit REQUIRED)
endif()
if (GGML_METAL)
find_library(FOUNDATION_LIBRARY Foundation REQUIRED)
find_library(METAL_FRAMEWORK Metal REQUIRED)
find_library(METALKIT_FRAMEWORK MetalKit REQUIRED)
endif()
if (GGML_HIPBLAS)
find_package(hip REQUIRED)
find_package(hipblas REQUIRED)
find_package(rocblas REQUIRED)
endif()
find_library(whisper_LIBRARY whisper
REQUIRED
HINTS ${WHISPER_LIB_DIR})
set(_whisper_link_deps "Threads::Threads" "@WHISPER_EXTRA_LIBS@")
set(_whisper_transient_defines "@WHISPER_TRANSIENT_DEFINES@")
add_library(whisper UNKNOWN IMPORTED)
set_target_properties(whisper
PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES "${WHISPER_INCLUDE_DIR}"
INTERFACE_LINK_LIBRARIES "${_whisper_link_deps}"
INTERFACE_COMPILE_DEFINITIONS "${_whisper_transient_defines}"
IMPORTED_LINK_INTERFACE_LANGUAGES "CXX"
IMPORTED_LOCATION "${whisper_LIBRARY}"
INTERFACE_COMPILE_FEATURES cxx_std_11
POSITION_INDEPENDENT_CODE ON )
check_required_components(whisper)

10
cmake/whisper.pc.in Normal file
View File

@ -0,0 +1,10 @@
prefix=@CMAKE_INSTALL_PREFIX@
exec_prefix=${prefix}
libdir=@CMAKE_INSTALL_FULL_LIBDIR@
includedir=${prefix}/include
Name: whisper
Description: Port of OpenAI's Whisper model in C/C++
Version: @PROJECT_VERSION@
Libs: -L${libdir} -lwhisper
Cflags: -I${includedir}

View File

@ -11,25 +11,62 @@ if (WHISPER_SDL2)
string(STRIP "${SDL2_LIBRARIES}" SDL2_LIBRARIES)
message(STATUS "SDL2_INCLUDE_DIRS = ${SDL2_INCLUDE_DIRS}")
message(STATUS "SDL2_LIBRARIES = ${SDL2_LIBRARIES}")
message(STATUS "SDL2_LIBRARIES = ${SDL2_LIBRARIES}")
endif()
if (WHISPER_CLBLAST)
find_package(CLBlast REQUIRED)
endif()
# common
set(TARGET common)
unset(COMMON_EXTRA_LIBS)
if (WHISPER_FFMPEG)
# As of cmake 3.27, there is no official cmake support for FindFFmpeg.
# Consequnelty we added a FindFFmpeg.cmake script the cmake subfolder:
# whisper.cpp does not need the full ffmpeg libs, just AVFORMAT AVCODEC AVUTIL SWRESAMPLE
# libswresample performs highly optimized audio resampling, rematrixing and sample format conversion operations
# libavcodec provides a generic encoding/decoding framework and contains multiple decoders and encoders for audio, video and subtitle streams, and several bitstream filters.
# libavformat provides a generic framework for multiplexing and demultiplexing (muxing and demuxing) audio, video and subtitle streams.
find_package(FFmpeg REQUIRED)
if (NOT ${FFMPEG_FOUND})
message(FATAL_ERROR "Cannot find ffmpeg libs/headers")
endif()
message(STATUS "Found ffmpeg libs: ${FFMPEG_LIBRARIES}")
message(STATUS "Found ffmpeg headers in: ${FFMPEG_INCLUDE_DIRS}")
message(STATUS "ffmpeg definitions: ${FFMPEG_DEFINITIONS}")
message(STATUS "Found avformat ${AVFORMAT_VERSION}")
include_directories(${FFMPEG_INCLUDE_DIRS})
add_compile_definitions(WHISPER_FFMPEG)
list(APPEND COMMON_EXTRA_LIBS ${FFMPEG_LIBRARIES})
set(COMMON_SOURCES_FFMPEG ffmpeg-transcode.cpp)
endif()
add_library(${TARGET} STATIC
common.h
common.cpp
common-ggml.h
common-ggml.cpp
grammar-parser.h
grammar-parser.cpp
${COMMON_SOURCES_FFMPEG}
)
include(DefaultTargetOptions)
target_link_libraries(${TARGET} PRIVATE whisper)
target_link_libraries(${TARGET} PRIVATE whisper ${COMMON_EXTRA_LIBS})
set_target_properties(${TARGET} PROPERTIES POSITION_INDEPENDENT_CODE ON)
set_target_properties(${TARGET} PROPERTIES FOLDER "libs")
if (WHISPER_SDL2)
# common-sdl
@ -43,31 +80,69 @@ if (WHISPER_SDL2)
include(DefaultTargetOptions)
target_include_directories(${TARGET} PUBLIC ${SDL2_INCLUDE_DIRS})
target_link_libraries(${TARGET} PRIVATE ${SDL2_LIBRARIES})
target_include_directories(${TARGET} PUBLIC ${SDL2_INCLUDE_DIRS})
target_link_libraries (${TARGET} PRIVATE ${SDL2_LIBRARIES})
set_target_properties(${TARGET} PROPERTIES POSITION_INDEPENDENT_CODE ON)
set_target_properties(${TARGET} PROPERTIES FOLDER "libs")
endif()
# add json lib
add_library(json_cpp INTERFACE)
target_include_directories(json_cpp INTERFACE ${CMAKE_CURRENT_SOURCE_DIR})
# examples
include_directories(${CMAKE_CURRENT_SOURCE_DIR})
if (EMSCRIPTEN)
add_subdirectory(whisper.wasm)
set_target_properties(libmain PROPERTIES FOLDER "libs")
add_subdirectory(stream.wasm)
set_target_properties(libstream PROPERTIES FOLDER "libs")
add_subdirectory(command.wasm)
add_subdirectory(talk.wasm)
set_target_properties(libcommand PROPERTIES FOLDER "libs")
#add_subdirectory(talk.wasm)
#set_target_properties(libtalk PROPERTIES FOLDER "libs")
add_subdirectory(bench.wasm)
set_target_properties(libbench PROPERTIES FOLDER "libs")
elseif(CMAKE_JS_VERSION)
add_subdirectory(addon.node)
set_target_properties(addon.node PROPERTIES FOLDER "examples")
else()
add_subdirectory(main)
set_target_properties(main PROPERTIES FOLDER "examples")
if (WHISPER_SDL2)
add_subdirectory(stream)
set_target_properties(stream PROPERTIES FOLDER "examples")
endif (WHISPER_SDL2)
add_subdirectory(server)
set_target_properties(server PROPERTIES FOLDER "examples")
if (WHISPER_SDL2)
add_subdirectory(command)
set_target_properties(command PROPERTIES FOLDER "examples")
endif (WHISPER_SDL2)
add_subdirectory(bench)
set_target_properties(bench PROPERTIES FOLDER "examples")
add_subdirectory(quantize)
add_subdirectory(talk)
set_target_properties(quantize PROPERTIES FOLDER "examples")
if (WHISPER_SDL2)
# TODO: disabled until update
# https://github.com/ggerganov/whisper.cpp/issues/1818
#add_subdirectory(talk)
#set_target_properties(talk PROPERTIES FOLDER "examples")
add_subdirectory(talk-llama)
set_target_properties(talk-llama PROPERTIES FOLDER "examples")
add_subdirectory(lsp)
set_target_properties(lsp PROPERTIES FOLDER "examples")
if (GGML_SYCL)
add_subdirectory(sycl)
set_target_properties(ls-sycl-device PROPERTIES FOLDER "examples")
endif()
endif (WHISPER_SDL2)
endif()
if (WHISPER_SDL2)
add_subdirectory(wchess)
set_target_properties(wchess PROPERTIES FOLDER "examples")
endif (WHISPER_SDL2)

View File

@ -1,4 +1,4 @@
set(TARGET whisper-addon)
set(TARGET addon.node)
# Base settings
#==================================================================

View File

@ -14,14 +14,14 @@ npm install
Make sure it is in the project root directory and compiled with make-js.
```shell
npx cmake-js compile -T whisper-addon -B Release
npx cmake-js compile -T addon.node -B Release
```
For Electron addon and cmake-js options, you can see [cmake-js](https://github.com/cmake-js/cmake-js) and make very few configuration changes.
> Such as appointing special cmake path:
> ```shell
> npx cmake-js compile -c 'xxx/cmake' -T whisper-addon -B Release
> npx cmake-js compile -c 'xxx/cmake' -T addon.node -B Release
> ```
## Run

View File

@ -1,7 +1,7 @@
const path = require("path");
const { whisper } = require(path.join(
__dirname,
"../../../build/Release/whisper-addon"
"../../../build/Release/addon.node"
));
const { promisify } = require("util");
@ -12,6 +12,12 @@ const whisperParamsMock = {
model: path.join(__dirname, "../../../models/ggml-base.en.bin"),
fname_inp: path.join(__dirname, "../../../samples/jfk.wav"),
use_gpu: true,
flash_attn: false,
no_prints: true,
comma_in_time: false,
translate: true,
no_timestamps: false,
audio_ctx: 0,
};
describe("Run whisper.node", () => {

View File

@ -19,12 +19,12 @@ struct whisper_params {
int32_t max_len = 0;
int32_t best_of = 5;
int32_t beam_size = -1;
int32_t audio_ctx = 0;
float word_thold = 0.01f;
float entropy_thold = 2.4f;
float logprob_thold = -1.0f;
bool speed_up = false;
bool translate = false;
bool diarize = false;
bool output_txt = false;
@ -36,7 +36,10 @@ struct whisper_params {
bool print_colors = false;
bool print_progress = false;
bool no_timestamps = false;
bool no_prints = false;
bool use_gpu = true;
bool flash_attn = false;
bool comma_in_time = true;
std::string language = "en";
std::string prompt;
@ -44,6 +47,8 @@ struct whisper_params {
std::vector<std::string> fname_inp = {};
std::vector<std::string> fname_out = {};
std::vector<float> pcmf32 = {}; // mono-channel F32 PCM
};
struct whisper_print_user_data {
@ -52,27 +57,6 @@ struct whisper_print_user_data {
const std::vector<std::vector<float>> * pcmf32s;
};
// 500 -> 00:05.000
// 6000 -> 01:00.000
std::string to_timestamp(int64_t t, bool comma = false) {
int64_t msec = t * 10;
int64_t hr = msec / (1000 * 60 * 60);
msec = msec - hr * (1000 * 60 * 60);
int64_t min = msec / (1000 * 60);
msec = msec - min * (1000 * 60);
int64_t sec = msec / 1000;
msec = msec - sec * 1000;
char buf[32];
snprintf(buf, sizeof(buf), "%02d:%02d:%02d%s%03d", (int) hr, (int) min, (int) sec, comma ? "," : ".", (int) msec);
return std::string(buf);
}
int timestamp_to_sample(int64_t t, int n_samples) {
return std::max(0, std::min((int) n_samples - 1, (int) ((t*WHISPER_SAMPLE_RATE)/100)));
}
void whisper_print_segment_callback(struct whisper_context * ctx, struct whisper_state * state, int n_new, void * user_data) {
const auto & params = *((whisper_print_user_data *) user_data)->params;
const auto & pcmf32s = *((whisper_print_user_data *) user_data)->pcmf32s;
@ -104,8 +88,8 @@ void whisper_print_segment_callback(struct whisper_context * ctx, struct whisper
if (params.diarize && pcmf32s.size() == 2) {
const int64_t n_samples = pcmf32s[0].size();
const int64_t is0 = timestamp_to_sample(t0, n_samples);
const int64_t is1 = timestamp_to_sample(t1, n_samples);
const int64_t is0 = timestamp_to_sample(t0, n_samples, WHISPER_SAMPLE_RATE);
const int64_t is1 = timestamp_to_sample(t1, n_samples, WHISPER_SAMPLE_RATE);
double energy0 = 0.0f;
double energy1 = 0.0f;
@ -141,9 +125,15 @@ void whisper_print_segment_callback(struct whisper_context * ctx, struct whisper
}
}
void cb_log_disable(enum ggml_log_level, const char *, void *) {}
int run(whisper_params &params, std::vector<std::vector<std::string>> &result) {
if (params.fname_inp.empty()) {
fprintf(stderr, "error: no input files specified\n");
if (params.no_prints) {
whisper_log_set(cb_log_disable, NULL);
}
if (params.fname_inp.empty() && params.pcmf32.empty()) {
fprintf(stderr, "error: no input files or audio buffer specified\n");
return 2;
}
@ -154,8 +144,9 @@ int run(whisper_params &params, std::vector<std::vector<std::string>> &result) {
// whisper init
struct whisper_context_params cparams;
struct whisper_context_params cparams = whisper_context_default_params();
cparams.use_gpu = params.use_gpu;
cparams.flash_attn = params.flash_attn;
struct whisper_context * ctx = whisper_init_from_file_with_params(params.model.c_str(), cparams);
if (ctx == nullptr) {
@ -163,6 +154,14 @@ int run(whisper_params &params, std::vector<std::vector<std::string>> &result) {
return 3;
}
// if params.pcmf32 is provided, set params.fname_inp to "buffer"
// this is simpler than further modifications in the code
if (!params.pcmf32.empty()) {
fprintf(stderr, "info: using audio buffer as input\n");
params.fname_inp.clear();
params.fname_inp.emplace_back("buffer");
}
for (int f = 0; f < (int) params.fname_inp.size(); ++f) {
const auto fname_inp = params.fname_inp[f];
const auto fname_out = f < (int)params.fname_out.size() && !params.fname_out[f].empty() ? params.fname_out[f] : params.fname_inp[f];
@ -170,20 +169,25 @@ int run(whisper_params &params, std::vector<std::vector<std::string>> &result) {
std::vector<float> pcmf32; // mono-channel F32 PCM
std::vector<std::vector<float>> pcmf32s; // stereo-channel F32 PCM
if (!::read_wav(fname_inp, pcmf32, pcmf32s, params.diarize)) {
fprintf(stderr, "error: failed to read WAV file '%s'\n", fname_inp.c_str());
continue;
// read the input audio file if params.pcmf32 is not provided
if (params.pcmf32.empty()) {
if (!::read_wav(fname_inp, pcmf32, pcmf32s, params.diarize)) {
fprintf(stderr, "error: failed to read WAV file '%s'\n", fname_inp.c_str());
continue;
}
} else {
pcmf32 = params.pcmf32;
}
// print system information
{
if (!params.no_prints) {
fprintf(stderr, "\n");
fprintf(stderr, "system_info: n_threads = %d / %d | %s\n",
params.n_threads*params.n_processors, std::thread::hardware_concurrency(), whisper_print_system_info());
}
// print some info about the processing
{
if (!params.no_prints) {
fprintf(stderr, "\n");
if (!whisper_is_multilingual(ctx)) {
if (params.language != "en" || params.translate) {
@ -192,12 +196,13 @@ int run(whisper_params &params, std::vector<std::vector<std::string>> &result) {
fprintf(stderr, "%s: WARNING: model is not multilingual, ignoring language and translation options\n", __func__);
}
}
fprintf(stderr, "%s: processing '%s' (%d samples, %.1f sec), %d threads, %d processors, lang = %s, task = %s, timestamps = %d ...\n",
fprintf(stderr, "%s: processing '%s' (%d samples, %.1f sec), %d threads, %d processors, lang = %s, task = %s, timestamps = %d, audio_ctx = %d ...\n",
__func__, fname_inp.c_str(), int(pcmf32.size()), float(pcmf32.size())/WHISPER_SAMPLE_RATE,
params.n_threads, params.n_processors,
params.language.c_str(),
params.translate ? "translate" : "transcribe",
params.no_timestamps ? 0 : 1);
params.no_timestamps ? 0 : 1,
params.audio_ctx);
fprintf(stderr, "\n");
}
@ -224,14 +229,15 @@ int run(whisper_params &params, std::vector<std::vector<std::string>> &result) {
wparams.entropy_thold = params.entropy_thold;
wparams.logprob_thold = params.logprob_thold;
wparams.max_len = params.output_wts && params.max_len == 0 ? 60 : params.max_len;
wparams.speed_up = params.speed_up;
wparams.audio_ctx = params.audio_ctx;
wparams.greedy.best_of = params.best_of;
wparams.beam_search.beam_size = params.beam_size;
wparams.initial_prompt = params.prompt.c_str();
wparams.no_timestamps = params.no_timestamps;
whisper_print_user_data user_data = { &params, &pcmf32s };
// this callback is called on each new segment
@ -267,8 +273,8 @@ int run(whisper_params &params, std::vector<std::vector<std::string>> &result) {
const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
result[i].emplace_back(to_timestamp(t0, true));
result[i].emplace_back(to_timestamp(t1, true));
result[i].emplace_back(to_timestamp(t0, params.comma_in_time));
result[i].emplace_back(to_timestamp(t1, params.comma_in_time));
result[i].emplace_back(text);
}
@ -319,11 +325,33 @@ Napi::Value whisper(const Napi::CallbackInfo& info) {
std::string model = whisper_params.Get("model").As<Napi::String>();
std::string input = whisper_params.Get("fname_inp").As<Napi::String>();
bool use_gpu = whisper_params.Get("use_gpu").As<Napi::Boolean>();
bool flash_attn = whisper_params.Get("flash_attn").As<Napi::Boolean>();
bool no_prints = whisper_params.Get("no_prints").As<Napi::Boolean>();
bool no_timestamps = whisper_params.Get("no_timestamps").As<Napi::Boolean>();
int32_t audio_ctx = whisper_params.Get("audio_ctx").As<Napi::Number>();
bool comma_in_time = whisper_params.Get("comma_in_time").As<Napi::Boolean>();
Napi::Value pcmf32Value = whisper_params.Get("pcmf32");
std::vector<float> pcmf32_vec;
if (pcmf32Value.IsTypedArray()) {
Napi::Float32Array pcmf32 = pcmf32Value.As<Napi::Float32Array>();
size_t length = pcmf32.ElementLength();
pcmf32_vec.reserve(length);
for (size_t i = 0; i < length; i++) {
pcmf32_vec.push_back(pcmf32[i]);
}
}
params.language = language;
params.model = model;
params.fname_inp.emplace_back(input);
params.use_gpu = use_gpu;
params.flash_attn = flash_attn;
params.no_prints = no_prints;
params.no_timestamps = no_timestamps;
params.audio_ctx = audio_ctx;
params.pcmf32 = pcmf32_vec;
params.comma_in_time = comma_in_time;
Napi::Function callback = info[1].As<Napi::Function>();
Worker* worker = new Worker(callback, params);

View File

@ -1,7 +1,7 @@
const path = require("path");
const { whisper } = require(path.join(
__dirname,
"../../build/Release/whisper-addon"
"../../build/Release/addon.node"
));
const { promisify } = require("util");
@ -10,15 +10,27 @@ const whisperAsync = promisify(whisper);
const whisperParams = {
language: "en",
model: path.join(__dirname, "../../models/ggml-base.en.bin"),
fname_inp: "../../samples/jfk.wav",
fname_inp: path.join(__dirname, "../../samples/jfk.wav"),
use_gpu: true,
flash_attn: false,
no_prints: true,
comma_in_time: false,
translate: true,
no_timestamps: false,
audio_ctx: 0,
};
const arguments = process.argv.slice(2);
const params = Object.fromEntries(
arguments.reduce((pre, item) => {
if (item.startsWith("--")) {
return [...pre, item.slice(2).split("=")];
const [key, value] = item.slice(2).split("=");
if (key === "audio_ctx") {
whisperParams[key] = parseInt(value);
} else {
whisperParams[key] = value;
}
return pre;
}
return pre;
}, [])
@ -33,5 +45,6 @@ for (const key in params) {
console.log("whisperParams =", whisperParams);
whisperAsync(whisperParams).then((result) => {
console.log(`Result from whisper: ${result}`);
console.log();
console.log(result);
});

View File

@ -1,5 +1,5 @@
{
"name": "whisper-addon",
"name": "addon.node",
"version": "0.0.0",
"description": "",
"main": "index.js",

View File

@ -8,16 +8,17 @@
// command-line parameters
struct whisper_params {
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
int32_t what = 0; // what to benchmark: 0 - whisper ecoder, 1 - memcpy, 2 - ggml_mul_mat
int32_t what = 0; // what to benchmark: 0 - whisper encoder, 1 - memcpy, 2 - ggml_mul_mat
std::string model = "models/ggml-base.en.bin";
bool use_gpu = true;
bool use_gpu = true;
bool flash_attn = false;
};
void whisper_print_usage(int argc, char ** argv, const whisper_params & params);
bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
static bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
for (int i = 1; i < argc; i++) {
std::string arg = argv[i];
@ -25,10 +26,11 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
whisper_print_usage(argc, argv, params);
exit(0);
}
else if (arg == "-t" || arg == "--threads") { params.n_threads = std::stoi(argv[++i]); }
else if (arg == "-m" || arg == "--model") { params.model = argv[++i]; }
else if (arg == "-w" || arg == "--what") { params.what = atoi(argv[++i]); }
else if (arg == "-ng" || arg == "--no-gpu") { params.use_gpu = false; }
else if (arg == "-t" || arg == "--threads") { params.n_threads = std::stoi(argv[++i]); }
else if (arg == "-m" || arg == "--model") { params.model = argv[++i]; }
else if (arg == "-w" || arg == "--what") { params.what = atoi(argv[++i]); }
else if (arg == "-ng" || arg == "--no-gpu") { params.use_gpu = false; }
else if (arg == "-fa" || arg == "--flash-attn") { params.flash_attn = true; }
else {
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
whisper_print_usage(argc, argv, params);
@ -49,17 +51,20 @@ void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & para
fprintf(stderr, " -m FNAME, --model FNAME [%-7s] model path\n", params.model.c_str());
fprintf(stderr, " -w N, --what N [%-7d] what to benchmark:\n", params.what);
fprintf(stderr, " -ng, --no-gpu [%-7s] disable GPU\n", params.use_gpu ? "false" : "true");
fprintf(stderr, " -fa, --flash-attn [%-7s] enable flash attention\n", params.flash_attn ? "true" : "false");
fprintf(stderr, " %-7s 0 - whisper\n", "");
fprintf(stderr, " %-7s 1 - memcpy\n", "");
fprintf(stderr, " %-7s 2 - ggml_mul_mat\n", "");
fprintf(stderr, "\n");
}
int whisper_bench_full(const whisper_params & params) {
static int whisper_bench_full(const whisper_params & params) {
// whisper init
struct whisper_context_params cparams;
cparams.use_gpu = params.use_gpu;
struct whisper_context_params cparams = whisper_context_default_params();
cparams.use_gpu = params.use_gpu;
cparams.flash_attn = params.flash_attn;
struct whisper_context * ctx = whisper_init_from_file_with_params(params.model.c_str(), cparams);
@ -81,7 +86,7 @@ int whisper_bench_full(const whisper_params & params) {
}
// heat encoder
if (int ret = whisper_encode(ctx, 0, params.n_threads) != 0) {
fprintf(stderr, "error: failed to encode model: %d\n", ret);
fprintf(stderr, "error: failed to encode: %d\n", ret);
return 4;
}
@ -90,13 +95,13 @@ int whisper_bench_full(const whisper_params & params) {
// prompt heat
if (int ret = whisper_decode(ctx, tokens, 256, 0, params.n_threads) != 0) {
fprintf(stderr, "error: failed to encode model: %d\n", ret);
fprintf(stderr, "error: failed to decode: %d\n", ret);
return 4;
}
// text-generation heat
if (int ret = whisper_decode(ctx, tokens, 1, 256, params.n_threads) != 0) {
fprintf(stderr, "error: failed to encode model: %d\n", ret);
fprintf(stderr, "error: failed to decode: %d\n", ret);
return 4;
}
@ -104,20 +109,30 @@ int whisper_bench_full(const whisper_params & params) {
// actual run
if (int ret = whisper_encode(ctx, 0, params.n_threads) != 0) {
fprintf(stderr, "error: failed to encode model: %d\n", ret);
fprintf(stderr, "error: failed to encode: %d\n", ret);
return 4;
}
for (int i = 0; i < 16; i++) {
if (int ret = whisper_decode(ctx, tokens, 256, 0, params.n_threads) != 0) {
fprintf(stderr, "error: failed to encode model: %d\n", ret);
// text-generation
for (int i = 0; i < 256; i++) {
if (int ret = whisper_decode(ctx, tokens, 1, i, params.n_threads) != 0) {
fprintf(stderr, "error: failed to decode: %d\n", ret);
return 4;
}
}
for (int i = 0; i < 256; i++) {
if (int ret = whisper_decode(ctx, tokens, 1, i, params.n_threads) != 0) {
fprintf(stderr, "error: failed to encode model: %d\n", ret);
// batched decoding
for (int i = 0; i < 64; i++) {
if (int ret = whisper_decode(ctx, tokens, 5, 0, params.n_threads) != 0) {
fprintf(stderr, "error: failed to decode: %d\n", ret);
return 4;
}
}
// prompt processing
for (int i = 0; i < 16; i++) {
if (int ret = whisper_decode(ctx, tokens, 256, 0, params.n_threads) != 0) {
fprintf(stderr, "error: failed to decode: %d\n", ret);
return 4;
}
}

View File

@ -37,9 +37,13 @@ https://user-images.githubusercontent.com/1991296/207435352-8fc4ed3f-bde5-4555-9
The `command` tool depends on SDL2 library to capture audio from the microphone. You can build it like this:
```bash
# Install SDL2 on Linux
# Install SDL2
# On Debian based linux distributions:
sudo apt-get install libsdl2-dev
# On Fedora Linux:
sudo dnf install SDL2 SDL2-devel
# Install SDL2 on Mac OS
brew install sdl2

View File

@ -9,6 +9,7 @@
#include "common-sdl.h"
#include "common.h"
#include "whisper.h"
#include "grammar-parser.h"
#include <sstream>
#include <cassert>
@ -30,26 +31,35 @@ struct whisper_params {
int32_t max_tokens = 32;
int32_t audio_ctx = 0;
float vad_thold = 0.6f;
float freq_thold = 100.0f;
float vad_thold = 0.6f;
float freq_thold = 100.0f;
float grammar_penalty = 100.0f;
grammar_parser::parse_state grammar_parsed;
bool speed_up = false;
bool translate = false;
bool print_special = false;
bool print_energy = false;
bool no_timestamps = true;
bool use_gpu = true;
bool flash_attn = false;
std::string language = "en";
std::string model = "models/ggml-base.en.bin";
std::string fname_out;
std::string commands;
std::string prompt;
std::string context;
std::string grammar;
// A regular expression that matches tokens to suppress
std::string suppress_regex;
};
void whisper_print_usage(int argc, char ** argv, const whisper_params & params);
bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
static bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
for (int i = 1; i < argc; i++) {
std::string arg = argv[i];
@ -65,16 +75,20 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
else if (arg == "-ac" || arg == "--audio-ctx") { params.audio_ctx = std::stoi(argv[++i]); }
else if (arg == "-vth" || arg == "--vad-thold") { params.vad_thold = std::stof(argv[++i]); }
else if (arg == "-fth" || arg == "--freq-thold") { params.freq_thold = std::stof(argv[++i]); }
else if (arg == "-su" || arg == "--speed-up") { params.speed_up = true; }
else if (arg == "-tr" || arg == "--translate") { params.translate = true; }
else if (arg == "-ps" || arg == "--print-special") { params.print_special = true; }
else if (arg == "-pe" || arg == "--print-energy") { params.print_energy = true; }
else if (arg == "-ng" || arg == "--no-gpu") { params.use_gpu = false; }
else if (arg == "-fa" || arg == "--flash-attn") { params.flash_attn = true; }
else if (arg == "-l" || arg == "--language") { params.language = argv[++i]; }
else if (arg == "-m" || arg == "--model") { params.model = argv[++i]; }
else if (arg == "-f" || arg == "--file") { params.fname_out = argv[++i]; }
else if (arg == "-cmd" || arg == "--commands") { params.commands = argv[++i]; }
else if (arg == "-p" || arg == "--prompt") { params.prompt = argv[++i]; }
else if (arg == "-ctx" || arg == "--context") { params.context = argv[++i]; }
else if ( arg == "--grammar") { params.grammar = argv[++i]; }
else if ( arg == "--grammar-penalty") { params.grammar_penalty = std::stof(argv[++i]); }
else if ( arg == "--suppress-regex") { params.suppress_regex = argv[++i]; }
else {
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
whisper_print_usage(argc, argv, params);
@ -99,26 +113,41 @@ void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & para
fprintf(stderr, " -ac N, --audio-ctx N [%-7d] audio context size (0 - all)\n", params.audio_ctx);
fprintf(stderr, " -vth N, --vad-thold N [%-7.2f] voice activity detection threshold\n", params.vad_thold);
fprintf(stderr, " -fth N, --freq-thold N [%-7.2f] high-pass frequency cutoff\n", params.freq_thold);
fprintf(stderr, " -su, --speed-up [%-7s] speed up audio by x2 (reduced accuracy)\n", params.speed_up ? "true" : "false");
fprintf(stderr, " -tr, --translate [%-7s] translate from source language to english\n", params.translate ? "true" : "false");
fprintf(stderr, " -ps, --print-special [%-7s] print special tokens\n", params.print_special ? "true" : "false");
fprintf(stderr, " -pe, --print-energy [%-7s] print sound energy (for debugging)\n", params.print_energy ? "true" : "false");
fprintf(stderr, " -ng, --no-gpu [%-7s] disable GPU\n", params.use_gpu ? "false" : "true");
fprintf(stderr, " -fa, --flash-attn [%-7s] flash attention\n", params.flash_attn ? "true" : "false");
fprintf(stderr, " -l LANG, --language LANG [%-7s] spoken language\n", params.language.c_str());
fprintf(stderr, " -m FNAME, --model FNAME [%-7s] model path\n", params.model.c_str());
fprintf(stderr, " -f FNAME, --file FNAME [%-7s] text output file name\n", params.fname_out.c_str());
fprintf(stderr, " -cmd FNAME, --commands FNAME [%-7s] text file with allowed commands\n", params.commands.c_str());
fprintf(stderr, " -p, --prompt [%-7s] the required activation prompt\n", params.prompt.c_str());
fprintf(stderr, " -ctx, --context [%-7s] sample text to help the transcription\n", params.context.c_str());
fprintf(stderr, " --grammar GRAMMAR [%-7s] GBNF grammar to guide decoding\n", params.grammar.c_str());
fprintf(stderr, " --grammar-penalty N [%-7.1f] scales down logits of nongrammar tokens\n", params.grammar_penalty);
fprintf(stderr, " --suppress-regex REGEX [%-7s] regular expression matching tokens to suppress\n", params.suppress_regex.c_str());
fprintf(stderr, "\n");
}
std::string transcribe(whisper_context * ctx, const whisper_params & params, const std::vector<float> & pcmf32, float & prob, int64_t & t_ms) {
static std::string transcribe(
whisper_context * ctx,
const whisper_params & params,
const std::vector<float> & pcmf32,
const std::string & grammar_rule,
float & logprob_min,
float & logprob_sum,
int & n_tokens,
int64_t & t_ms) {
const auto t_start = std::chrono::high_resolution_clock::now();
prob = 0.0f;
logprob_min = 0.0f;
logprob_sum = 0.0f;
n_tokens = 0;
t_ms = 0;
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
//whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
wparams.print_progress = false;
wparams.print_special = params.print_special;
@ -126,19 +155,42 @@ std::string transcribe(whisper_context * ctx, const whisper_params & params, con
wparams.print_timestamps = !params.no_timestamps;
wparams.translate = params.translate;
wparams.no_context = true;
wparams.no_timestamps = params.no_timestamps;
wparams.single_segment = true;
wparams.max_tokens = params.max_tokens;
wparams.language = params.language.c_str();
wparams.n_threads = params.n_threads;
wparams.audio_ctx = params.audio_ctx;
wparams.speed_up = params.speed_up;
wparams.audio_ctx = params.audio_ctx;
wparams.temperature = 0.4f;
wparams.temperature_inc = 1.0f;
wparams.greedy.best_of = 5;
wparams.beam_search.beam_size = 5;
wparams.initial_prompt = params.context.data();
wparams.suppress_regex = params.suppress_regex.c_str();
const auto & grammar_parsed = params.grammar_parsed;
auto grammar_rules = grammar_parsed.c_rules();
if (!params.grammar_parsed.rules.empty() && !grammar_rule.empty()) {
if (grammar_parsed.symbol_ids.find(grammar_rule) == grammar_parsed.symbol_ids.end()) {
fprintf(stderr, "%s: warning: grammar rule '%s' not found - skipping grammar sampling\n", __func__, grammar_rule.c_str());
} else {
wparams.grammar_rules = grammar_rules.data();
wparams.n_grammar_rules = grammar_rules.size();
wparams.i_start_rule = grammar_parsed.symbol_ids.at(grammar_rule);
wparams.grammar_penalty = params.grammar_penalty;
}
}
if (whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) != 0) {
return "";
}
int prob_n = 0;
std::string result;
const int n_segments = whisper_full_n_segments(ctx);
@ -147,26 +199,24 @@ std::string transcribe(whisper_context * ctx, const whisper_params & params, con
result += text;
const int n_tokens = whisper_full_n_tokens(ctx, i);
for (int j = 0; j < n_tokens; ++j) {
const int n = whisper_full_n_tokens(ctx, i);
for (int j = 0; j < n; ++j) {
const auto token = whisper_full_get_token_data(ctx, i, j);
prob += token.p;
++prob_n;
if(token.plog > 0.0f) exit(0);
logprob_min = std::min(logprob_min, token.plog);
logprob_sum += token.plog;
++n_tokens;
}
}
if (prob_n > 0) {
prob /= prob_n;
}
const auto t_end = std::chrono::high_resolution_clock::now();
t_ms = std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_start).count();
return result;
}
std::vector<std::string> read_allowed_commands(const std::string & fname) {
static std::vector<std::string> read_allowed_commands(const std::string & fname) {
std::vector<std::string> allowed_commands;
std::ifstream ifs(fname);
@ -188,7 +238,7 @@ std::vector<std::string> read_allowed_commands(const std::string & fname) {
return allowed_commands;
}
std::vector<std::string> get_words(const std::string &txt) {
static std::vector<std::string> get_words(const std::string &txt) {
std::vector<std::string> words;
std::istringstream iss(txt);
@ -202,7 +252,7 @@ std::vector<std::string> get_words(const std::string &txt) {
// command-list mode
// guide the transcription to match the most likely command from a provided list
int process_command_list(struct whisper_context * ctx, audio_async &audio, const whisper_params &params) {
static int process_command_list(struct whisper_context * ctx, audio_async &audio, const whisper_params &params) {
fprintf(stderr, "\n");
fprintf(stderr, "%s: guided mode\n", __func__);
@ -250,7 +300,7 @@ int process_command_list(struct whisper_context * ctx, audio_async &audio, const
fprintf(stderr, " ]\n");
}
std::string k_prompt = "select one from the available words: ";
std::string k_prompt = "select one from the available words: ";
for (int i = 0; i < (int) allowed_commands.size(); ++i) {
if (i > 0) {
k_prompt += ", ";
@ -317,7 +367,6 @@ int process_command_list(struct whisper_context * ctx, audio_async &audio, const
wparams.n_threads = params.n_threads;
wparams.audio_ctx = params.audio_ctx;
wparams.speed_up = params.speed_up;
wparams.prompt_tokens = k_tokens.data();
wparams.prompt_n_tokens = k_tokens.size();
@ -414,11 +463,13 @@ int process_command_list(struct whisper_context * ctx, audio_async &audio, const
// always-prompt mode
// transcribe the voice into text after valid prompt
int always_prompt_transcription(struct whisper_context * ctx, audio_async & audio, const whisper_params & params) {
static int always_prompt_transcription(struct whisper_context * ctx, audio_async & audio, const whisper_params & params) {
bool is_running = true;
bool ask_prompt = true;
float prob = 0.0f;
float logprob_min = 0.0f;
float logprob_sum = 0.0f;
int n_tokens = 0;
std::vector<float> pcmf32_cur;
@ -456,7 +507,7 @@ int always_prompt_transcription(struct whisper_context * ctx, audio_async & audi
// detect the commands
audio.get(params.command_ms, pcmf32_cur);
const auto txt = ::trim(::transcribe(ctx, params, pcmf32_cur, prob, t_ms));
const auto txt = ::trim(::transcribe(ctx, params, pcmf32_cur, "", logprob_min, logprob_sum, n_tokens, t_ms));
const auto words = get_words(txt);
@ -492,18 +543,27 @@ int always_prompt_transcription(struct whisper_context * ctx, audio_async & audi
// general-purpose mode
// freely transcribe the voice into text
int process_general_transcription(struct whisper_context * ctx, audio_async &audio, const whisper_params &params) {
static int process_general_transcription(struct whisper_context * ctx, audio_async & audio, const whisper_params & params) {
bool is_running = true;
bool have_prompt = false;
bool ask_prompt = true;
float prob0 = 0.0f;
float prob = 0.0f;
float logprob_min0 = 0.0f;
float logprob_min = 0.0f;
float logprob_sum0 = 0.0f;
float logprob_sum = 0.0f;
int n_tokens0 = 0;
int n_tokens = 0;
std::vector<float> pcmf32_cur;
std::vector<float> pcmf32_prompt;
const std::string k_prompt = "Ok Whisper, start listening for commands.";
std::string k_prompt = "Ok Whisper, start listening for commands.";
if (!params.prompt.empty()) {
k_prompt = params.prompt;
}
fprintf(stderr, "\n");
fprintf(stderr, "%s: general-purpose mode\n", __func__);
@ -536,9 +596,11 @@ int process_general_transcription(struct whisper_context * ctx, audio_async &aud
// wait for activation phrase
audio.get(params.prompt_ms, pcmf32_cur);
const auto txt = ::trim(::transcribe(ctx, params, pcmf32_cur, prob0, t_ms));
const auto txt = ::trim(::transcribe(ctx, params, pcmf32_cur, "prompt", logprob_min0, logprob_sum0, n_tokens0, t_ms));
fprintf(stdout, "%s: Heard '%s%s%s', (t = %d ms)\n", __func__, "\033[1m", txt.c_str(), "\033[0m", (int) t_ms);
const float p = 100.0f * std::exp(logprob_min0);
fprintf(stdout, "%s: Heard '%s%s%s', (t = %d ms, p = %.2f%%)\n", __func__, "\033[1m", txt.c_str(), "\033[0m", (int) t_ms, p);
const float sim = similarity(txt, k_prompt);
@ -559,19 +621,30 @@ int process_general_transcription(struct whisper_context * ctx, audio_async &aud
// we have heard the activation phrase, now detect the commands
audio.get(params.command_ms, pcmf32_cur);
//printf("len prompt: %.4f\n", pcmf32_prompt.size() / (float) WHISPER_SAMPLE_RATE);
//printf("len command: %.4f\n", pcmf32_cur.size() / (float) WHISPER_SAMPLE_RATE);
// prepend 3 second of silence
pcmf32_cur.insert(pcmf32_cur.begin(), 3.0f*WHISPER_SAMPLE_RATE, 0.0f);
// prepend the prompt audio
pcmf32_cur.insert(pcmf32_cur.begin(), pcmf32_prompt.begin(), pcmf32_prompt.end());
const auto txt = ::trim(::transcribe(ctx, params, pcmf32_cur, prob, t_ms));
const auto txt = ::trim(::transcribe(ctx, params, pcmf32_cur, "root", logprob_min, logprob_sum, n_tokens, t_ms));
prob = 100.0f*(prob - prob0);
//const float p = 100.0f * std::exp((logprob - logprob0) / (n_tokens - n_tokens0));
const float p = 100.0f * std::exp(logprob_min);
//fprintf(stdout, "%s: heard '%s'\n", __func__, txt.c_str());
// find the prompt in the text
float best_sim = 0.0f;
size_t best_len = 0;
for (int n = 0.8*k_prompt.size(); n <= 1.2*k_prompt.size(); ++n) {
for (size_t n = 0.8*k_prompt.size(); n <= 1.2*k_prompt.size(); ++n) {
if (n >= txt.size()) {
break;
}
const auto prompt = txt.substr(0, n);
const float sim = similarity(prompt, k_prompt);
@ -584,9 +657,16 @@ int process_general_transcription(struct whisper_context * ctx, audio_async &aud
}
}
const std::string command = ::trim(txt.substr(best_len));
fprintf(stdout, "%s: DEBUG: txt = '%s', prob = %.2f%%\n", __func__, txt.c_str(), p);
if (best_len == 0) {
fprintf(stdout, "%s: WARNING: command not recognized, try again\n", __func__);
} else {
// cut the prompt from the decoded text
const std::string command = ::trim(txt.substr(best_len));
fprintf(stdout, "%s: Command '%s%s%s', (t = %d ms)\n", __func__, "\033[1m", command.c_str(), "\033[0m", (int) t_ms);
}
fprintf(stdout, "%s: Command '%s%s%s', (t = %d ms)\n", __func__, "\033[1m", command.c_str(), "\033[0m", (int) t_ms);
fprintf(stdout, "\n");
}
@ -613,8 +693,10 @@ int main(int argc, char ** argv) {
// whisper init
struct whisper_context_params cparams;
cparams.use_gpu = params.use_gpu;
struct whisper_context_params cparams = whisper_context_default_params();
cparams.use_gpu = params.use_gpu;
cparams.flash_attn = params.flash_attn;
struct whisper_context * ctx = whisper_init_from_file_with_params(params.model.c_str(), cparams);
@ -654,12 +736,36 @@ int main(int argc, char ** argv) {
int ret_val = 0;
if (!params.commands.empty()) {
ret_val = process_command_list(ctx, audio, params);
} else if (!params.prompt.empty()) {
ret_val = always_prompt_transcription(ctx, audio, params);
} else {
ret_val = process_general_transcription(ctx, audio, params);
if (!params.grammar.empty()) {
auto & grammar = params.grammar_parsed;
if (is_file_exist(params.grammar.c_str())) {
// read grammar from file
std::ifstream ifs(params.grammar.c_str());
const std::string txt = std::string((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
grammar = grammar_parser::parse(txt.c_str());
} else {
// read grammar from string
grammar = grammar_parser::parse(params.grammar.c_str());
}
// will be empty (default) if there are parse errors
if (grammar.rules.empty()) {
ret_val = 1;
} else {
fprintf(stderr, "%s: grammar:\n", __func__);
grammar_parser::print_grammar(stderr, grammar);
fprintf(stderr, "\n");
}
}
if (ret_val == 0) {
if (!params.commands.empty()) {
ret_val = process_command_list(ctx, audio, params);
} else if (!params.prompt.empty() && params.grammar_parsed.rules.empty()) {
ret_val = always_prompt_transcription(ctx, audio, params);
} else {
ret_val = process_general_transcription(ctx, audio, params);
}
}
audio.pause();

View File

@ -9,6 +9,11 @@ static const std::map<std::string, enum ggml_ftype> GGML_FTYPE_MAP = {
{"q5_0", GGML_FTYPE_MOSTLY_Q5_0},
{"q5_1", GGML_FTYPE_MOSTLY_Q5_1},
{"q8_0", GGML_FTYPE_MOSTLY_Q8_0},
{"q2_k", GGML_FTYPE_MOSTLY_Q2_K},
{"q3_k", GGML_FTYPE_MOSTLY_Q3_K},
{"q4_k", GGML_FTYPE_MOSTLY_Q4_K},
{"q5_k", GGML_FTYPE_MOSTLY_Q5_K},
{"q6_k", GGML_FTYPE_MOSTLY_Q6_K},
};
void ggml_print_ftypes(FILE * fp) {
@ -48,15 +53,28 @@ bool ggml_common_quantize_0(
case GGML_FTYPE_MOSTLY_Q5_0: qtype = GGML_TYPE_Q5_0; break;
case GGML_FTYPE_MOSTLY_Q5_1: qtype = GGML_TYPE_Q5_1; break;
case GGML_FTYPE_MOSTLY_Q8_0: qtype = GGML_TYPE_Q8_0; break;
case GGML_FTYPE_MOSTLY_Q2_K: qtype = GGML_TYPE_Q2_K; break;
case GGML_FTYPE_MOSTLY_Q3_K: qtype = GGML_TYPE_Q3_K; break;
case GGML_FTYPE_MOSTLY_Q4_K: qtype = GGML_TYPE_Q4_K; break;
case GGML_FTYPE_MOSTLY_Q5_K: qtype = GGML_TYPE_Q5_K; break;
case GGML_FTYPE_MOSTLY_Q6_K: qtype = GGML_TYPE_Q6_K; break;
case GGML_FTYPE_UNKNOWN:
case GGML_FTYPE_ALL_F32:
case GGML_FTYPE_MOSTLY_F16:
case GGML_FTYPE_MOSTLY_Q4_1_SOME_F16:
case GGML_FTYPE_MOSTLY_Q2_K:
case GGML_FTYPE_MOSTLY_Q3_K:
case GGML_FTYPE_MOSTLY_Q4_K:
case GGML_FTYPE_MOSTLY_Q5_K:
case GGML_FTYPE_MOSTLY_Q6_K:
case GGML_FTYPE_MOSTLY_IQ2_XXS:
case GGML_FTYPE_MOSTLY_IQ2_XS:
case GGML_FTYPE_MOSTLY_IQ2_S:
case GGML_FTYPE_MOSTLY_IQ3_XXS:
case GGML_FTYPE_MOSTLY_IQ3_S:
case GGML_FTYPE_MOSTLY_IQ1_S:
case GGML_FTYPE_MOSTLY_IQ4_NL:
case GGML_FTYPE_MOSTLY_IQ4_XS:
case GGML_FTYPE_MOSTLY_IQ1_M:
case GGML_FTYPE_MOSTLY_BF16:
case GGML_FTYPE_MOSTLY_Q4_0_4_4:
case GGML_FTYPE_MOSTLY_Q4_0_4_8:
case GGML_FTYPE_MOSTLY_Q4_0_8_8:
{
fprintf(stderr, "%s: invalid model type %d\n", __func__, ftype);
return false;
@ -77,8 +95,6 @@ bool ggml_common_quantize_0(
std::vector<ggml_fp16_t> data_f16;
std::vector<float> data_f32;
std::vector<int64_t> hist_all(1 << 4, 0);
while (true) {
int32_t n_dims;
int32_t length;
@ -163,41 +179,44 @@ bool ggml_common_quantize_0(
work.resize(nelements); // for quantization
size_t cur_size = 0;
std::vector<int64_t> hist_cur(1 << 4, 0);
switch ((ggml_type) ttype) {
case GGML_TYPE_Q4_0:
{
cur_size = ggml_quantize_q4_0(data_f32.data(), work.data(), nelements, ne[0], hist_cur.data());
} break;
case GGML_TYPE_Q4_1:
{
cur_size = ggml_quantize_q4_1(data_f32.data(), work.data(), nelements, ne[0], hist_cur.data());
} break;
case GGML_TYPE_Q5_0:
{
cur_size = ggml_quantize_q5_0(data_f32.data(), work.data(), nelements, ne[0], hist_cur.data());
} break;
case GGML_TYPE_Q5_1:
{
cur_size = ggml_quantize_q5_1(data_f32.data(), work.data(), nelements, ne[0], hist_cur.data());
} break;
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
case GGML_TYPE_Q6_K:
{
cur_size = ggml_quantize_q8_0(data_f32.data(), work.data(), nelements, ne[0], hist_cur.data());
cur_size = ggml_quantize_chunk((ggml_type) ttype, data_f32.data(), work.data(), 0, nelements/ne[0], ne[0], nullptr);
} break;
case GGML_TYPE_F32:
case GGML_TYPE_F16:
case GGML_TYPE_I8:
case GGML_TYPE_I16:
case GGML_TYPE_I32:
case GGML_TYPE_I64:
case GGML_TYPE_F64:
case GGML_TYPE_Q8_1:
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
case GGML_TYPE_Q6_K:
case GGML_TYPE_Q8_K:
case GGML_TYPE_IQ2_XXS:
case GGML_TYPE_IQ2_XS:
case GGML_TYPE_IQ2_S:
case GGML_TYPE_IQ3_XXS:
case GGML_TYPE_IQ3_S:
case GGML_TYPE_IQ1_S:
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_IQ4_XS:
case GGML_TYPE_IQ1_M:
case GGML_TYPE_BF16:
case GGML_TYPE_Q4_0_4_4:
case GGML_TYPE_Q4_0_4_8:
case GGML_TYPE_Q4_0_8_8:
case GGML_TYPE_TQ1_0:
case GGML_TYPE_TQ2_0:
case GGML_TYPE_COUNT:
{
fprintf(stderr, "%s: unsupported quantization type %d (%s)\n", __func__, ttype, ggml_type_name((ggml_type) ttype));
@ -208,15 +227,7 @@ bool ggml_common_quantize_0(
fout.write(reinterpret_cast<char *>(work.data()), cur_size);
total_size_new += cur_size;
printf("size = %8.2f MB -> %8.2f MB | hist: ", nelements * sizeof(float)/1024.0/1024.0, cur_size/1024.0/1024.0);
for (int i = 0; i < (int) hist_cur.size(); ++i) {
hist_all[i] += hist_cur[i];
}
for (int i = 0; i < (int) hist_cur.size(); ++i) {
printf("%5.3f ", hist_cur[i] / (float)nelements);
}
printf("\n");
printf("size = %8.2f MB -> %8.2f MB\n", nelements * sizeof(float)/1024.0/1024.0, cur_size/1024.0/1024.0);
} else {
printf("size = %8.3f MB\n", data_u8.size()/1024.0/1024.0);
fout.write(reinterpret_cast<char *>(data_u8.data()), data_u8.size());
@ -229,18 +240,5 @@ bool ggml_common_quantize_0(
printf("%s: model size = %8.2f MB\n", __func__, total_size_org/1024.0/1024.0);
printf("%s: quant size = %8.2f MB | ftype = %d (%s)\n", __func__, total_size_new/1024.0/1024.0, ftype, ggml_type_name(qtype));
{
int64_t sum_all = 0;
for (int i = 0; i < (int) hist_all.size(); ++i) {
sum_all += hist_all[i];
}
printf("%s: hist: ", __func__);
for (int i = 0; i < (int) hist_all.size(); ++i) {
printf("%5.3f ", hist_all[i] / (float)sum_all);
}
printf("\n");
}
return true;
}

View File

@ -139,10 +139,13 @@ void audio_async::callback(uint8_t * stream, int len) {
return;
}
const size_t n_samples = len / sizeof(float);
size_t n_samples = len / sizeof(float);
m_audio_new.resize(n_samples);
memcpy(m_audio_new.data(), stream, n_samples * sizeof(float));
if (n_samples > m_audio.size()) {
n_samples = m_audio.size();
stream += (len - (n_samples * sizeof(float)));
}
//fprintf(stderr, "%s: %zu samples, pos %zu, len %zu\n", __func__, n_samples, m_audio_pos, m_audio_len);
@ -153,7 +156,7 @@ void audio_async::callback(uint8_t * stream, int len) {
const size_t n0 = m_audio.size() - m_audio_pos;
memcpy(&m_audio[m_audio_pos], stream, n0 * sizeof(float));
memcpy(&m_audio[0], &stream[n0], (n_samples - n0) * sizeof(float));
memcpy(&m_audio[0], stream + n0 * sizeof(float), (n_samples - n0) * sizeof(float));
m_audio_pos = (m_audio_pos + n_samples) % m_audio.size();
m_audio_len = m_audio.size();
@ -216,7 +219,7 @@ bool sdl_poll_events() {
case SDL_QUIT:
{
return false;
} break;
}
default:
break;
}

View File

@ -41,7 +41,6 @@ private:
std::mutex m_mutex;
std::vector<float> m_audio;
std::vector<float> m_audio_new;
size_t m_audio_pos = 0;
size_t m_audio_len = 0;
};

View File

@ -19,8 +19,18 @@
#pragma warning(disable: 4244 4267) // possible loss of data
#endif
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif
#ifdef WHISPER_FFMPEG
// as implemented in ffmpeg_trancode.cpp only embedded in common lib if whisper built with ffmpeg support
extern bool ffmpeg_decode_audio(const std::string & ifname, std::vector<uint8_t> & wav_data);
#endif
// Function to check if the next argument exists
std::string get_next_arg(int& i, int argc, char** argv, const std::string& flag, gpt_params& params) {
static std::string get_next_arg(int& i, int argc, char** argv, const std::string& flag, gpt_params& params) {
if (i + 1 < argc && argv[i + 1][0] != '-') {
return argv[++i];
} else {
@ -137,7 +147,6 @@ std::string gpt_random_prompt(std::mt19937 & rng) {
case 7: return "He";
case 8: return "She";
case 9: return "They";
default: return "To";
}
return "The";
@ -336,7 +345,7 @@ std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::stri
return tokens;
}
std::vector<gpt_vocab::id> parse_tokens_from_string(const std::string& input, char delimiter) {
static std::vector<gpt_vocab::id> parse_tokens_from_string(const std::string& input, char delimiter) {
std::vector<gpt_vocab::id> output;
std::stringstream ss(input);
std::string token;
@ -348,7 +357,7 @@ std::vector<gpt_vocab::id> parse_tokens_from_string(const std::string& input, ch
return output;
}
std::map<std::string, std::vector<gpt_vocab::id>> extract_tests_from_file(const std::string & fpath_test){
static std::map<std::string, std::vector<gpt_vocab::id>> extract_tests_from_file(const std::string & fpath_test){
if (fpath_test.empty()){
fprintf(stderr, "%s : No test file found.\n", __func__);
return std::map<std::string, std::vector<gpt_vocab::id>>();
@ -615,12 +624,31 @@ gpt_vocab::id gpt_sample_top_k_top_p_repeat(
}
bool is_wav_buffer(const std::string buf) {
// RIFF ref: https://en.wikipedia.org/wiki/Resource_Interchange_File_Format
// WAV ref: https://www.mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html
if (buf.size() < 12 || buf.substr(0, 4) != "RIFF" || buf.substr(8, 4) != "WAVE") {
return false;
}
uint32_t chunk_size = *reinterpret_cast<const uint32_t*>(buf.data() + 4);
if (chunk_size + 8 != buf.size()) {
return false;
}
return true;
}
bool read_wav(const std::string & fname, std::vector<float>& pcmf32, std::vector<std::vector<float>>& pcmf32s, bool stereo) {
drwav wav;
std::vector<uint8_t> wav_data; // used for pipe input from stdin
std::vector<uint8_t> wav_data; // used for pipe input from stdin or ffmpeg decoding output
if (fname == "-") {
{
#ifdef _WIN32
_setmode(_fileno(stdin), _O_BINARY);
#endif
uint8_t buf[1024];
while (true)
{
@ -639,28 +667,49 @@ bool read_wav(const std::string & fname, std::vector<float>& pcmf32, std::vector
fprintf(stderr, "%s: read %zu bytes from stdin\n", __func__, wav_data.size());
}
else if (is_wav_buffer(fname)) {
if (drwav_init_memory(&wav, fname.c_str(), fname.size(), nullptr) == false) {
fprintf(stderr, "error: failed to open WAV file from fname buffer\n");
return false;
}
}
else if (drwav_init_file(&wav, fname.c_str(), nullptr) == false) {
#if defined(WHISPER_FFMPEG)
if (ffmpeg_decode_audio(fname, wav_data) != 0) {
fprintf(stderr, "error: failed to ffmpeg decode '%s' \n", fname.c_str());
return false;
}
if (drwav_init_memory(&wav, wav_data.data(), wav_data.size(), nullptr) == false) {
fprintf(stderr, "error: failed to read wav data as wav \n");
return false;
}
#else
fprintf(stderr, "error: failed to open '%s' as WAV file\n", fname.c_str());
return false;
#endif
}
if (wav.channels != 1 && wav.channels != 2) {
fprintf(stderr, "%s: WAV file '%s' must be mono or stereo\n", __func__, fname.c_str());
drwav_uninit(&wav);
return false;
}
if (stereo && wav.channels != 2) {
fprintf(stderr, "%s: WAV file '%s' must be stereo for diarization\n", __func__, fname.c_str());
drwav_uninit(&wav);
return false;
}
if (wav.sampleRate != COMMON_SAMPLE_RATE) {
fprintf(stderr, "%s: WAV file '%s' must be %i kHz\n", __func__, fname.c_str(), COMMON_SAMPLE_RATE/1000);
drwav_uninit(&wav);
return false;
}
if (wav.bitsPerSample != 16) {
fprintf(stderr, "%s: WAV file '%s' must be 16-bit\n", __func__, fname.c_str());
drwav_uninit(&wav);
return false;
}
@ -815,3 +864,48 @@ void sam_print_usage(int /*argc*/, char ** argv, const sam_params & params) {
fprintf(stderr, " output file (default: %s)\n", params.fname_out.c_str());
fprintf(stderr, "\n");
}
// 500 -> 00:05.000
// 6000 -> 01:00.000
std::string to_timestamp(int64_t t, bool comma) {
int64_t msec = t * 10;
int64_t hr = msec / (1000 * 60 * 60);
msec = msec - hr * (1000 * 60 * 60);
int64_t min = msec / (1000 * 60);
msec = msec - min * (1000 * 60);
int64_t sec = msec / 1000;
msec = msec - sec * 1000;
char buf[32];
snprintf(buf, sizeof(buf), "%02d:%02d:%02d%s%03d", (int) hr, (int) min, (int) sec, comma ? "," : ".", (int) msec);
return std::string(buf);
}
int timestamp_to_sample(int64_t t, int n_samples, int whisper_sample_rate) {
return std::max(0, std::min((int) n_samples - 1, (int) ((t*whisper_sample_rate)/100)));
}
bool is_file_exist(const char *fileName)
{
std::ifstream infile(fileName);
return infile.good();
}
bool speak_with_file(const std::string & command, const std::string & text, const std::string & path, int voice_id)
{
std::ofstream speak_file(path.c_str());
if (speak_file.fail()) {
fprintf(stderr, "%s: failed to open speak_file\n", __func__);
return false;
} else {
speak_file.write(text.c_str(), text.size());
speak_file.close();
int ret = system((command + " " + std::to_string(voice_id) + " " + path).c_str());
if (ret != 0) {
fprintf(stderr, "%s: failed to speak\n", __func__);
return false;
}
}
return true;
}

View File

@ -9,6 +9,7 @@
#include <thread>
#include <ctime>
#include <fstream>
#include <sstream>
#define COMMON_SAMPLE_RATE 16000
@ -21,7 +22,7 @@ struct gpt_params {
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
int32_t n_predict = 200; // new tokens to predict
int32_t n_parallel = 1; // number of parallel streams
int32_t n_batch = 8; // batch size for prompt processing
int32_t n_batch = 32; // batch size for prompt processing
int32_t n_ctx = 2048; // context size (this is the KV cache max size)
int32_t n_gpu_layers = 0; // number of layers to offlload to the GPU
@ -135,7 +136,11 @@ gpt_vocab::id gpt_sample_top_k_top_p_repeat(
// Audio utils
//
// Check if a buffer is a WAV audio file
bool is_wav_buffer(const std::string buf);
// Read WAV audio file and store the PCM data into pcmf32
// fname can be a buffer of WAV data instead of a filename
// The sample rate of the audio must be equal to COMMON_SAMPLE_RATE
// If stereo flag is set and the audio has 2 channels, the pcmf32s will contain 2 channel PCM
bool read_wav(
@ -181,7 +186,7 @@ private:
// It is assumed that PCM data is normalized to a range from -1 to 1
bool write_audio(const float * data, size_t length) {
for (size_t i = 0; i < length; ++i) {
const auto intSample = static_cast<const int16_t>(data[i] * 32767);
const int16_t intSample = int16_t(data[i] * 32767);
file.write(reinterpret_cast<const char *>(&intSample), sizeof(int16_t));
dataSize += sizeof(int16_t);
}
@ -277,3 +282,62 @@ struct sam_params {
bool sam_params_parse(int argc, char ** argv, sam_params & params);
void sam_print_usage(int argc, char ** argv, const sam_params & params);
//
// Terminal utils
//
#define SQR(X) ((X) * (X))
#define UNCUBE(x) x < 48 ? 0 : x < 115 ? 1 : (x - 35) / 40
/**
* Quantizes 24-bit RGB to xterm256 code range [16,256).
*/
static int rgb2xterm256(int r, int g, int b) {
unsigned char cube[] = {0, 0137, 0207, 0257, 0327, 0377};
int av, ir, ig, ib, il, qr, qg, qb, ql;
av = r * .299 + g * .587 + b * .114 + .5;
ql = (il = av > 238 ? 23 : (av - 3) / 10) * 10 + 8;
qr = cube[(ir = UNCUBE(r))];
qg = cube[(ig = UNCUBE(g))];
qb = cube[(ib = UNCUBE(b))];
if (SQR(qr - r) + SQR(qg - g) + SQR(qb - b) <=
SQR(ql - r) + SQR(ql - g) + SQR(ql - b))
return ir * 36 + ig * 6 + ib + 020;
return il + 0350;
}
static std::string set_xterm256_foreground(int r, int g, int b) {
int x = rgb2xterm256(r, g, b);
std::ostringstream oss;
oss << "\033[38;5;" << x << "m";
return oss.str();
}
// Lowest is red, middle is yellow, highest is green. Color scheme from
// Paul Tol; it is colorblind friendly https://personal.sron.nl/~pault/
const std::vector<std::string> k_colors = {
set_xterm256_foreground(220, 5, 12),
set_xterm256_foreground(232, 96, 28),
set_xterm256_foreground(241, 147, 45),
set_xterm256_foreground(246, 193, 65),
set_xterm256_foreground(247, 240, 86),
set_xterm256_foreground(144, 201, 135),
set_xterm256_foreground( 78, 178, 101),
};
//
// Other utils
//
// convert timestamp to string, 6000 -> 01:00.000
std::string to_timestamp(int64_t t, bool comma = false);
// given a timestamp get the sample
int timestamp_to_sample(int64_t t, int n_samples, int whisper_sample_rate);
// check if file exists using ifstream
bool is_file_exist(const char *fileName);
// write text to file, and call system("command voice_id file")
bool speak_with_file(const std::string & command, const std::string & text, const std::string & path, int voice_id);

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,348 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
* transcode.c - convert audio file to WAVE
*
* Copyright (C) 2019 Andrew Clayton <andrew@digital-domain.net>
* Copyright (C) 2024 William Tambellini <william.tambellini@gmail.com>
*/
// Just for conveninent C++ API
#include <vector>
#include <string>
// C
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
extern "C" {
#include <libavutil/opt.h>
#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
#include <libswresample/swresample.h>
}
typedef uint64_t u64;
typedef int64_t s64;
typedef uint32_t u32;
typedef int32_t s32;
typedef uint16_t u16;
typedef int16_t s16;
typedef uint8_t u8;
typedef int8_t s8;
#define WAVE_SAMPLE_RATE 16000
#define AVIO_CTX_BUF_SZ 4096
static const char* ffmpegLog = getenv("FFMPEG_LOG");
// Todo: add __FILE__ __LINE__
#define LOG(...) \
do { if (ffmpegLog) fprintf(stderr, __VA_ARGS__); } while(0) // C99
/*
* WAVE file header based on definition from
* https://gist.github.com/Jon-Schneider/8b7c53d27a7a13346a643dac9c19d34f
*
* We must ensure this structure doesn't have any holes or
* padding so we can just map it straight to the WAVE data.
*/
struct wave_hdr {
/* RIFF Header: "RIFF" */
char riff_header[4];
/* size of audio data + sizeof(struct wave_hdr) - 8 */
int wav_size;
/* "WAVE" */
char wav_header[4];
/* Format Header */
/* "fmt " (includes trailing space) */
char fmt_header[4];
/* Should be 16 for PCM */
int fmt_chunk_size;
/* Should be 1 for PCM. 3 for IEEE Float */
s16 audio_format;
s16 num_channels;
int sample_rate;
/*
* Number of bytes per second
* sample_rate * num_channels * bit_depth/8
*/
int byte_rate;
/* num_channels * bytes per sample */
s16 sample_alignment;
/* bits per sample */
s16 bit_depth;
/* Data Header */
/* "data" */
char data_header[4];
/*
* size of audio
* number of samples * num_channels * bit_depth/8
*/
int data_bytes;
} __attribute__((__packed__));
struct audio_buffer {
u8 *ptr;
int size; /* size left in the buffer */
};
static void set_wave_hdr(wave_hdr& wh, size_t size) {
memcpy(&wh.riff_header, "RIFF", 4);
wh.wav_size = size + sizeof(struct wave_hdr) - 8;
memcpy(&wh.wav_header, "WAVE", 4);
memcpy(&wh.fmt_header, "fmt ", 4);
wh.fmt_chunk_size = 16;
wh.audio_format = 1;
wh.num_channels = 1;
wh.sample_rate = WAVE_SAMPLE_RATE;
wh.sample_alignment = 2;
wh.bit_depth = 16;
wh.byte_rate = wh.sample_rate * wh.sample_alignment;
memcpy(&wh.data_header, "data", 4);
wh.data_bytes = size;
}
static void write_wave_hdr(int fd, size_t size) {
struct wave_hdr wh;
set_wave_hdr(wh, size);
write(fd, &wh, sizeof(struct wave_hdr));
}
static int map_file(int fd, u8 **ptr, size_t *size)
{
struct stat sb;
fstat(fd, &sb);
*size = sb.st_size;
*ptr = (u8*)mmap(NULL, *size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
if (*ptr == MAP_FAILED) {
perror("mmap");
return -1;
}
return 0;
}
static int read_packet(void *opaque, u8 *buf, int buf_size)
{
struct audio_buffer *audio_buf = (audio_buffer*)opaque;
buf_size = FFMIN(buf_size, audio_buf->size);
/* copy internal buffer data to buf */
memcpy(buf, audio_buf->ptr, buf_size);
audio_buf->ptr += buf_size;
audio_buf->size -= buf_size;
return buf_size;
}
static void convert_frame(struct SwrContext *swr, AVCodecContext *codec,
AVFrame *frame, s16 **data, int *size, bool flush)
{
int nr_samples;
s64 delay;
u8 *buffer;
delay = swr_get_delay(swr, codec->sample_rate);
nr_samples = av_rescale_rnd(delay + frame->nb_samples,
WAVE_SAMPLE_RATE, codec->sample_rate,
AV_ROUND_UP);
av_samples_alloc(&buffer, NULL, 1, nr_samples, AV_SAMPLE_FMT_S16, 0);
/*
* !flush is used to check if we are flushing any remaining
* conversion buffers...
*/
nr_samples = swr_convert(swr, &buffer, nr_samples,
!flush ? (const u8 **)frame->data : NULL,
!flush ? frame->nb_samples : 0);
*data = (s16*)realloc(*data, (*size + nr_samples) * sizeof(s16));
memcpy(*data + *size, buffer, nr_samples * sizeof(s16));
*size += nr_samples;
av_freep(&buffer);
}
static bool is_audio_stream(const AVStream *stream)
{
if (stream->codecpar->codec_type == AVMEDIA_TYPE_AUDIO)
return true;
return false;
}
// Return non zero on error, 0 on success
// audio_buffer: input memory
// data: decoded output audio data (wav file)
// size: size of output data
static int decode_audio(struct audio_buffer *audio_buf, s16 **data, int *size)
{
LOG("decode_audio: input size: %d\n", audio_buf->size);
AVFormatContext *fmt_ctx;
AVIOContext *avio_ctx;
AVStream *stream;
AVCodecContext *codec;
AVPacket packet;
AVFrame *frame;
struct SwrContext *swr;
u8 *avio_ctx_buffer;
unsigned int i;
int stream_index = -1;
int err;
const size_t errbuffsize = 1024;
char errbuff[errbuffsize];
fmt_ctx = avformat_alloc_context();
avio_ctx_buffer = (u8*)av_malloc(AVIO_CTX_BUF_SZ);
LOG("Creating an avio context: AVIO_CTX_BUF_SZ=%d\n", AVIO_CTX_BUF_SZ);
avio_ctx = avio_alloc_context(avio_ctx_buffer, AVIO_CTX_BUF_SZ, 0, audio_buf, &read_packet, NULL, NULL);
fmt_ctx->pb = avio_ctx;
// open the input stream and read header
err = avformat_open_input(&fmt_ctx, NULL, NULL, NULL);
if (err) {
LOG("Could not read audio buffer: %d: %s\n", err, av_make_error_string(errbuff, errbuffsize, err));
return err;
}
err = avformat_find_stream_info(fmt_ctx, NULL);
if (err < 0) {
LOG("Could not retrieve stream info from audio buffer: %d\n", err);
return err;
}
for (i = 0; i < fmt_ctx->nb_streams; i++) {
if (is_audio_stream(fmt_ctx->streams[i])) {
stream_index = i;
break;
}
}
if (stream_index == -1) {
LOG("Could not retrieve audio stream from buffer\n");
return -1;
}
stream = fmt_ctx->streams[stream_index];
codec = avcodec_alloc_context3(
avcodec_find_decoder(stream->codecpar->codec_id));
avcodec_parameters_to_context(codec, stream->codecpar);
err = avcodec_open2(codec, avcodec_find_decoder(codec->codec_id),
NULL);
if (err) {
LOG("Failed to open decoder for stream #%d in audio buffer\n", stream_index);
return err;
}
/* prepare resampler */
swr = swr_alloc();
av_opt_set_int(swr, "in_channel_count", codec->channels, 0);
av_opt_set_int(swr, "out_channel_count", 1, 0);
av_opt_set_int(swr, "in_channel_layout", codec->channel_layout, 0);
av_opt_set_int(swr, "out_channel_layout", AV_CH_LAYOUT_MONO, 0);
av_opt_set_int(swr, "in_sample_rate", codec->sample_rate, 0);
av_opt_set_int(swr, "out_sample_rate", WAVE_SAMPLE_RATE, 0);
av_opt_set_sample_fmt(swr, "in_sample_fmt", codec->sample_fmt, 0);
av_opt_set_sample_fmt(swr, "out_sample_fmt", AV_SAMPLE_FMT_S16, 0);
swr_init(swr);
if (!swr_is_initialized(swr)) {
LOG("Resampler has not been properly initialized\n");
return -1;
}
av_init_packet(&packet);
frame = av_frame_alloc();
if (!frame) {
LOG("Error allocating the frame\n");
return -1;
}
/* iterate through frames */
*data = NULL;
*size = 0;
while (av_read_frame(fmt_ctx, &packet) >= 0) {
avcodec_send_packet(codec, &packet);
err = avcodec_receive_frame(codec, frame);
if (err == AVERROR(EAGAIN))
continue;
convert_frame(swr, codec, frame, data, size, false);
}
/* Flush any remaining conversion buffers... */
convert_frame(swr, codec, frame, data, size, true);
av_frame_free(&frame);
swr_free(&swr);
//avio_context_free(); // todo?
avcodec_close(codec);
avformat_close_input(&fmt_ctx);
avformat_free_context(fmt_ctx);
if (avio_ctx) {
av_freep(&avio_ctx->buffer);
av_freep(&avio_ctx);
}
return 0;
}
// in mem decoding/conversion/resampling:
// ifname: input file path
// owav_data: in mem wav file. Can be forwarded as it to whisper/drwav
// return 0 on success
int ffmpeg_decode_audio(const std::string &ifname, std::vector<uint8_t>& owav_data) {
LOG("ffmpeg_decode_audio: %s\n", ifname.c_str());
int ifd = open(ifname.c_str(), O_RDONLY);
if (ifd == -1) {
fprintf(stderr, "Couldn't open input file %s\n", ifname.c_str());
return -1;
}
u8 *ibuf = NULL;
size_t ibuf_size;
int err = map_file(ifd, &ibuf, &ibuf_size);
if (err) {
LOG("Couldn't map input file %s\n", ifname.c_str());
return err;
}
LOG("Mapped input file: %s size: %d\n", ibuf, (int) ibuf_size);
struct audio_buffer inaudio_buf;
inaudio_buf.ptr = ibuf;
inaudio_buf.size = ibuf_size;
s16 *odata=NULL;
int osize=0;
err = decode_audio(&inaudio_buf, &odata, &osize);
LOG("decode_audio returned %d \n", err);
if (err != 0) {
LOG("decode_audio failed\n");
return err;
}
LOG("decode_audio output size: %d\n", osize);
wave_hdr wh;
const size_t outdatasize = osize * sizeof(s16);
set_wave_hdr(wh, outdatasize);
owav_data.resize(sizeof(wave_hdr) + outdatasize);
// header:
memcpy(owav_data.data(), &wh, sizeof(wave_hdr));
// the data:
memcpy(owav_data.data() + sizeof(wave_hdr), odata, osize* sizeof(s16));
return 0;
}

423
examples/grammar-parser.cpp Normal file
View File

@ -0,0 +1,423 @@
#include "grammar-parser.h"
#include <cstdint>
#include <cwchar>
#include <string>
#include <utility>
#include <stdexcept>
#include <exception>
namespace grammar_parser {
// NOTE: assumes valid utf8 (but checks for overrun)
// copied from whisper.cpp
static std::pair<uint32_t, const char *> decode_utf8(const char * src) {
static const int lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
uint8_t first_byte = static_cast<uint8_t>(*src);
uint8_t highbits = first_byte >> 4;
int len = lookup[highbits];
uint8_t mask = (1 << (8 - len)) - 1;
uint32_t value = first_byte & mask;
const char * end = src + len; // may overrun!
const char * pos = src + 1;
for ( ; pos < end && *pos; pos++) {
value = (value << 6) + (static_cast<uint8_t>(*pos) & 0x3F);
}
return std::make_pair(value, pos);
}
static uint32_t get_symbol_id(parse_state & state, const char * src, size_t len) {
uint32_t next_id = static_cast<uint32_t>(state.symbol_ids.size());
auto result = state.symbol_ids.insert(std::make_pair(std::string(src, len), next_id));
return result.first->second;
}
static uint32_t generate_symbol_id(parse_state & state, const std::string & base_name) {
uint32_t next_id = static_cast<uint32_t>(state.symbol_ids.size());
state.symbol_ids[base_name + '_' + std::to_string(next_id)] = next_id;
return next_id;
}
static void add_rule(
parse_state & state,
uint32_t rule_id,
const std::vector<whisper_grammar_element> & rule) {
if (state.rules.size() <= rule_id) {
state.rules.resize(rule_id + 1);
}
state.rules[rule_id] = rule;
}
static bool is_word_char(char c) {
return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || c == '-' || ('0' <= c && c <= '9');
}
static std::pair<uint32_t, const char *> parse_hex(const char * src, int size) {
const char * pos = src;
const char * end = src + size;
uint32_t value = 0;
for ( ; pos < end && *pos; pos++) {
value <<= 4;
char c = *pos;
if ('a' <= c && c <= 'f') {
value += c - 'a' + 10;
} else if ('A' <= c && c <= 'F') {
value += c - 'A' + 10;
} else if ('0' <= c && c <= '9') {
value += c - '0';
} else {
break;
}
}
if (pos != end) {
throw std::runtime_error("expecting " + std::to_string(size) + " hex chars at " + src);
}
return std::make_pair(value, pos);
}
static const char * parse_space(const char * src, bool newline_ok) {
const char * pos = src;
while (*pos == ' ' || *pos == '\t' || *pos == '#' ||
(newline_ok && (*pos == '\r' || *pos == '\n'))) {
if (*pos == '#') {
while (*pos && *pos != '\r' && *pos != '\n') {
pos++;
}
} else {
pos++;
}
}
return pos;
}
static const char * parse_name(const char * src) {
const char * pos = src;
while (is_word_char(*pos)) {
pos++;
}
if (pos == src) {
throw std::runtime_error(std::string("expecting name at ") + src);
}
return pos;
}
static std::pair<uint32_t, const char *> parse_char(const char * src) {
if (*src == '\\') {
switch (src[1]) {
case 'x': return parse_hex(src + 2, 2);
case 'u': return parse_hex(src + 2, 4);
case 'U': return parse_hex(src + 2, 8);
case 't': return std::make_pair('\t', src + 2);
case 'r': return std::make_pair('\r', src + 2);
case 'n': return std::make_pair('\n', src + 2);
case '\\':
case '"':
case '[':
case ']':
return std::make_pair(src[1], src + 2);
default:
throw std::runtime_error(std::string("unknown escape at ") + src);
}
} else if (*src) {
return decode_utf8(src);
}
throw std::runtime_error("unexpected end of input");
}
static const char * parse_alternates(
parse_state & state,
const char * src,
const std::string & rule_name,
uint32_t rule_id,
bool is_nested);
static const char * parse_sequence(
parse_state & state,
const char * src,
const std::string & rule_name,
std::vector<whisper_grammar_element> & out_elements,
bool is_nested) {
size_t last_sym_start = out_elements.size();
const char * pos = src;
while (*pos) {
if (*pos == '"') { // literal string
pos++;
last_sym_start = out_elements.size();
while (*pos != '"') {
auto char_pair = parse_char(pos);
pos = char_pair.second;
out_elements.push_back({WHISPER_GRETYPE_CHAR, char_pair.first});
}
pos = parse_space(pos + 1, is_nested);
} else if (*pos == '[') { // char range(s)
pos++;
enum whisper_gretype start_type = WHISPER_GRETYPE_CHAR;
if (*pos == '^') {
pos++;
start_type = WHISPER_GRETYPE_CHAR_NOT;
}
last_sym_start = out_elements.size();
while (*pos != ']') {
auto char_pair = parse_char(pos);
pos = char_pair.second;
enum whisper_gretype type = last_sym_start < out_elements.size()
? WHISPER_GRETYPE_CHAR_ALT
: start_type;
out_elements.push_back({type, char_pair.first});
if (pos[0] == '-' && pos[1] != ']') {
auto endchar_pair = parse_char(pos + 1);
pos = endchar_pair.second;
out_elements.push_back({WHISPER_GRETYPE_CHAR_RNG_UPPER, endchar_pair.first});
}
}
pos = parse_space(pos + 1, is_nested);
} else if (is_word_char(*pos)) { // rule reference
const char * name_end = parse_name(pos);
uint32_t ref_rule_id = get_symbol_id(state, pos, name_end - pos);
pos = parse_space(name_end, is_nested);
last_sym_start = out_elements.size();
out_elements.push_back({WHISPER_GRETYPE_RULE_REF, ref_rule_id});
} else if (*pos == '(') { // grouping
// parse nested alternates into synthesized rule
pos = parse_space(pos + 1, true);
uint32_t sub_rule_id = generate_symbol_id(state, rule_name);
pos = parse_alternates(state, pos, rule_name, sub_rule_id, true);
last_sym_start = out_elements.size();
// output reference to synthesized rule
out_elements.push_back({WHISPER_GRETYPE_RULE_REF, sub_rule_id});
if (*pos != ')') {
throw std::runtime_error(std::string("expecting ')' at ") + pos);
}
pos = parse_space(pos + 1, is_nested);
} else if (*pos == '*' || *pos == '+' || *pos == '?') { // repetition operator
if (last_sym_start == out_elements.size()) {
throw std::runtime_error(std::string("expecting preceding item to */+/? at ") + pos);
}
// apply transformation to previous symbol (last_sym_start to end) according to
// rewrite rules:
// S* --> S' ::= S S' |
// S+ --> S' ::= S S' | S
// S? --> S' ::= S |
uint32_t sub_rule_id = generate_symbol_id(state, rule_name);
std::vector<whisper_grammar_element> sub_rule;
// add preceding symbol to generated rule
sub_rule.insert(
sub_rule.end(), out_elements.begin() + last_sym_start, out_elements.end());
if (*pos == '*' || *pos == '+') {
// cause generated rule to recurse
sub_rule.push_back({WHISPER_GRETYPE_RULE_REF, sub_rule_id});
}
// mark start of alternate def
sub_rule.push_back({WHISPER_GRETYPE_ALT, 0});
if (*pos == '+') {
// add preceding symbol as alternate only for '+' (otherwise empty)
sub_rule.insert(
sub_rule.end(), out_elements.begin() + last_sym_start, out_elements.end());
}
sub_rule.push_back({WHISPER_GRETYPE_END, 0});
add_rule(state, sub_rule_id, sub_rule);
// in original rule, replace previous symbol with reference to generated rule
out_elements.resize(last_sym_start);
out_elements.push_back({WHISPER_GRETYPE_RULE_REF, sub_rule_id});
pos = parse_space(pos + 1, is_nested);
} else {
break;
}
}
return pos;
}
static const char * parse_alternates(
parse_state & state,
const char * src,
const std::string & rule_name,
uint32_t rule_id,
bool is_nested) {
std::vector<whisper_grammar_element> rule;
const char * pos = parse_sequence(state, src, rule_name, rule, is_nested);
while (*pos == '|') {
rule.push_back({WHISPER_GRETYPE_ALT, 0});
pos = parse_space(pos + 1, true);
pos = parse_sequence(state, pos, rule_name, rule, is_nested);
}
rule.push_back({WHISPER_GRETYPE_END, 0});
add_rule(state, rule_id, rule);
return pos;
}
static const char * parse_rule(parse_state & state, const char * src) {
const char * name_end = parse_name(src);
const char * pos = parse_space(name_end, false);
size_t name_len = name_end - src;
uint32_t rule_id = get_symbol_id(state, src, name_len);
const std::string name(src, name_len);
if (!(pos[0] == ':' && pos[1] == ':' && pos[2] == '=')) {
throw std::runtime_error(std::string("expecting ::= at ") + pos);
}
pos = parse_space(pos + 3, true);
pos = parse_alternates(state, pos, name, rule_id, false);
if (*pos == '\r') {
pos += pos[1] == '\n' ? 2 : 1;
} else if (*pos == '\n') {
pos++;
} else if (*pos) {
throw std::runtime_error(std::string("expecting newline or end at ") + pos);
}
return parse_space(pos, true);
}
parse_state parse(const char * src) {
try {
parse_state state;
const char * pos = parse_space(src, true);
while (*pos) {
pos = parse_rule(state, pos);
}
return state;
} catch (const std::exception & err) {
fprintf(stderr, "%s: error parsing grammar: %s\n", __func__, err.what());
return parse_state();
}
}
static void print_grammar_char(FILE * file, uint32_t c) {
if (0x20 <= c && c <= 0x7f) {
fprintf(file, "%c", static_cast<char>(c));
} else {
// cop out of encoding UTF-8
fprintf(file, "<U+%04X>", c);
}
}
static bool is_char_element(whisper_grammar_element elem) {
switch (elem.type) {
case WHISPER_GRETYPE_CHAR: return true;
case WHISPER_GRETYPE_CHAR_NOT: return true;
case WHISPER_GRETYPE_CHAR_ALT: return true;
case WHISPER_GRETYPE_CHAR_RNG_UPPER: return true;
default: return false;
}
}
static void print_rule_binary(FILE * file, const std::vector<whisper_grammar_element> & rule) {
for (auto elem : rule) {
switch (elem.type) {
case WHISPER_GRETYPE_END: fprintf(file, "END"); break;
case WHISPER_GRETYPE_ALT: fprintf(file, "ALT"); break;
case WHISPER_GRETYPE_RULE_REF: fprintf(file, "RULE_REF"); break;
case WHISPER_GRETYPE_CHAR: fprintf(file, "CHAR"); break;
case WHISPER_GRETYPE_CHAR_NOT: fprintf(file, "CHAR_NOT"); break;
case WHISPER_GRETYPE_CHAR_RNG_UPPER: fprintf(file, "CHAR_RNG_UPPER"); break;
case WHISPER_GRETYPE_CHAR_ALT: fprintf(file, "CHAR_ALT"); break;
}
switch (elem.type) {
case WHISPER_GRETYPE_END:
case WHISPER_GRETYPE_ALT:
case WHISPER_GRETYPE_RULE_REF:
fprintf(file, "(%u) ", elem.value);
break;
case WHISPER_GRETYPE_CHAR:
case WHISPER_GRETYPE_CHAR_NOT:
case WHISPER_GRETYPE_CHAR_RNG_UPPER:
case WHISPER_GRETYPE_CHAR_ALT:
fprintf(file, "(\"");
print_grammar_char(file, elem.value);
fprintf(file, "\") ");
break;
}
}
fprintf(file, "\n");
}
static void print_rule(
FILE * file,
uint32_t rule_id,
const std::vector<whisper_grammar_element> & rule,
const std::map<uint32_t, std::string> & symbol_id_names) {
if (rule.empty() || rule.back().type != WHISPER_GRETYPE_END) {
throw std::runtime_error(
"malformed rule, does not end with WHISPER_GRETYPE_END: " + std::to_string(rule_id));
}
fprintf(file, "%s ::= ", symbol_id_names.at(rule_id).c_str());
for (size_t i = 0, end = rule.size() - 1; i < end; i++) {
whisper_grammar_element elem = rule[i];
switch (elem.type) {
case WHISPER_GRETYPE_END:
throw std::runtime_error(
"unexpected end of rule: " + std::to_string(rule_id) + "," +
std::to_string(i));
case WHISPER_GRETYPE_ALT:
fprintf(file, "| ");
break;
case WHISPER_GRETYPE_RULE_REF:
fprintf(file, "%s ", symbol_id_names.at(elem.value).c_str());
break;
case WHISPER_GRETYPE_CHAR:
fprintf(file, "[");
print_grammar_char(file, elem.value);
break;
case WHISPER_GRETYPE_CHAR_NOT:
fprintf(file, "[^");
print_grammar_char(file, elem.value);
break;
case WHISPER_GRETYPE_CHAR_RNG_UPPER:
if (i == 0 || !is_char_element(rule[i - 1])) {
throw std::runtime_error(
"WHISPER_GRETYPE_CHAR_RNG_UPPER without preceding char: " +
std::to_string(rule_id) + "," + std::to_string(i));
}
fprintf(file, "-");
print_grammar_char(file, elem.value);
break;
case WHISPER_GRETYPE_CHAR_ALT:
if (i == 0 || !is_char_element(rule[i - 1])) {
throw std::runtime_error(
"WHISPER_GRETYPE_CHAR_ALT without preceding char: " +
std::to_string(rule_id) + "," + std::to_string(i));
}
print_grammar_char(file, elem.value);
break;
}
if (is_char_element(elem)) {
switch (rule[i + 1].type) {
case WHISPER_GRETYPE_CHAR_ALT:
case WHISPER_GRETYPE_CHAR_RNG_UPPER:
break;
default:
fprintf(file, "] ");
}
}
}
fprintf(file, "\n");
}
void print_grammar(FILE * file, const parse_state & state) {
try {
std::map<uint32_t, std::string> symbol_id_names;
for (auto kv : state.symbol_ids) {
symbol_id_names[kv.second] = kv.first;
}
for (size_t i = 0, end = state.rules.size(); i < end; i++) {
// fprintf(file, "%zu: ", i);
// print_rule_binary(file, state.rules[i]);
print_rule(file, uint32_t(i), state.rules[i], symbol_id_names);
// fprintf(file, "\n");
}
} catch (const std::exception & err) {
fprintf(stderr, "\n%s: error printing grammar: %s\n", __func__, err.what());
}
}
std::vector<const whisper_grammar_element *> parse_state::c_rules() const {
std::vector<const whisper_grammar_element *> ret;
for (const auto & rule : rules) {
ret.push_back(rule.data());
}
return ret;
}
}

29
examples/grammar-parser.h Normal file
View File

@ -0,0 +1,29 @@
// Implements a parser for an extended Backus-Naur form (BNF), producing the
// binary context-free grammar format specified by whisper.h. Supports character
// ranges, grouping, and repetition operators. As an example, a grammar for
// arithmetic might look like:
//
// root ::= expr
// expr ::= term ([-+*/] term)*
// term ::= num | "(" space expr ")" space
// num ::= [0-9]+ space
// space ::= [ \t\n]*
#pragma once
#include "whisper.h"
#include <vector>
#include <map>
#include <cstdint>
#include <string>
namespace grammar_parser {
struct parse_state {
std::map<std::string, uint32_t> symbol_ids;
std::vector<std::vector<whisper_grammar_element>> rules;
std::vector<const whisper_grammar_element *> c_rules() const;
};
parse_state parse(const char * src);
void print_grammar(FILE * file, const parse_state & state);
}

View File

@ -22,6 +22,7 @@ var printTextarea = (function() {
async function clearCache() {
if (confirm('Are you sure you want to clear the cache?\nAll the models will be downloaded again.')) {
indexedDB.deleteDatabase(dbName);
location.reload();
}
}
@ -33,9 +34,6 @@ async function fetchRemote(url, cbProgress, cbPrint) {
url,
{
method: 'GET',
headers: {
'Content-Type': 'application/octet-stream',
},
}
);

View File

@ -48,7 +48,7 @@ if [ -n "$3" ]; then
fi
# Whisper models
models=( "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large-v1" "large-v2" "large" )
models=( "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large-v1" "large-v2" "large-v3" "large-v3-turbo" )
# list available models
function list_models {

View File

@ -5,5 +5,5 @@ if (WHISPER_SDL2)
include(DefaultTargetOptions)
target_link_libraries(${TARGET} PRIVATE common common-sdl whisper ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE common json_cpp common-sdl whisper ${CMAKE_THREAD_LIBS_INIT})
endif ()

View File

@ -26,11 +26,11 @@ struct whisper_params {
float vad_thold = 0.6f;
float freq_thold = 100.0f;
bool speed_up = false;
bool translate = false;
bool print_special = false;
bool print_energy = false;
bool use_gpu = true;
bool flash_attn = false;
std::string language = "en";
std::string model = "models/ggml-base.en.bin";
@ -53,7 +53,7 @@ struct commandset {
void whisper_print_usage(int argc, char ** argv, const whisper_params & params);
bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
static bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
for (int i = 1; i < argc; i++) {
std::string arg = argv[i];
@ -69,11 +69,11 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
else if (arg == "-ac" || arg == "--audio-ctx") { params.audio_ctx = std::stoi(argv[++i]); }
else if (arg == "-vth" || arg == "--vad-thold") { params.vad_thold = std::stof(argv[++i]); }
else if (arg == "-fth" || arg == "--freq-thold") { params.freq_thold = std::stof(argv[++i]); }
else if (arg == "-su" || arg == "--speed-up") { params.speed_up = true; }
else if (arg == "-tr" || arg == "--translate") { params.translate = true; }
else if (arg == "-ps" || arg == "--print-special") { params.print_special = true; }
else if (arg == "-pe" || arg == "--print-energy") { params.print_energy = true; }
else if (arg == "-ng" || arg == "--no-gpu") { params.use_gpu = false; }
else if (arg == "-fa" || arg == "--flash-attn") { params.flash_attn = true; }
else if (arg == "-l" || arg == "--language") { params.language = argv[++i]; }
else if (arg == "-m" || arg == "--model") { params.model = argv[++i]; }
else {
@ -100,16 +100,16 @@ void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & para
fprintf(stderr, " -ac N, --audio-ctx N [%-7d] audio context size (0 - all)\n", params.audio_ctx);
fprintf(stderr, " -vth N, --vad-thold N [%-7.2f] voice activity detection threshold\n", params.vad_thold);
fprintf(stderr, " -fth N, --freq-thold N [%-7.2f] high-pass frequency cutoff\n", params.freq_thold);
fprintf(stderr, " -su, --speed-up [%-7s] speed up audio by x2 (reduced accuracy)\n", params.speed_up ? "true" : "false");
fprintf(stderr, " -tr, --translate [%-7s] translate from source language to english\n", params.translate ? "true" : "false");
fprintf(stderr, " -ps, --print-special [%-7s] print special tokens\n", params.print_special ? "true" : "false");
fprintf(stderr, " -pe, --print-energy [%-7s] print sound energy (for debugging)\n", params.print_energy ? "true" : "false");
fprintf(stderr, " -ng, --no-gpu [%-7s] disable GPU\n", params.use_gpu ? "false" : "true");
fprintf(stderr, " -fa, --flash-attn [%-7s] flash attention\n", params.flash_attn ? "true" : "false");
fprintf(stderr, " -l LANG, --language LANG [%-7s] spoken language\n", params.language.c_str());
fprintf(stderr, " -m FNAME, --model FNAME [%-7s] model path\n", params.model.c_str());
fprintf(stderr, "\n");
}
uint64_t wait_for_vad(audio_async & audio, json jparams, const whisper_params & params, uint64_t maxlength_ms, std::vector<float> & pcmf32) {
static uint64_t wait_for_vad(audio_async & audio, json jparams, const whisper_params & params, uint64_t maxlength_ms, std::vector<float> & pcmf32) {
using namespace std::chrono;
uint64_t time_now = time_point_cast<milliseconds>(system_clock::now()).time_since_epoch().count();
uint64_t start_time = time_now;
@ -153,7 +153,7 @@ uint64_t wait_for_vad(audio_async & audio, json jparams, const whisper_params &
return time_now;
}
json unguided_transcription(struct whisper_context * ctx, audio_async &audio, json jparams, const whisper_params &params) {
static json unguided_transcription(struct whisper_context * ctx, audio_async &audio, json jparams, const whisper_params &params) {
std::vector<whisper_token> prompt_tokens;
std::vector<float> pcmf32;
uint64_t unprocessed_audio_timestamp = wait_for_vad(audio, jparams, params, 10000U, pcmf32);
@ -181,7 +181,6 @@ json unguided_transcription(struct whisper_context * ctx, audio_async &audio, js
wparams.n_threads = params.n_threads;
wparams.audio_ctx = params.audio_ctx;
wparams.speed_up = params.speed_up;
wparams.suppress_non_speech_tokens = true;
// run the transformer and a single decoding pass
if (whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) != 0) {
@ -200,7 +199,7 @@ json unguided_transcription(struct whisper_context * ctx, audio_async &audio, js
// command-list mode
// guide the transcription to match the most likely command from a provided list
json guided_transcription(struct whisper_context * ctx, audio_async &audio, const whisper_params &params, json jparams, std::vector<struct commandset> commandset_list) {
static json guided_transcription(struct whisper_context * ctx, audio_async &audio, const whisper_params &params, json jparams, std::vector<struct commandset> commandset_list) {
struct commandset cs = commandset_list[jparams.value("commandset_index", commandset_list.size()-1)];
std::vector<float> pcmf32;
uint64_t unprocessed_audio_timestamp = wait_for_vad(audio, jparams, params, 2000U, pcmf32);
@ -220,7 +219,6 @@ json guided_transcription(struct whisper_context * ctx, audio_async &audio, cons
wparams.n_threads = params.n_threads;
wparams.audio_ctx = params.audio_ctx;
wparams.speed_up = params.speed_up;
// TODO: Do some time testing. Does an overly long prompt slow down processing?
// Set up command sets/precompute prompts
@ -287,7 +285,7 @@ json guided_transcription(struct whisper_context * ctx, audio_async &audio, cons
}
}
json register_commandset(struct whisper_context * ctx, json jparams, std::vector<struct commandset> &commandset_list) {
static json register_commandset(struct whisper_context * ctx, json jparams, std::vector<struct commandset> &commandset_list) {
// TODO: check for token collision
struct commandset cs;
@ -327,7 +325,8 @@ json register_commandset(struct whisper_context * ctx, json jparams, std::vector
commandset_list.push_back(cs);
return json{{"index",index}};
}
json seek(struct whisper_context * /*ctx*/, audio_async & /*audio*/, json /*params*/) {
static json seek(struct whisper_context * /*ctx*/, audio_async & /*audio*/, json /*params*/) {
// whisper_state has the pertinent offsets, but there also seem to be a large
// number of scratch buffers that would prevent rewinding context in a manner similar to llama
// I'll give this a another pass once everything else is implemented,
@ -337,7 +336,8 @@ json seek(struct whisper_context * /*ctx*/, audio_async & /*audio*/, json /*para
{"message", "Seeking is not yet supported."}
};
}
json parse_job(const json &body, struct whisper_context * ctx, audio_async &audio, const whisper_params &params, std::vector<struct commandset> &commandset_list) {
static json parse_job(const json &body, struct whisper_context * ctx, audio_async &audio, const whisper_params &params, std::vector<struct commandset> &commandset_list) {
// See: https://www.jsonrpc.org/specification
json id = body.at("id");
try {
@ -377,7 +377,7 @@ json parse_job(const json &body, struct whisper_context * ctx, audio_async &audi
}
}
void process_loop(struct whisper_context * ctx, audio_async &audio, const whisper_params &params) {
static void process_loop(struct whisper_context * ctx, audio_async &audio, const whisper_params &params) {
std::deque<json> jobqueue;
std::vector<struct commandset> commandset_list;
while (true) {
@ -435,8 +435,11 @@ int main(int argc, char ** argv) {
}
// whisper init
struct whisper_context_params cparams;
cparams.use_gpu = params.use_gpu;
struct whisper_context_params cparams = whisper_context_default_params();
cparams.use_gpu = params.use_gpu;
cparams.flash_attn = params.flash_attn;
struct whisper_context * ctx = whisper_init_from_file_with_params(params.model.c_str(), cparams);
// init audio

View File

@ -3,4 +3,4 @@ add_executable(${TARGET} main.cpp)
include(DefaultTargetOptions)
target_link_libraries(${TARGET} PRIVATE common whisper ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE common whisper ${FFMPEG_LIBRARIES} ${CMAKE_THREAD_LIBS_INIT})

View File

@ -17,28 +17,37 @@ options:
-d N, --duration N [0 ] duration of audio to process in milliseconds
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-sow, --split-on-word [false ] split on word rather than on token
-bo N, --best-of N [5 ] number of best candidates to keep
-bs N, --beam-size N [-1 ] beam size for beam search
-bs N, --beam-size N [5 ] beam size for beam search
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
-su, --speed-up [false ] speed up audio by x2 (reduced accuracy)
-debug, --debug-mode [false ] enable debug mode (eg. dump log_mel)
-tr, --translate [false ] translate from source language to english
-di, --diarize [false ] stereo audio diarization
-tdrz, --tinydiarize [false ] enable tinydiarize (requires a tdrz model)
-nf, --no-fallback [false ] do not use temperature fallback while decoding
-otxt, --output-txt [false ] output result in a text file
-ovtt, --output-vtt [false ] output result in a vtt file
-osrt, --output-srt [false ] output result in a srt file
-olrc, --output-lrc [false ] output result in a lrc file
-owts, --output-words [false ] output script for generating karaoke video
-fp, --font-path [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
-ocsv, --output-csv [false ] output result in a CSV file
-oj, --output-json [false ] output result in a JSON file
-ojf, --output-json-full [false ] include more information in the JSON file
-of FNAME, --output-file FNAME [ ] output file path (without file extension)
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
-pp, --print-progress [false ] print progress
-nt, --no-timestamps [true ] do not print timestamps
-nt, --no-timestamps [false ] do not print timestamps
-l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
-dl, --detect-language [false ] exit after automatically detecting language
--prompt PROMPT [ ] initial prompt
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input WAV file path
-oved D, --ov-e-device DNAME [CPU ] the OpenVINO device used for encode inference
-ls, --log-score [false ] log best decoder scores of tokens
-ng, --no-gpu [false ] disable GPU
```

View File

@ -1,10 +1,12 @@
#include "common.h"
#include "whisper.h"
#include "grammar-parser.h"
#include <cmath>
#include <fstream>
#include <cstdio>
#include <regex>
#include <string>
#include <thread>
#include <vector>
@ -14,36 +16,8 @@
#pragma warning(disable: 4244 4267) // possible loss of data
#endif
// Terminal color map. 10 colors grouped in ranges [0.0, 0.1, ..., 0.9]
// Lowest is red, middle is yellow, highest is green.
const std::vector<std::string> k_colors = {
"\033[38;5;196m", "\033[38;5;202m", "\033[38;5;208m", "\033[38;5;214m", "\033[38;5;220m",
"\033[38;5;226m", "\033[38;5;190m", "\033[38;5;154m", "\033[38;5;118m", "\033[38;5;82m",
};
// 500 -> 00:05.000
// 6000 -> 01:00.000
std::string to_timestamp(int64_t t, bool comma = false) {
int64_t msec = t * 10;
int64_t hr = msec / (1000 * 60 * 60);
msec = msec - hr * (1000 * 60 * 60);
int64_t min = msec / (1000 * 60);
msec = msec - min * (1000 * 60);
int64_t sec = msec / 1000;
msec = msec - sec * 1000;
char buf[32];
snprintf(buf, sizeof(buf), "%02d:%02d:%02d%s%03d", (int) hr, (int) min, (int) sec, comma ? "," : ".", (int) msec);
return std::string(buf);
}
int timestamp_to_sample(int64_t t, int n_samples) {
return std::max(0, std::min((int) n_samples - 1, (int) ((t*WHISPER_SAMPLE_RATE)/100)));
}
// helper function to replace substrings
void replace_all(std::string & s, const std::string & search, const std::string & replace) {
static void replace_all(std::string & s, const std::string & search, const std::string & replace) {
for (size_t pos = 0; ; pos += replace.length()) {
pos = s.find(search, pos);
if (pos == std::string::npos) break;
@ -54,22 +28,25 @@ void replace_all(std::string & s, const std::string & search, const std::string
// command-line parameters
struct whisper_params {
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
int32_t n_processors = 1;
int32_t offset_t_ms = 0;
int32_t offset_n = 0;
int32_t duration_ms = 0;
int32_t progress_step = 5;
int32_t max_context = -1;
int32_t max_len = 0;
int32_t best_of = 2;
int32_t beam_size = -1;
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
int32_t n_processors = 1;
int32_t offset_t_ms = 0;
int32_t offset_n = 0;
int32_t duration_ms = 0;
int32_t progress_step = 5;
int32_t max_context = -1;
int32_t max_len = 0;
int32_t best_of = whisper_full_default_params(WHISPER_SAMPLING_GREEDY).greedy.best_of;
int32_t beam_size = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH).beam_search.beam_size;
int32_t audio_ctx = 0;
float word_thold = 0.01f;
float entropy_thold = 2.40f;
float logprob_thold = -1.00f;
float word_thold = 0.01f;
float entropy_thold = 2.40f;
float logprob_thold = -1.00f;
float grammar_penalty = 100.0f;
float temperature = 0.0f;
float temperature_inc = 0.2f;
bool speed_up = false;
bool debug_mode = false;
bool translate = false;
bool detect_language = false;
@ -85,30 +62,49 @@ struct whisper_params {
bool output_jsn = false;
bool output_jsn_full = false;
bool output_lrc = false;
bool no_prints = false;
bool print_special = false;
bool print_colors = false;
bool print_progress = false;
bool no_timestamps = false;
bool log_score = false;
bool use_gpu = true;
bool flash_attn = false;
std::string language = "en";
std::string prompt;
std::string font_path = "/System/Library/Fonts/Supplemental/Courier New Bold.ttf";
std::string model = "models/ggml-base.en.bin";
std::string grammar;
std::string grammar_rule;
// [TDRZ] speaker turn string
std::string tdrz_speaker_turn = " [SPEAKER_TURN]"; // TODO: set from command line
// A regular expression that matches tokens to suppress
std::string suppress_regex;
std::string openvino_encode_device = "CPU";
std::string dtw = "";
std::vector<std::string> fname_inp = {};
std::vector<std::string> fname_out = {};
grammar_parser::parse_state grammar_parsed;
};
void whisper_print_usage(int argc, char ** argv, const whisper_params & params);
static void whisper_print_usage(int argc, char ** argv, const whisper_params & params);
bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
static char * whisper_param_turn_lowercase(char * in){
int string_len = strlen(in);
for (int i = 0; i < string_len; i++){
*(in+i) = tolower((unsigned char)*(in+i));
}
return in;
}
static bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
for (int i = 1; i < argc; i++) {
std::string arg = argv[i];
@ -135,10 +131,12 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
else if (arg == "-ml" || arg == "--max-len") { params.max_len = std::stoi(argv[++i]); }
else if (arg == "-bo" || arg == "--best-of") { params.best_of = std::stoi(argv[++i]); }
else if (arg == "-bs" || arg == "--beam-size") { params.beam_size = std::stoi(argv[++i]); }
else if (arg == "-ac" || arg == "--audio-ctx") { params.audio_ctx = std::stoi(argv[++i]); }
else if (arg == "-wt" || arg == "--word-thold") { params.word_thold = std::stof(argv[++i]); }
else if (arg == "-et" || arg == "--entropy-thold") { params.entropy_thold = std::stof(argv[++i]); }
else if (arg == "-lpt" || arg == "--logprob-thold") { params.logprob_thold = std::stof(argv[++i]); }
// else if (arg == "-su" || arg == "--speed-up") { params.speed_up = true; }
else if (arg == "-tp" || arg == "--temperature") { params.temperature = std::stof(argv[++i]); }
else if (arg == "-tpi" || arg == "--temperature-inc") { params.temperature_inc = std::stof(argv[++i]); }
else if (arg == "-debug"|| arg == "--debug-mode") { params.debug_mode = true; }
else if (arg == "-tr" || arg == "--translate") { params.translate = true; }
else if (arg == "-di" || arg == "--diarize") { params.diarize = true; }
@ -155,18 +153,25 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
else if (arg == "-oj" || arg == "--output-json") { params.output_jsn = true; }
else if (arg == "-ojf" || arg == "--output-json-full"){ params.output_jsn_full = params.output_jsn = true; }
else if (arg == "-of" || arg == "--output-file") { params.fname_out.emplace_back(argv[++i]); }
else if (arg == "-np" || arg == "--no-prints") { params.no_prints = true; }
else if (arg == "-ps" || arg == "--print-special") { params.print_special = true; }
else if (arg == "-pc" || arg == "--print-colors") { params.print_colors = true; }
else if (arg == "-pp" || arg == "--print-progress") { params.print_progress = true; }
else if (arg == "-nt" || arg == "--no-timestamps") { params.no_timestamps = true; }
else if (arg == "-l" || arg == "--language") { params.language = argv[++i]; }
else if (arg == "-l" || arg == "--language") { params.language = whisper_param_turn_lowercase(argv[++i]); }
else if (arg == "-dl" || arg == "--detect-language") { params.detect_language = true; }
else if ( arg == "--prompt") { params.prompt = argv[++i]; }
else if (arg == "-m" || arg == "--model") { params.model = argv[++i]; }
else if (arg == "-f" || arg == "--file") { params.fname_inp.emplace_back(argv[++i]); }
else if (arg == "-oved" || arg == "--ov-e-device") { params.openvino_encode_device = argv[++i]; }
else if (arg == "-ls" || arg == "--log-score") { params.log_score = true; }
else if (arg == "-ng" || arg == "--no-gpu") { params.use_gpu = false; }
else if (arg == "-dtw" || arg == "--dtw") { params.dtw = argv[++i]; }
else if (arg == "-ls" || arg == "--log-score") { params.log_score = true; }
else if (arg == "-ng" || arg == "--no-gpu") { params.use_gpu = false; }
else if (arg == "-fa" || arg == "--flash-attn") { params.flash_attn = true; }
else if ( arg == "--suppress-regex") { params.suppress_regex = argv[++i]; }
else if ( arg == "--grammar") { params.grammar = argv[++i]; }
else if ( arg == "--grammar-rule") { params.grammar_rule = argv[++i]; }
else if ( arg == "--grammar-penalty") { params.grammar_penalty = std::stof(argv[++i]); }
else {
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
whisper_print_usage(argc, argv, params);
@ -177,7 +182,7 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
return true;
}
void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & params) {
static void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & params) {
fprintf(stderr, "\n");
fprintf(stderr, "usage: %s [options] file0.wav file1.wav ...\n", argv[0]);
fprintf(stderr, "\n");
@ -193,10 +198,12 @@ void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & para
fprintf(stderr, " -sow, --split-on-word [%-7s] split on word rather than on token\n", params.split_on_word ? "true" : "false");
fprintf(stderr, " -bo N, --best-of N [%-7d] number of best candidates to keep\n", params.best_of);
fprintf(stderr, " -bs N, --beam-size N [%-7d] beam size for beam search\n", params.beam_size);
fprintf(stderr, " -ac N, --audio-ctx N [%-7d] audio context size (0 - all)\n", params.audio_ctx);
fprintf(stderr, " -wt N, --word-thold N [%-7.2f] word timestamp probability threshold\n", params.word_thold);
fprintf(stderr, " -et N, --entropy-thold N [%-7.2f] entropy threshold for decoder fail\n", params.entropy_thold);
fprintf(stderr, " -lpt N, --logprob-thold N [%-7.2f] log probability threshold for decoder fail\n", params.logprob_thold);
// fprintf(stderr, " -su, --speed-up [%-7s] speed up audio by x2 (reduced accuracy)\n", params.speed_up ? "true" : "false");
fprintf(stderr, " -tp, --temperature N [%-7.2f] The sampling temperature, between 0 and 1\n", params.temperature);
fprintf(stderr, " -tpi, --temperature-inc N [%-7.2f] The increment of temperature, between 0 and 1\n",params.temperature_inc);
fprintf(stderr, " -debug, --debug-mode [%-7s] enable debug mode (eg. dump log_mel)\n", params.debug_mode ? "true" : "false");
fprintf(stderr, " -tr, --translate [%-7s] translate from source language to english\n", params.translate ? "true" : "false");
fprintf(stderr, " -di, --diarize [%-7s] stereo audio diarization\n", params.diarize ? "true" : "false");
@ -212,18 +219,25 @@ void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & para
fprintf(stderr, " -oj, --output-json [%-7s] output result in a JSON file\n", params.output_jsn ? "true" : "false");
fprintf(stderr, " -ojf, --output-json-full [%-7s] include more information in the JSON file\n", params.output_jsn_full ? "true" : "false");
fprintf(stderr, " -of FNAME, --output-file FNAME [%-7s] output file path (without file extension)\n", "");
fprintf(stderr, " -np, --no-prints [%-7s] do not print anything other than the results\n", params.no_prints ? "true" : "false");
fprintf(stderr, " -ps, --print-special [%-7s] print special tokens\n", params.print_special ? "true" : "false");
fprintf(stderr, " -pc, --print-colors [%-7s] print colors\n", params.print_colors ? "true" : "false");
fprintf(stderr, " -pp, --print-progress [%-7s] print progress\n", params.print_progress ? "true" : "false");
fprintf(stderr, " -nt, --no-timestamps [%-7s] do not print timestamps\n", params.no_timestamps ? "true" : "false");
fprintf(stderr, " -l LANG, --language LANG [%-7s] spoken language ('auto' for auto-detect)\n", params.language.c_str());
fprintf(stderr, " -dl, --detect-language [%-7s] exit after automatically detecting language\n", params.detect_language ? "true" : "false");
fprintf(stderr, " --prompt PROMPT [%-7s] initial prompt\n", params.prompt.c_str());
fprintf(stderr, " --prompt PROMPT [%-7s] initial prompt (max n_text_ctx/2 tokens)\n", params.prompt.c_str());
fprintf(stderr, " -m FNAME, --model FNAME [%-7s] model path\n", params.model.c_str());
fprintf(stderr, " -f FNAME, --file FNAME [%-7s] input WAV file path\n", "");
fprintf(stderr, " -oved D, --ov-e-device DNAME [%-7s] the OpenVINO device used for encode inference\n", params.openvino_encode_device.c_str());
fprintf(stderr, " -dtw MODEL --dtw MODEL [%-7s] compute token-level timestamps\n", params.dtw.c_str());
fprintf(stderr, " -ls, --log-score [%-7s] log best decoder scores of tokens\n", params.log_score?"true":"false");
fprintf(stderr, " -ng, --no-gpu [%-7s] disable GPU\n", params.use_gpu ? "false" : "true");
fprintf(stderr, " -fa, --flash-attn [%-7s] flash attention\n", params.flash_attn ? "true" : "false");
fprintf(stderr, " --suppress-regex REGEX [%-7s] regular expression matching tokens to suppress\n", params.suppress_regex.c_str());
fprintf(stderr, " --grammar GRAMMAR [%-7s] GBNF grammar to guide decoding\n", params.grammar.c_str());
fprintf(stderr, " --grammar-rule RULE [%-7s] top-level GBNF grammar rule name\n", params.grammar_rule.c_str());
fprintf(stderr, " --grammar-penalty N [%-7.1f] scales down logits of nongrammar tokens\n", params.grammar_penalty);
fprintf(stderr, "\n");
}
@ -234,12 +248,12 @@ struct whisper_print_user_data {
int progress_prev;
};
std::string estimate_diarization_speaker(std::vector<std::vector<float>> pcmf32s, int64_t t0, int64_t t1, bool id_only = false) {
static std::string estimate_diarization_speaker(std::vector<std::vector<float>> pcmf32s, int64_t t0, int64_t t1, bool id_only = false) {
std::string speaker = "";
const int64_t n_samples = pcmf32s[0].size();
const int64_t is0 = timestamp_to_sample(t0, n_samples);
const int64_t is1 = timestamp_to_sample(t1, n_samples);
const int64_t is0 = timestamp_to_sample(t0, n_samples, WHISPER_SAMPLE_RATE);
const int64_t is1 = timestamp_to_sample(t1, n_samples, WHISPER_SAMPLE_RATE);
double energy0 = 0.0f;
double energy1 = 0.0f;
@ -266,7 +280,8 @@ std::string estimate_diarization_speaker(std::vector<std::vector<float>> pcmf32s
return speaker;
}
void whisper_print_progress_callback(struct whisper_context * /*ctx*/, struct whisper_state * /*state*/, int progress, void * user_data) {
static void whisper_print_progress_callback(struct whisper_context * /*ctx*/, struct whisper_state * /*state*/, int progress, void * user_data) {
int progress_step = ((whisper_print_user_data *) user_data)->params->progress_step;
int * progress_prev = &(((whisper_print_user_data *) user_data)->progress_prev);
if (progress >= *progress_prev + progress_step) {
@ -275,7 +290,7 @@ void whisper_print_progress_callback(struct whisper_context * /*ctx*/, struct wh
}
}
void whisper_print_segment_callback(struct whisper_context * ctx, struct whisper_state * /*state*/, int n_new, void * user_data) {
static void whisper_print_segment_callback(struct whisper_context * ctx, struct whisper_state * /*state*/, int n_new, void * user_data) {
const auto & params = *((whisper_print_user_data *) user_data)->params;
const auto & pcmf32s = *((whisper_print_user_data *) user_data)->pcmf32s;
@ -344,7 +359,7 @@ void whisper_print_segment_callback(struct whisper_context * ctx, struct whisper
}
}
bool output_txt(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
static bool output_txt(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
std::ofstream fout(fname);
if (!fout.is_open()) {
fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname);
@ -371,7 +386,7 @@ bool output_txt(struct whisper_context * ctx, const char * fname, const whisper_
return true;
}
bool output_vtt(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
static bool output_vtt(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
std::ofstream fout(fname);
if (!fout.is_open()) {
fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname);
@ -403,7 +418,7 @@ bool output_vtt(struct whisper_context * ctx, const char * fname, const whisper_
return true;
}
bool output_srt(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
static bool output_srt(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
std::ofstream fout(fname);
if (!fout.is_open()) {
fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname);
@ -432,7 +447,7 @@ bool output_srt(struct whisper_context * ctx, const char * fname, const whisper_
return true;
}
char *escape_double_quotes_and_backslashes(const char *str) {
static char * escape_double_quotes_and_backslashes(const char * str) {
if (str == NULL) {
return NULL;
}
@ -445,7 +460,7 @@ char *escape_double_quotes_and_backslashes(const char *str) {
}
}
char *escaped = (char *)calloc(escaped_length, 1); // pre-zeroed
char * escaped = (char *)calloc(escaped_length, 1); // pre-zeroed
if (escaped == NULL) {
return NULL;
}
@ -463,7 +478,39 @@ char *escape_double_quotes_and_backslashes(const char *str) {
return escaped;
}
bool output_csv(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
// double quote should be escaped by another double quote. (rfc4180)
static char * escape_double_quotes_in_csv(const char * str) {
if (str == NULL) {
return NULL;
}
size_t escaped_length = strlen(str) + 1;
for (size_t i = 0; str[i] != '\0'; i++) {
if (str[i] == '"') {
escaped_length++;
}
}
char *escaped = (char *)calloc(escaped_length, 1); // pre-zeroed
if (escaped == NULL) {
return NULL;
}
size_t pos = 0;
for (size_t i = 0; str[i] != '\0'; i++) {
if (str[i] == '"') {
escaped[pos++] = '"';
}
escaped[pos++] = str[i];
}
// no need to set zero due to calloc() being used prior
return escaped;
}
static bool output_csv(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
std::ofstream fout(fname);
if (!fout.is_open()) {
fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname);
@ -484,7 +531,7 @@ bool output_csv(struct whisper_context * ctx, const char * fname, const whisper_
const char * text = whisper_full_get_segment_text(ctx, i);
const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
char * text_escaped = escape_double_quotes_and_backslashes(text);
char * text_escaped = escape_double_quotes_in_csv(text);
//need to multiply times returned from whisper_full_get_segment_t{0,1}() by 10 to get milliseconds.
fout << 10 * t0 << "," << 10 * t1 << ",";
@ -498,7 +545,7 @@ bool output_csv(struct whisper_context * ctx, const char * fname, const whisper_
return true;
}
bool output_score(struct whisper_context * ctx, const char * fname, const whisper_params & /*params*/, std::vector<std::vector<float>> /*pcmf32s*/) {
static bool output_score(struct whisper_context * ctx, const char * fname, const whisper_params & /*params*/, std::vector<std::vector<float>> /*pcmf32s*/) {
std::ofstream fout(fname);
fprintf(stderr, "%s: saving output to '%s'\n", __func__, fname);
@ -517,7 +564,7 @@ bool output_score(struct whisper_context * ctx, const char * fname, const whispe
return true;
}
bool output_json(
static bool output_json(
struct whisper_context * ctx,
const char * fname,
const whisper_params & params,
@ -663,7 +710,8 @@ bool output_json(
times_o(token.t0, token.t1, false);
}
value_i("id", token.id, false);
value_f("p", token.p, true);
value_f("p", token.p, false);
value_f("t_dtw", token.t_dtw, true);
end_obj(j == (n - 1));
}
end_arr(!params.diarize && !params.tinydiarize);
@ -687,7 +735,7 @@ bool output_json(
// karaoke video generation
// outputs a bash script that uses ffmpeg to generate a video with the subtitles
// TODO: font parameter adjustments
bool output_wts(struct whisper_context * ctx, const char * fname, const char * fname_inp, const whisper_params & params, float t_sec, std::vector<std::vector<float>> pcmf32s) {
static bool output_wts(struct whisper_context * ctx, const char * fname, const char * fname_inp, const whisper_params & params, float t_sec, std::vector<std::vector<float>> pcmf32s) {
std::ofstream fout(fname);
fprintf(stderr, "%s: saving output to '%s'\n", __func__, fname);
@ -812,7 +860,7 @@ bool output_wts(struct whisper_context * ctx, const char * fname, const char * f
return true;
}
bool output_lrc(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
static bool output_lrc(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
std::ofstream fout(fname);
if (!fout.is_open()) {
fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname);
@ -852,14 +900,59 @@ bool output_lrc(struct whisper_context * ctx, const char * fname, const whisper_
return true;
}
static void cb_log_disable(enum ggml_log_level , const char * , void * ) { }
int main(int argc, char ** argv) {
whisper_params params;
// If the only argument starts with "@", read arguments line-by-line
// from the given file.
std::vector<std::string> vec_args;
if (argc == 2 && argv != nullptr && argv[1] != nullptr && argv[1][0] == '@') {
// Save the name of the executable.
vec_args.push_back(argv[0]);
// Open the response file.
char const * rspfile = argv[1] + sizeof(char);
std::ifstream fin(rspfile);
if (fin.is_open() == false) {
fprintf(stderr, "error: response file '%s' not found\n", rspfile);
return 1;
}
// Read the entire response file.
std::string line;
while (std::getline(fin, line)) {
vec_args.push_back(line);
}
// Use the contents of the response file as the command-line arguments.
argc = static_cast<int>(vec_args.size());
argv = static_cast<char **>(alloca(argc * sizeof (char *)));
for (int i = 0; i < argc; ++i) {
argv[i] = const_cast<char *>(vec_args[i].c_str());
}
}
if (whisper_params_parse(argc, argv, params) == false) {
whisper_print_usage(argc, argv, params);
return 1;
}
// remove non-existent files
for (auto it = params.fname_inp.begin(); it != params.fname_inp.end();) {
const auto fname_inp = it->c_str();
if (*it != "-" && !is_file_exist(fname_inp)) {
fprintf(stderr, "error: input file not found '%s'\n", fname_inp);
it = params.fname_inp.erase(it);
continue;
}
it++;
}
if (params.fname_inp.empty()) {
fprintf(stderr, "error: no input files specified\n");
whisper_print_usage(argc, argv, params);
@ -878,10 +971,39 @@ int main(int argc, char ** argv) {
exit(0);
}
if (params.no_prints) {
whisper_log_set(cb_log_disable, NULL);
}
// whisper init
struct whisper_context_params cparams;
cparams.use_gpu = params.use_gpu;
struct whisper_context_params cparams = whisper_context_default_params();
cparams.use_gpu = params.use_gpu;
cparams.flash_attn = params.flash_attn;
if (!params.dtw.empty()) {
cparams.dtw_token_timestamps = true;
cparams.dtw_aheads_preset = WHISPER_AHEADS_NONE;
if (params.dtw == "tiny") cparams.dtw_aheads_preset = WHISPER_AHEADS_TINY;
if (params.dtw == "tiny.en") cparams.dtw_aheads_preset = WHISPER_AHEADS_TINY_EN;
if (params.dtw == "base") cparams.dtw_aheads_preset = WHISPER_AHEADS_BASE;
if (params.dtw == "base.en") cparams.dtw_aheads_preset = WHISPER_AHEADS_BASE_EN;
if (params.dtw == "small") cparams.dtw_aheads_preset = WHISPER_AHEADS_SMALL;
if (params.dtw == "small.en") cparams.dtw_aheads_preset = WHISPER_AHEADS_SMALL_EN;
if (params.dtw == "medium") cparams.dtw_aheads_preset = WHISPER_AHEADS_MEDIUM;
if (params.dtw == "medium.en") cparams.dtw_aheads_preset = WHISPER_AHEADS_MEDIUM_EN;
if (params.dtw == "large.v1") cparams.dtw_aheads_preset = WHISPER_AHEADS_LARGE_V1;
if (params.dtw == "large.v2") cparams.dtw_aheads_preset = WHISPER_AHEADS_LARGE_V2;
if (params.dtw == "large.v3") cparams.dtw_aheads_preset = WHISPER_AHEADS_LARGE_V3;
if (params.dtw == "large.v3.turbo") cparams.dtw_aheads_preset = WHISPER_AHEADS_LARGE_V3_TURBO;
if (cparams.dtw_aheads_preset == WHISPER_AHEADS_NONE) {
fprintf(stderr, "error: unknown DTW preset '%s'\n", params.dtw.c_str());
return 3;
}
}
struct whisper_context * ctx = whisper_init_from_file_with_params(params.model.c_str(), cparams);
@ -893,6 +1015,29 @@ int main(int argc, char ** argv) {
// initialize openvino encoder. this has no effect on whisper.cpp builds that don't have OpenVINO configured
whisper_ctx_init_openvino_encoder(ctx, nullptr, params.openvino_encode_device.c_str(), nullptr);
if (!params.grammar.empty()) {
auto & grammar = params.grammar_parsed;
if (is_file_exist(params.grammar.c_str())) {
// read grammar from file
std::ifstream ifs(params.grammar.c_str());
const std::string txt = std::string((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
grammar = grammar_parser::parse(txt.c_str());
} else {
// read grammar from string
grammar = grammar_parser::parse(params.grammar.c_str());
}
// will be empty (default) if there are parse errors
if (grammar.rules.empty()) {
fprintf(stderr, "error: failed to parse grammar \"%s\"\n", params.grammar.c_str());
return 4;
} else {
fprintf(stderr, "%s: grammar:\n", __func__);
grammar_parser::print_grammar(stderr, grammar);
fprintf(stderr, "\n");
}
}
for (int f = 0; f < (int) params.fname_inp.size(); ++f) {
const auto fname_inp = params.fname_inp[f];
const auto fname_out = f < (int) params.fname_out.size() && !params.fname_out[f].empty() ? params.fname_out[f] : params.fname_inp[f];
@ -905,29 +1050,28 @@ int main(int argc, char ** argv) {
continue;
}
// print system information
{
if (!whisper_is_multilingual(ctx)) {
if (params.language != "en" || params.translate) {
params.language = "en";
params.translate = false;
fprintf(stderr, "%s: WARNING: model is not multilingual, ignoring language and translation options\n", __func__);
}
}
if (params.detect_language) {
params.language = "auto";
}
if (!params.no_prints) {
// print system information
fprintf(stderr, "\n");
fprintf(stderr, "system_info: n_threads = %d / %d | %s\n",
params.n_threads*params.n_processors, std::thread::hardware_concurrency(), whisper_print_system_info());
}
// print some info about the processing
{
// print some info about the processing
fprintf(stderr, "\n");
if (!whisper_is_multilingual(ctx)) {
if (params.language != "en" || params.translate) {
params.language = "en";
params.translate = false;
fprintf(stderr, "%s: WARNING: model is not multilingual, ignoring language and translation options\n", __func__);
}
}
if (params.detect_language) {
params.language = "auto";
}
fprintf(stderr, "%s: processing '%s' (%d samples, %.1f sec), %d threads, %d processors, lang = %s, task = %s, %stimestamps = %d ...\n",
fprintf(stderr, "%s: processing '%s' (%d samples, %.1f sec), %d threads, %d processors, %d beams + best of %d, lang = %s, task = %s, %stimestamps = %d ...\n",
__func__, fname_inp.c_str(), int(pcmf32.size()), float(pcmf32.size())/WHISPER_SAMPLE_RATE,
params.n_threads, params.n_processors,
params.n_threads, params.n_processors, params.beam_size, params.best_of,
params.language.c_str(),
params.translate ? "translate" : "transcribe",
params.tinydiarize ? "tdrz = 1, " : "",
@ -940,7 +1084,8 @@ int main(int argc, char ** argv) {
{
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
wparams.strategy = params.beam_size > 1 ? WHISPER_SAMPLING_BEAM_SEARCH : WHISPER_SAMPLING_GREEDY;
const bool use_grammar = (!params.grammar_parsed.rules.empty() && !params.grammar_rule.empty());
wparams.strategy = (params.beam_size > 1 || use_grammar) ? WHISPER_SAMPLING_BEAM_SEARCH : WHISPER_SAMPLING_GREEDY;
wparams.print_realtime = false;
wparams.print_progress = params.print_progress;
@ -958,23 +1103,43 @@ int main(int argc, char ** argv) {
wparams.thold_pt = params.word_thold;
wparams.max_len = params.output_wts && params.max_len == 0 ? 60 : params.max_len;
wparams.split_on_word = params.split_on_word;
wparams.audio_ctx = params.audio_ctx;
wparams.speed_up = params.speed_up;
wparams.debug_mode = params.debug_mode;
wparams.tdrz_enable = params.tinydiarize; // [TDRZ]
wparams.suppress_regex = params.suppress_regex.empty() ? nullptr : params.suppress_regex.c_str();
wparams.initial_prompt = params.prompt.c_str();
wparams.greedy.best_of = params.best_of;
wparams.beam_search.beam_size = params.beam_size;
wparams.temperature_inc = params.no_fallback ? 0.0f : wparams.temperature_inc;
wparams.temperature_inc = params.no_fallback ? 0.0f : params.temperature_inc;
wparams.temperature = params.temperature;
wparams.entropy_thold = params.entropy_thold;
wparams.logprob_thold = params.logprob_thold;
wparams.no_timestamps = params.no_timestamps;
whisper_print_user_data user_data = { &params, &pcmf32s, 0 };
const auto & grammar_parsed = params.grammar_parsed;
auto grammar_rules = grammar_parsed.c_rules();
if (use_grammar) {
if (grammar_parsed.symbol_ids.find(params.grammar_rule) == grammar_parsed.symbol_ids.end()) {
fprintf(stderr, "%s: warning: grammar rule '%s' not found - skipping grammar sampling\n", __func__, params.grammar_rule.c_str());
} else {
wparams.grammar_rules = grammar_rules.data();
wparams.n_grammar_rules = grammar_rules.size();
wparams.i_start_rule = grammar_parsed.symbol_ids.at(params.grammar_rule);
wparams.grammar_penalty = params.grammar_penalty;
}
}
// this callback is called on each new segment
if (!wparams.print_realtime) {
wparams.new_segment_callback = whisper_print_segment_callback;
@ -1071,7 +1236,9 @@ int main(int argc, char ** argv) {
}
}
whisper_print_timings(ctx);
if (!params.no_prints) {
whisper_print_timings(ctx);
}
whisper_free(ctx);
return 0;

View File

@ -0,0 +1,7 @@
import whisper_processor
try:
result = whisper_processor.process_audio("./audio/wake_word_detected16k.wav", "base.en")
print(result)
except Exception as e:
print(f"Error: {e}")

View File

@ -0,0 +1,54 @@
import subprocess
import sys
import os
def process_audio(wav_file, model_name="base.en"):
"""
Processes an audio file using a specified model and returns the processed string.
:param wav_file: Path to the WAV file
:param model_name: Name of the model to use
:return: Processed string output from the audio processing
:raises: Exception if an error occurs during processing
"""
model = f"./models/ggml-{model_name}.bin"
# Check if the file exists
if not os.path.exists(model):
raise FileNotFoundError(f"Model file not found: {model} \n\nDownload a model with this command:\n\n> bash ./models/download-ggml-model.sh {model_name}\n\n")
if not os.path.exists(wav_file):
raise FileNotFoundError(f"WAV file not found: {wav_file}")
full_command = f"./main -m {model} -f {wav_file} -nt"
# Execute the command
process = subprocess.Popen(full_command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# Get the output and error (if any)
output, error = process.communicate()
if error:
raise Exception(f"Error processing audio: {error.decode('utf-8')}")
# Process and return the output string
decoded_str = output.decode('utf-8').strip()
processed_str = decoded_str.replace('[BLANK_AUDIO]', '').strip()
return processed_str
def main():
if len(sys.argv) >= 2:
wav_file = sys.argv[1]
model_name = sys.argv[2] if len(sys.argv) == 3 else "base.en"
try:
result = process_audio(wav_file, model_name)
print(result)
except Exception as e:
print(f"Error: {e}")
else:
print("Usage: python whisper_processor.py <wav_file> [<model_name>]")
if __name__ == "__main__":
main()

View File

@ -36,7 +36,7 @@ struct whisper_filters {
};
// quantize a model
bool whisper_model_quantize(const std::string & fname_inp, const std::string & fname_out, ggml_ftype ftype) {
static bool whisper_model_quantize(const std::string & fname_inp, const std::string & fname_out, ggml_ftype ftype) {
gpt_vocab vocab;
printf("%s: loading model from '%s'\n", __func__, fname_inp.c_str());

Some files were not shown because too many files have changed in this diff Show More