whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-12-24 06:46:37 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	0775374750	metal : fix unused warnings (llama/0)	2024-02-19 15:53:24 +02:00
Kawrakow	2c1ad21ba8	1.5 bit quantization (llama/5453) * iq1_s: WIP basics * iq1_s: CUDA is working * iq1_s: scalar CPU dot product * iq1_s: WIP AVX2 dot product - something is not right * Fix tests * Fix shadow warnings * Fix after merge with latest master * iq1_s: AVX2 finally works * iq1_s: ARM_NEON dot product. Works, but not very fast * iq1_s: better grid * iq1_s: use IQ2_XXS for attn_output At a cost of 0.04 extra bpw this gives a big improvement in PPL. * iq1_s: Metal basics Dequantize works, but not dot product * iq1_s: Metal works, but quite slow As usual, Apple Silicon does not like the code I write. * iq1_s: Tests * iq1_s: slightly faster dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-19 15:53:23 +02:00
Georgi Gerganov	eca5ff9868	ggml : add ALiBi support for ggml_soft_max_ext (llama/5488)	2024-02-19 15:53:23 +02:00
Georgi Gerganov	9b735cea77	metal : add im2col F32 dst support (llama/5132)	2024-02-10 09:55:46 +02:00
Kawrakow	361ecebe90	ggml : fix IQ3_XXS on Metal (llama/5219) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-30 21:28:00 +02:00
Kawrakow	6adb969b09	SOTA 3-bit quants (llama/5196) * iq3_xxs: quantize/dequantize RMSE seems a bit high-ish at about half-way between q2_K and q3_K, so need to check more. * iq3_xxs: CUDA dequantize works * iq2_xxs: tuning quantization * iq3_xxs: starting to look better PPL on wiki.test.raw LLaMA-v1-7B: 6.4218 LLaMA-v2-7B: 6.3560 Mistral-7B : 6.0717 This is better than Q3_K_XS, with a 5% reduction in quantized model size. * iq3_xxs: CUDA dot product We have PP-512: 5891 t/s TG-128: 143.9 t/s * iq3_xxs: scalar and AVX2 dot products * iq3_xxs: ARM_NEON and Metal Metal performance is decent, ARM_NEON is pathetic * iq3_xxs: slightly better grid points * Faster iq3_xxs and iq2_xs dot products on CUDA * iq3_xxs: add some quant mix * iq3_xxs: fix failing quantization test Dot product still fails. Is this real? * iq3_xxs: hopefully fix ROCm * iq3_xxs: failing tests This time the dot product accuracy did find an actual bug in the AVX2 implementation. * Add IQ3_XXS to test-backend-ops --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-30 21:27:59 +02:00
Kawrakow	97b12212dd	ggml : SOTA 2-bit quants (add IQ2_XS) (llama/4856) * iq2_xs: basics * iq2_xs: this should have been in the basics * iq2_xs: CUDA and scalar CPU works * iq2_xs: WIP Metal * iq2_xs: Metal now works * iq2_xs: working, but dog slow, ARM_NEON dot product * iq2_xs: better ARM_NEON dot product We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU. * iq2_xs: AVX2 dot product - 19.5 t/s * iq2_xs: faster AVX2 dit product 21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version. * iq2_xs: had forgotten to delete iq2-data.h * Add llama enum for IQ2_XS --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:50:01 +02:00
Georgi Gerganov	a0a64a19dd	metal : improve dequantize precision to match CPU (llama/4836) ggml-ci	2024-01-11 21:50:01 +02:00
Kawrakow	10651bddf6	SOTA 2-bit quants (llama/4773) * iq2_xxs: basics * iq2_xxs: scalar and AVX2 dot products Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX implementation becomes very awkward. The alternative would have been to use Q8_0 instead. Perhaps I'll change later, for now this is what we have. * iq2_xxs: ARM_NEON dot product Somehow strangely slow (112 ms/token). * iq2_xxs: WIP Metal Dequantize works, something is still wrong with the dot product. * iq2_xxs: Metal dot product now works We have PP-512 = 475 t/s TG-128 = 47.3 t/s Not the greatest performance, but not complete garbage either. * iq2_xxs: slighty faster dot product TG-128 is now 48.4 t/s * iq2_xxs: slighty faster dot product TG-128 is now 50.9 t/s * iq2_xxs: even faster Metal dot product TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared memory has a bigger impact than the grid values being in shared memory. * iq2_xxs: dequantize CUDA kernel - fix conflict with master * iq2_xxs: quantized CUDA dot product (MMVQ) We get TG-128 = 153.1 t/s * iq2_xxs: slightly faster CUDA dot product TG-128 is now at 155.1 t/s. * iq2_xxs: add to llama ftype enum * iq2_xxs: fix MoE on Metal * Fix missing MMQ ops when on hipBLAS I had put the ggml_supports_mmq call at the wrong place. * Fix bug in qequantize_row_iq2_xxs The 0.25f factor was missing. Great detective work by @ggerganov! * Fixing tests * PR suggestion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:50:00 +02:00
Georgi Gerganov	4cdd9aad9b	metal : add kernel_get_rows_i32 ggml-ci	2024-01-03 14:43:51 +02:00
Georgi Gerganov	f38c057503	metal : optimize ggml_mul_mat_id (faster Mixtral PP) (llama/4725) * ggml : disable fast-math for Metal (cmake build only) ggml-ci * metal : fix Metal API debug warnings * cmake : add -fno-inline for Metal build (llama/4545) * metal : fix API debug warnings * metal : fix compile warnings * metal : use uint64_t for strides * cmake : rename option to LLAMA_METAL_SHADER_DEBUG * metal : fix mat-vec Q8_0 kernel for BS > 1 * metal : normalize mat-vec kernel signatures * cmake : respect LLAMA_QKK_64 option * metal : fix mat-vec Q4_K kernel for QK_K == 64 * metal : optimizing ggml_mul_mat_id (wip) * metal : minor fix * metal : opt mul_mm_id	2024-01-03 14:43:51 +02:00
Georgi Gerganov	1e5544b39b	metal : enable shader debugging (cmake option) (llama/4705) * ggml : disable fast-math for Metal (cmake build only) ggml-ci * metal : fix Metal API debug warnings * cmake : add -fno-inline for Metal build (llama/4545) * metal : fix API debug warnings * metal : fix compile warnings * metal : use uint64_t for strides * cmake : rename option to LLAMA_METAL_SHADER_DEBUG * metal : fix mat-vec Q8_0 kernel for BS > 1 * metal : normalize mat-vec kernel signatures * cmake : respect LLAMA_QKK_64 option * metal : fix mat-vec Q4_K kernel for QK_K == 64 ggml-ci	2024-01-03 14:43:51 +02:00
Georgi Gerganov	3a5302108d	sync : ggml (ggml_scale, ggml_row_size, etc.) (#1677 ) * sync : ggml * sync : llama.cpp * talk-llama : fix obsolete param * ggml-alloc : fix ggml_tallocr_is_own * talk.wasm : update to new ggml * ggml : fix type punning in ggml_scale * ggml : cuda jetson + arm quants warnings	2023-12-22 17:53:39 +02:00
Georgi Gerganov	8171e621fc	sync : ggml (Metal fixes, new ops, tests) (#1633 ) * sync : ggml (Metal fixes, new ops, tests) * cuda : fix bin bcast when src1 and dst have different types	2023-12-13 21:55:03 +02:00
Georgi Gerganov	afce6fa113	sync : ggml (new ops, new backend, etc) (#1602 ) * sync : ggml (new ops, new backend, etc) * whisper : remove obsolete broadcasting code * ggml : remove backend self-registers + fix ggml_concat + n_task logic * metal : fix assert * metal : print resource path * whisper : fix bug if metal init fails	2023-12-07 22:27:19 +02:00
Georgi Gerganov	b0502836b8	whisper : add full CUDA and Metal offloading (#1472 ) * whisper : migrate to ggml-backend * whisper : fix logit reading * whisper : fix tensor allocation during load * whisper : fix beam-search with CUDA * whisper : free backends + fix compile warning * whisper : print when CUDA is enabled * whisper : fix CoreML * make : clean-up * talk : fix compile warning * whisper : support ggml_conv with CUDA and Metal (#1473) * ggml : add CUDA support for ggml_conv * whisper : remove ggml_repeat for conv bias + single backend * cuda : fix im2col kernel * metal : add im2col support + mul mat-vec f16 x f16 * bench-all : add q4 models * whisper : clean-up * quantize-all : fix * ggml : im2col opts * whisper : avoid whisper_model_data wrapper * whisper : add note that ggml_mul_mat_pad does not work with CUDA * whisper : factor out graph compute in common function * whisper : fixes * whisper : fix UB with measure buffers * whisper : try to fix the parallel whisper_state functionality (#1479) * whisper : try to fix the parallel whisper_state functionality * whisper : fix multi-state Metal * whisper : free backend instances in whisper_state	2023-11-12 15:31:08 +02:00
Georgi Gerganov	f96e1c5b78	sync : ggml (backend v2, k-quants, CUDA opts, Metal opts, etc.) (#1422 ) * sync : ggml (backend v2, k-quants, CUDA opts, Metal opts, etc.) * metal : allow env metal variable to override resource path (#1415) * Allow env variable to override resource path * Update ggml-metal.m --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * sync : restore common / main from `master` * sync : restore whisper from `master` * talk-llama : update to latest llama.cpp * ruby : fix build * ggml : fix 32-bit ARM build * ggml : fix MIN / MAX macro collisions + update ios bindings * ggml : fix ifdefs and MIN / MAX again * exampels : fix Obj-C and Swift examples * ggml : fix 32-bit ARM compatibility * ggml : one more attempt to fix 32-bit ARM compat * whisper : fix support for larger graphs --------- Co-authored-by: Chris Raethke <codesoda@users.noreply.github.com>	2023-11-03 21:35:05 +02:00
Georgi Gerganov	b8432f28f4	metal : add F32 support + update bench output	2023-09-15 13:56:08 +03:00
Georgi Gerganov	93935980f8	whisper : Metal and ggml-alloc support (#1270 ) * metal : init * whisper : factor out graph builds * whisper : allocate encoder and decoder using ggml-alloc * whisper : ggml-alloc is now supported * whisper : CoreML support ggml-alloc * build : fix ggml-alloc * ios : update submodule * extra : update sync-ggml.sh script to also sync ggml-alloc * ci : see if this is causing the crash * whisper : refactor ggml-alloc init * whisper.android : try to fix build * whisper : initial Metal version * ci : try to debug vmem issue * metal : decoder works on GPU! * metal : add multi-decoder support * ggml : fix ggml_nbytes (probably temp solution) * metal : run "cross" step on the GPU * whisper : remove ggml_repeat in the encoder * whisper : offload the Encoder to Metal * ggml : use simpler ggml_bytes() implementation * ggml-alloc : try to make CI happy by reducing vram to 128GB * whisper : add whisper_allocr to wrap ggml_allocr * whisper : factor out alloc init in a function * cmake : update to support Metal build * whisper : add <functional> header * objc : fix build (no Metal yet) * ios : add Metal support * swiftui : fix build * metal : speed-up KQ multiplication * metal : sync latest llama.cpp kernels * readme : add Metal info * ios : update submodule * coreml : add code to toggle Core ML config (CPU, ANE, GPU) * bench : fix timings by running a pre-heat * bench : start benching the decoder * whisper : add ggml_mul_mat_pad * bench : fix uninitialized vars * whisper : add comment for disabling mul-mat padding * whisper : add description of ggml_mul_mat_pad * whisper : clean-up ggml_mul_mat_pad * metal : remove the "concurrent" flag * bench : variable n_past * ios : update SPM package	2023-09-15 12:18:18 +03:00
Georgi Gerganov	b39809668a	sync : ggml (HBM + Metal + style) (#1264 )	2023-09-08 17:58:31 +03:00
Georgi Gerganov	c3f319d7c2	ggml : sync latest llama.cpp (view_src + alloc improvements) (#1247 ) * ggml : sync latest llama.cpp (view_src + alloc improvements) * ggml : fix build	2023-09-05 20:57:27 +03:00
Georgi Gerganov	59a3d0cb57	ggml : sync (ggml-alloc, GPU, eps, etc.) (#1220 ) * ggml : sync (ggml-alloc, GPU, eps, etc.) * ggml : fix build * wasm : fix build	2023-09-05 13:54:40 +03:00
Georgi Gerganov	d6509bf78d	ggml : sync latest repo (mostly refactoring changes)	2023-07-02 21:46:09 +03:00
Georgi Gerganov	f1c9df5806	metal : sync ggml-metal (ref #1047 )	2023-06-25 15:40:39 +03:00

24 Commits