Commit Graph

1196 Commits

Author SHA1 Message Date
JidongZhang-THU
12c462d656
llava : add MobileVLM support (llama/5132)
* New Feature:
    1. Sum_Rows:
        fix cuda kernel overflow
        fix block shape error when nrows too big
    2. Im2Col:
        Support Batch in cuda
        Support f32 to f32 both in cpu && cuda
    3. DepthWiseConv:
        Support by Im2Col && MulMat
    4. Pool_2d:
        Supoort avg pooling in cuda
    5. HardSigmoid:
        Imp in cuda
    6. HardSwish:
        Imp in cuda

* fix tabs instead of spaces

* code clean

* CUDA POOL2D

* ADD POOL2D test case in test-backend-ops.cpp

* code clean

* fix pool2d_kernel

nits

* fix bug in pool2d kernel

* fix avg pooling, count_include_pad

nits

* test-backend-ops : add more pool_2d tests

* cuda : fix warnings and formatting

* ggml : check types in release builds too in pool_2d

* test-backend-ops : remove f16 pool_2d tests

* cuda : more style fixes

* Add assert in ggml_cuda_op_pool2d

* pool2d float padding fallback

* test-backend-ops : add dst_type to im2col

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-10 09:55:46 +02:00
slaren
fc7b0e2c28
ggml : limit n_threads to the max n_tasks (llama/5238) 2024-02-10 09:55:46 +02:00
Jared Van Bortel
f850a067ed
kompute : llama-bench support and ggml_cpu_has_kompute() (llama/5226) 2024-02-10 09:55:46 +02:00
Michael Podvitskiy
f75e1197f1
ggml : add abort_callback for cpu backend (ggml/725)
* a way to use abort_callback with the cpu backend

* whisper update
2024-02-10 09:55:46 +02:00
Georgi Gerganov
aa8a75e287
extra : update sync scripts 2024-02-10 09:55:19 +02:00
Valentin Gosu
80e8a2ea39
server : allow CORS request with authorization headers ()
Whisper plugin in Obsidian requires an API key which is
then sent as an authorization header.
However, the presence of an authorization header requires
a CORS Preflight, so both the OPTIONS method and
the Access-Control-Allow-Headers: authorization must be
handled.
2024-02-09 17:42:41 +02:00
Neuman Vong
19f8048139
whisper.android : how to build with CLBlast ()
* FetchContent

* OpenCL

* Documentation and make optional

* Specify GGML build options in build.gradle

* Use gradle properties

* @ggerganov

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* @gpokat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-09 17:39:05 +02:00
Didzis Gosko
0f80e5a80a
whisper : expose CUDA device setting in public API ()
* Makefile : allow to override CUDA_ARCH_FLAG

* whisper : allow to select GPU (CUDA) device from public API
2024-02-09 17:27:47 +02:00
Didzis Gosko
b6559333ff
make : add macOS deployment target option () 2024-02-09 17:26:29 +02:00
Georgi Gerganov
434b8f3b96
talk-llama : stream response () 2024-02-06 19:56:12 +02:00
Georgi Gerganov
7a74e929c8
sync : ggml () 2024-01-30 21:30:26 +02:00
Kawrakow
361ecebe90
ggml : fix IQ3_XXS on Metal (llama/5219)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30 21:28:00 +02:00
Georgi Gerganov
807cbc672e
sync : ggml (llama/0) 2024-01-30 21:27:59 +02:00
Kawrakow
98ae5276b7
Faster AVX2 dot product for IQ2_XS (llama/5187)
* iq2xs: faster AVX2 dot product

* iq2xs: small AVX2 imrovement

* Speed up computing sign bits in AVX2 iq2_xs dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Peter Reid <peter@peterreid.net>
2024-01-30 21:27:59 +02:00
Kawrakow
6adb969b09
SOTA 3-bit quants (llama/5196)
* iq3_xxs: quantize/dequantize

RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent, ARM_NEON is pathetic

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy did find an actual bug
in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30 21:27:59 +02:00
Paul Tsochantaris
8a7d6ff51a
ggml alloc: Fix for null dereference on alloc failure (llama/5200)
* Fix for a null pointer dereference if a metal GGML buffer fails to be allocated

* Freeing the allocated buffers rather than the pointer in ggml-alloc.c

* Fixed the fix of the fix
2024-01-30 21:27:59 +02:00
Jared Van Bortel
25f650a8e8
Nomic Vulkan backend (llama/4456)
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: niansa <anton-sa@web.de>
Co-authored-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Aaron Miller <apage43@ninjawhale.com>
Co-authored-by: ToKiNoBug <tokinobug@163.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-01-30 21:27:59 +02:00
slaren
44e517f074
ggml : add max buffer sizes to opencl and metal backends (llama/5181) 2024-01-30 21:27:59 +02:00
Paul Tsochantaris
cb9de61659
metal : free metal objects (llama/5161)
* Releasing MTLFunction references after Metal pipeline construction

* Keeping the `ggml_metal_kernel` structure

* Spacing fix

* Whitespace fix
2024-01-30 21:27:59 +02:00
Georgi Gerganov
a2ef80d66f
gguf : fix comparison (ggml/715)
ggml-ci
2024-01-30 21:27:59 +02:00
John Balis
baa190446a
ggml_cuda_cpy support for 4d tensors and float16->float32 upcasting (ggml/686)
* added cuda float16->float32 upcasting to ggml_cuda_cpy

* added ability to copy 4d tensors with the cuda backend

* added tests for float16_>float32 upcast and 4d tensor cuda copys

* added 4d copy test for float32->float16 copy

* applied patch suggested by @iamlemec

* simplify cpy tests

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-30 21:27:59 +02:00
Georgi Gerganov
8f5220d81f
gguf : add input validation, prevent integer overflows (ggml/709)
* gguf : add input validation, prevent integer overflows

ggml-ci

* gguf : fix switch default case

* gguf : sanitize info->n_dims and info->type

ggml-ci

* gguf : assert GGUF_TYPE_SIZE access

ggml-ci

* ggml : assert mallocs are successful

ggml-ci

* gguf : prevent integer overflow

* gguf : sanitize tensor info

ggml-ci

* gguf : stricter limit on the number of items

ggml-ci
2024-01-30 21:27:58 +02:00
Georgi Gerganov
8e391fcf3a
ci : fix yolo URLs + fix metal capture (ggml/712) 2024-01-30 21:27:58 +02:00
Jack Mousseau
593657054e
metal : add debug capture backend function (ggml/694)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-30 21:27:58 +02:00
JacobLinCool
ae5c4f7340
common : fix wav buffer detection () 2024-01-30 19:35:08 +02:00
JacobLinCool
baa30bacdb
server : add fields to verbose_json response ()
* server: include additional fields in the verbose_json response as OpenAI does

* server: show request examples on home page

* server: todo note for compression_ratio and no_speech_prob

* server: add simple demo form to the homepage
2024-01-30 14:15:55 +02:00
jwijffels
3e6fad07aa
make : update MSYS_NT ()
I just upgraded the R wrapper at https://github.com/bnosac/audio.whisper to use whisper.cpp 1.5.4
I'm working on Windows and noticed while doing that that it did not pick up the relevant CFLAGS/CXXFLAGS as my system showed

```
I whisper.cpp build info: 
I UNAME_S:  MSYS_NT-10.0-19045
I UNAME_P:  unknown
I UNAME_M:  x86_64
```

Many thanks for all the tremendous hard work on maintaining whisper.cpp!
2024-01-30 14:13:49 +02:00
Georgi Gerganov
e72e4158de
talk-llama : sync llama.cpp 2024-01-28 19:44:10 +02:00
Georgi Gerganov
bd41733db2
sync : ggml 2024-01-28 19:30:32 +02:00
0cc4m
23c648e98d
ggml : add Vulkan backend (llama/2059)
* Vulkan loader code

* Fix matmul kernel, continue implementation

* Continue implementation

* Vulkan memory management

* Vulkan development

* Matmul call

* Add aligned malloc and free for VMA

* Continue implementation

* First matmul success

* GEMM Kernel optimization

* 1D Blocktiling

* 2D Blocktiling

* Write coalescing

* Continue vulkan implementation and optimization

* First FP16 attempt, disabled for now

* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel

* Enable device extensions properly, restore fp16 matmul op

* Fix mulmat_f16

* Output FP32 in fp16 matmul shader

* Fix f16_to_f32 kernel

* dequant_q4_0 kernel

* Add VMA library

* Avoid requesting dedicated memory, VMA can decide that by itself

* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly

* add cmake commands

* Add 2d write operation, profiling code

* Fix 2d write

* Fix queue selection for AMD RADV

* Fix trailing whitespace in vk_mem_alloc.h

* Add WIP warp tile mat mul shaders

* Disable glslc optimization

* Disable glslc optimization for CMake

* Optimize warptile matmul shader, replace blocktile with it

* Add split-k optimization for small matrix multiplication

Use semaphores for synchronization instead of fences or waitidle

Rework async write/read for synchronization

* Fix validation errors, improve compatibility with AMD GPUs

* Rework command buffer handling

* Variable matmul kernel using specialization constants

* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints

* Reuse semaphores

* Handle stage flags during command buffer submission properly

* Increase matmul test runs for consistent results

* Fix F32 matmul

* Add vectorized loading and zeropadding for matrix multiplication

* Use pinned memory for f16 preprocessing

* Don't force aligned matmul

* Don't free before queue done

* Replace VMA library with native Vulkan buffer management

* Basic offloading support with mul_f32 and dmmv for q4_0

* Run glslc commands in parallel

* Unroll loops in dmmv shader

* Reduce usage of waitIdle

* Reuse pinned allocation for f16 conversion

* Handle devices with only a single queue

* Fix trailing whitespace in CMakeLists.txt

* Allow parallel execution of kernels, parallelize third and fourth dimension calls

* Add fallback for devices only supporting one DescriptorSet per DescriptorPool

* Move to graph function similar to CUDA implementation

* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function

* Add F32 dmmv shaders

* Batch submissions

* Add .spv to gitignore

* Split off matrix vector multiplication for separate optimization

* Use single command buffer for matrix vector multiplication ops

* Reduce overhead of mul_f32 calls by using a single command buffer

* Add submission batching to mul_f32

* Fix tests

* Add missing barrier

* Add further missing barrier

* Add further ops

* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions

* Remove unnecessary cblas link

* Fix descriptor set pre-allocation assert

* Add runtime shader compilation, start transferring shaders to this approach

* Transfer remaining shaders to header and compile on runtime

* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16

* Add support for q4_1, q5_0, q5_1 and q8_0

* Remove unnecessary scalar layout extension

* Parse graph early to pre-record command buffers

* Add q6_k support

* Add multi-submit for command buffers

* Fix q6_k dequant shader for AMD

* Fix q6_k for GPUs without fp16 support

* Simplify q6_k fp16 fix

* Minor fixes

* Fix wg_denom of m-mulmat shaders

* Add Python-based Vulkan shader generator

* Replace shaderc dependency with precompiled shaders

Fix python script to generate shaders

* Clean up code

* Fix shader generator script Windows compatibility

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>

* Close file before deletion

* Fix vulkan shader fp32 name

* Add q2_k and q3_k support

Add validation check to compare shader results to cpu results

* Add q4_k support

* Add q5_k support

* Bake SPIR-V bytecode into the library instead of loading shaders from file

* Switch to signal semaphores for flexibility

Prepare broadcasting support for mul mat

* Finish broadcasting mul mat support for GQA

* Clean up unused functions

Add repeat op

* Add further ops, not yet enabled. Improve semaphore code

* Reduce number of used semaphores by utilizing timelines more properly

* Remove queue information

* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations

* Add Vulkan to llama-bench

* Remove cblas dependency

* Fix matmul k-split bug

* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader

* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug

* Fix issues with float16 overflows in shaders

* Fix issues with older Vulkan headers on Ubuntu 22.04

* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers

* Implement further ops, rework op_f32 calls, fix bugs

* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code

* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders

* Merge upstream changes, fix conflicts, adapt soft_max op

* Fix Python and shader header format

* Free model gpu buffers on exit

* Use single queue per device to simplify code

* Add matmul shader support for running multiple calculations in parallel

* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible

* Fix missing event cast

* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity

* Fix warning about empty C function parameters

* Fix compiler warnings

* Properly implement Vulkan backend buffer handling

* Fix oversized host staging buffers

* Simplify barrier synchronization calls

* Fix gcc warnings

* Implement max_size for backend buffer types to limit the size of a single allocation

* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size

* refactor multi buf

* Disable unsupported ops to fix tests

* Check for maintenance4 support before using it

* Handle devices with only a single queue

* Fix single queue logic

* propagate buffer usage in multi buffers

* Implement rope_neox op

* Cleanup header and other files

* Simplify gpu_extras by removing events and putting staging memcpys into contexts

* Move queue into context

Add not-yet-enabled async backend ops

* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization

* Add get_max_size to SYCL backend.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 19:30:20 +02:00
Abhilash Majumder
75ab2d06f5
ggml : add unified SYCL backend for Intel GPUs (llama/2690)
* first update for migration

* update init_cublas

* add debug functio, commit all help code

* step 1

* step 2

* step3 add fp16, slower 31->28

* add GGML_LIST_DEVICE function

* step 5 format device and print

* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue

* support main device is non-zero

* step7 add debug for code path, rm log

* step 8, rename all macro & func from cuda by sycl

* fix error of select non-zero device, format device list

* ren ggml-sycl.hpp -> ggml-sycl.h

* clear CMAKE to rm unused lib and options

* correct queue: rm dtct:get_queue

* add print tensor function to debug

* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481

* summary dpct definition in one header file to replace folder:dpct

* refactor device log

* mv dpct definition from folder dpct to ggml-sycl.h

* update readme, refactor build script

* fix build with sycl

* set nthread=1 when sycl, increase performance

* add run script, comment debug code

* add ls-sycl-device tool

* add ls-sycl-device, rm unused files

* rm rear space

* dos2unix

* Update README_sycl.md

* fix return type

* remove sycl version from include path

* restore rm code to fix hang issue

* add syc and link for sycl readme

* rm original sycl code before refactor

* fix code err

* add know issue for pvc hang issue

* enable SYCL_F16 support

* align pr4766

* check for sycl blas, better performance

* cleanup 1

* remove extra endif

* add build&run script, clean CMakefile, update guide by review comments

* rename macro to intel hardware

* editor config format

* format fixes

* format fixes

* editor format fix

* Remove unused headers

* skip build sycl tool for other code path

* replace tab by space

* fix blas matmul function

* fix mac build

* restore hip dependency

* fix conflict

* ren as review comments

* mv internal function to .cpp file

* export funciton print_sycl_devices(), mv class dpct definition to source file

* update CI/action for sycl code, fix CI error of repeat/dup

* fix action ID format issue

* rm unused strategy

* enable llama_f16 in ci

* fix conflict

* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml

* fix ci cases for unsupported data type

* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL

* revert hip cmake changes

* fix indent

* add prefix in func name

* revert no mmq

* rm cpu blas duplicate

* fix no_new_line

* fix src1->type==F16 bug.

* pass batch offset for F16 src1

* fix batch error

* fix wrong code

* revert sycl checking in test-sampling

* pass void as arguments of ggml_backend_sycl_print_sycl_devices

* remove extra blank line in test-sampling

* revert setting n_threads in sycl

* implement std::isinf for icpx with fast math.

* Update ci/run.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add copyright and MIT license declare

* update the cmd example

---------

Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 19:30:20 +02:00
Georgi Gerganov
adc099edee
ggml : minor type fix (int64_t -> size_t) 2024-01-28 19:30:17 +02:00
Georgi Gerganov
52cce82493
common : fix input buffer check () 2024-01-27 17:33:09 +02:00
Georgi Gerganov
ef3c9ed9eb
talk-llama : sync llama.cpp 2024-01-27 17:24:53 +02:00
Georgi Gerganov
7fe3ed5e00
sync : ggml 2024-01-27 17:23:25 +02:00
0cc4m
6061241292
Add OpenCL add kernel (llama/5151)
* Add OpenCL add kernel

* Put add kernel into different string to stay within MSVC string length limit, disable float16 support due to bad results
2024-01-27 17:19:52 +02:00
slaren
0878ab7c15
cuda : fix tensor size calculation for non-split buffer (llama/5145) 2024-01-27 17:19:52 +02:00
slaren
c65edd5b64
ggml-alloc : add 10% margin to the buffer sizes (llama/5149) 2024-01-27 17:19:52 +02:00
snadampal
3c8d14e9c5
ggml : update softmax n_task calculation (llama/5126)
updated the n_task calculation to use max number of
threads possible. This has improved the prompt eval
performance by around 5% for DOT kernels and by
around 10% for MMLA kernels on AWS Graviton3.
2024-01-27 17:19:52 +02:00
Paul Tsochantaris
c3977cb2ce
metal : remove unused n_buffers and buffers (llama/5129) 2024-01-27 17:19:52 +02:00
Georgi Gerganov
6da1661bc2
metal : show compile log messages 2024-01-27 17:19:51 +02:00
Engininja2
cc56540661
cuda : fix 2-bit quants on amd hip (llama/5105)
* cuda : fix 2-bit quants on amd hip

* use __low2float intrinsic function for new quants
2024-01-27 17:19:51 +02:00
slaren
94c1ae8668
llama : pre-allocate input tensors in a separate buffer (llama/5100) 2024-01-27 17:19:51 +02:00
Georgi Gerganov
55d54359e0
metal : disable support for MUL_MAT F32 x F16 2024-01-27 17:19:51 +02:00
Johannes Gäßler
d33c2ad354
CUDA: more info when no device code (llama/5088) 2024-01-27 17:19:51 +02:00
Georgi Gerganov
9afa7ff624
minor : clean-up some warnings and style (llama/5094)
* minor : clean-up some warnings and style

ggml-ci

* ggml : add comment
2024-01-27 17:19:51 +02:00
Reinforce-II
0649289f02
ggml : parallelize FP32 conversion when using BLAS (llama/5045)
* make GGML_TASK_INIT phase can be run in multithread

* multithreaded dequantize in mul_mat when using blas library

* minor fixes

* update outdated comment
* fix coding style

* simplify code

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-27 17:19:51 +02:00
XiaotaoChen
aaeaa43878
llava : MobileVLM support (llama/4954)
* MobileVLM native implementation

* delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake

* move android script to example/llava directory

* Fix the editor config checks

---------

Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
2024-01-27 17:19:51 +02:00
slaren
078b8e23bf
llama : run all KQV ops on the CPU with no KV offload (llama/5049)
ggml-ci
2024-01-27 17:19:51 +02:00
Kylin
74da3e1757
cuda : fix compile error in jetson platform (llama/4975)
* cuda: fix compile error in jetson platform

* cuda: update comment in ggml-cuda.cu

* cuda: update ggml-cuda.cu comment
2024-01-27 17:19:50 +02:00