* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.
- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduce a full pipeline sync which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader read/writes and transfers seems to be sufficient looking at the code which either launches compute kernels or copies tensors.
* Fix small typo
---------
Co-authored-by: 0cc4m <picard12@live.de>
This change updates the -pc flag, so that a new xterm256 color scheme is
used. This color scheme is believed to be better for three reasons:
1. It should be friendlier to the colorblind. The scheme was designed by
Paul Tol (see: https://personal.sron.nl/~pault/). TensorBoard uses it
since 2017, so it's already popular in the machine learning community
2. It should appear to be the same colors as before to people who aren't
i.e. it's still a red-green spectrum like before but lightly modified
3. It is readable in both white and black background terminals. The neon
colors before were probably a bit too intense for white backgrounds.
* ggml: support forward pass broadcasting in ggml_sub
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
* Use assert instead of GGML_ASSERT in ggml_compute_forward_sub_f32
The check is already performed in ggml_sub_impl
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
---------
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
* ggml : reading the runtime sve config of the cpu
* change to one time init to prevent performance drop
* prefix variable to avoid possible conflicts
* revert xxhash fix and add brackets
---------
Co-authored-by: domke <673751-domke@users.noreply.gitlab.com>
* add truncate_bf16
* truncate intermediate fp32 if converting bf16 to bf16
* fix masking in __compute_fp32_to_bf16
* np.int16 no longer used
* missing cast and additional numpy 2.x fix
* ggml-impl : do not flush bf16 subnormals to zero
* ggml : add reference fp32 to bf16 conversion
The fast version is no longer equivalent for all platforms
because of the handling of subnormal values.
* gguf-py : remove flush to zero for bf16 subnormals
* gguf-py : remove float32 truncation to bf16
Rounding achieves the same thing in the cases where this was used.
* missed prototype update in merge
* merge cleanup
---------
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
* Adding support for unified memory
* adding again the documentation about unified memory
* refactoring: Moved the unified memory code in the correct location.
* Fixed compilation error when using hipblas
* cleaning up the documentation
* Updating the documentation
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* adding one more case where the PR should not be enabled
---------
Co-authored-by: matteo serva <matteo.serva@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
In these codes, we want to retain the value that they previously held
when mask[i] is false. So we should use undisturbed. With the default
agnostic policy of rvv intrinsic, these values can be held or be
written with 1s.
Co-authored-by: carter.li <carter.li@starfivetech.com>
* Update doc for MUSA
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Add GGML_MUSA in Makefile
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Add GGML_MUSA in CMake
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* CUDA => MUSA
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* MUSA adds support for __vsubss4
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Fix CI build failure
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>