Fix conversion of unnormalized BF16->BF16 weights (llama/7843)

* add truncate_bf16 * truncate intermediate fp32 if converting bf16 to bf16 * fix masking in __compute_fp32_to_bf16 * np.int16 no longer used * missing cast and additional numpy 2.x fix * ggml-impl : do not flush bf16 subnormals to zero * ggml : add reference fp32 to bf16 conversion The fast version is no longer equivalent for all platforms because of the handling of subnormal values. * gguf-py : remove flush to zero for bf16 subnormals * gguf-py : remove float32 truncation to bf16 Rounding achieves the same thing in the cases where this was used. * missed prototype update in merge * merge cleanup --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-06-16 22:08:07 +00:00 · 2024-08-02 21:11:39 +02:00
parent 9cf14ebcbc
commit 6cb38c3673
3 changed files with 13 additions and 8 deletions
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@ -349,6 +349,7 @@ extern "C" {
    GGML_API ggml_bf16_t ggml_fp32_to_bf16(float);
    GGML_API float       ggml_bf16_to_fp32(ggml_bf16_t);  // consider just doing << 16
    GGML_API void        ggml_bf16_to_fp32_row(const ggml_bf16_t *, float *, int64_t);
+    GGML_API void        ggml_fp32_to_bf16_row_ref(const float *, ggml_bf16_t *, int64_t);
    GGML_API void        ggml_fp32_to_bf16_row(const float *, ggml_bf16_t *, int64_t);

    struct ggml_object;