ggml : group all experts in a single ggml_mul_mat_id (llama/6505)

* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy

* cuda : fix bin bcast with non-cont src0

* test-backend-ops : only run all mul mat tests for base types

* llama : disable moe offloading with SYCL

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Author:       slaren
Date:         2024-04-18 15:18:48 +02:00
Committed by: Georgi Gerganov
Parent:       c97796aa0f
Commit:       c96b0a938e
8 changed files with 730 additions and 681 deletions

ggml.h

@ -1170,13 +1170,11 @@ extern "C" {
enum ggml_prec prec);
// indirect matrix multiplication
// ggml_mul_mat_id(ctx, as, ids, id, b) ~= ggml_mul_mat(as[ids[id]], b)
GGML_API struct ggml_tensor * ggml_mul_mat_id(
struct ggml_context * ctx,
struct ggml_tensor * as,
struct ggml_tensor * ids,
int id,
struct ggml_tensor * b);
struct ggml_tensor * b,
struct ggml_tensor * ids);
// A: m columns, n rows,
// B: p columns, n rows,
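
The hunk above drops the per-expert `id` argument and moves `ids` to the end, so one call covers every selected expert for every token instead of one call per expert slot. Below is a minimal sketch of how a caller might use the new signature; the helper name moe_up_proj, the tensor names, and the implied shapes follow common llama.cpp MoE conventions but are assumptions, not part of this commit.

// Sketch only: a single ggml_mul_mat_id call for all selected experts.
#include "ggml.h"

static struct ggml_tensor * moe_up_proj(
        struct ggml_context * ctx,
        struct ggml_tensor  * up_exps,   // stacked expert weight matrices, one per expert along the 3rd dim
        struct ggml_tensor  * cur,       // token activations
        struct ggml_tensor  * selected)  // I32 tensor of selected expert indices per token
{
    // Old API (before this commit), one call per expert slot i:
    //   ggml_mul_mat_id(ctx, up_exps, selected, i, cur);
    // New API: a single call evaluates every selected expert for the whole batch.
    return ggml_mul_mat_id(ctx, up_exps, cur, selected);
}

In a real MoE feed-forward block the result would still be weighted by the router probabilities and summed over the selected experts; that part is unchanged by this commit and is omitted here.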