ggml : group all experts in a single ggml_mul_mat_id (llama/6505)

* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy

* cuda : fix bin bcast with non-cont src0

* test-backend-ops : only run all mul mat tests for base types

* llama : disable moe offloading with SYCL

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Author:       slaren
Date:         2024-04-18 15:18:48 +02:00
Committed by: Georgi Gerganov
Parent:       c97796aa0f
Commit:       c96b0a938e
8 changed files with 730 additions and 681 deletions

ggml.h

@ -1170,13 +1170,11 @@ extern "C" {
enum ggml_prec prec);
// indirect matrix multiplication
// ggml_mul_mat_id(ctx, as, ids, id, b) ~= ggml_mul_mat(as[ids[id]], b)
GGML_API struct ggml_tensor * ggml_mul_mat_id(
struct ggml_context * ctx,
struct ggml_tensor * as,
struct ggml_tensor * ids,
int id,
struct ggml_tensor * b);
struct ggml_tensor * b,
struct ggml_tensor * ids);
// A: m columns, n rows,
// B: p columns, n rows,
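
The hunk above drops the per-expert `id` argument and moves `ids` to the end, so one call covers every selected expert for every token instead of one call per expert slot. Below is a minimal sketch of how a caller might use the new signature; the helper name moe_up_proj, the tensor names, and the implied shapes follow common llama.cpp MoE conventions but are assumptions, not part of this commit.

// Sketch only: a single ggml_mul_mat_id call for all selected experts.
#include "ggml.h"

static struct ggml_tensor * moe_up_proj(
        struct ggml_context * ctx,
        struct ggml_tensor  * up_exps,   // stacked expert weight matrices, one per expert along the 3rd dim
        struct ggml_tensor  * cur,       // token activations
        struct ggml_tensor  * selected)  // I32 tensor of selected expert indices per token
{
    // Old API (before this commit), one call per expert slot i:
    //   ggml_mul_mat_id(ctx, up_exps, selected, i, cur);
    // New API: a single call evaluates every selected expert for the whole batch.
    return ggml_mul_mat_id(ctx, up_exps, cur, selected);
}

In a real MoE feed-forward block the result would still be weighted by the router probabilities and summed over the selected experts; that part is unchanged by this commit and is omitted here.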