coreml : set convert_to="mlprogram" in convert

* coreml : skip model load in convert-whisper-to-coreml.py

This commit updates the conversion process for Whisper models to use the "mlprogram" format instead of "neuralnetwork".

The motivation for this change is that the "neuralnetwork" format produces a model based on protobuf, and my understanding is that this format has limitations, such as the sizes of strings and the complexity of the model. Currently, converting larger models such as large-v3 fails, while conversion succeeds for smaller models.

The "mlprogram" format is a more recent addition to CoreML and is designed to be more flexible and powerful, allowing for more complex models and larger data types. It seems to work for larger and smaller models alike, and unless there are considerations that I'm not aware of, I think this is what we should be using moving forward.

The error that is generated for large models is the following:

```console
Running MIL backend_neuralnetwork pipeline: 100%|█████████| 9/9 [00:00<00:00, 35.44 passes/s]
Translating MIL ==> NeuralNetwork Ops: 100%|███████████| 5641/5641 [03:31<00:00, 26.65 ops/s]
Traceback (most recent call last):
  File "/Users/danbev/work/ai/whisper-work/models/convert-whisper-to-coreml.py", line 322, in <module>
    encoder = convert_encoder(hparams, encoder, quantize=args.quantize)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/danbev/work/ai/whisper-work/models/convert-whisper-to-coreml.py", line 255, in convert_encoder
    model = ct.convert(
            ^^^^^^^^^^^
  File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/converters/_converters_entry.py", line 635, in convert
    mlmodel = mil_convert(
              ^^^^^^^^^^^^
  File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 186, in mil_convert
    return _mil_convert(
           ^^^^^^^^^^^^^
  File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 245, in _mil_convert
    return modelClass(
           ^^^^^^^^^^^
  File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/models/model.py", line 489, in __init__
    self.__proxy__, self._spec, self._framework_error = self._get_proxy_and_spec(
                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/models/model.py", line 550, in _get_proxy_and_spec
    _MLModelProxy(
ValueError: basic_string
```

Refs: https://github.com/ggml-org/whisper.cpp/issues/3012
ci : disable freeBSD job in build.yml (#3064)
2025-04-24 13:06:09 +00:00 · 2025-04-23 08:24:38 +02:00 · 2025-04-22 11:07:54 +02:00 · 2025-04-20 19:40:25 +02:00 · 2025-04-17 18:49:58 +09:00 · 2025-04-16 06:24:38 +02:00
63 changed files with 14301 additions and 11045 deletions
--- a/.github/workflows/bindings-ruby.yml
+++ b/.github/workflows/bindings-ruby.yml
@ -1,55 +1,11 @@
 name: Bindings Tests (Ruby)
 on:
  push:
-    paths:
+    branches:
-      - bindings/ruby/**
+      - master
      - src/**/*.c
      - src/**/*.cpp
      - src/**/*.h
      - src/**/*.m
      - src/**/*.metal
      - include/**/*.c
      - include/**/*.cpp
      - include/**/*.h
      - include/**/*.m
      - include/**/*.metal
      - ggml/**/*.c
      - ggml/**/*.cpp
      - ggml/**/*.h
      - ggml/**/*.m
      - ggml/**/*.metal
      - scripts/get-flags.mk
      - examples/common.h
      - examples/common.cpp
      - examples/common-whisper.h
      - examples/common-whisper.cpp
      - examples/stb_vorbis.c
      - examples/miniaudio.h
  pull_request:
-    paths:
+    types: [opened, synchronize, reopened]
      - bindings/ruby/**
      - src/**/*.c
      - src/**/*.cpp
      - src/**/*.h
      - src/**/*.m
      - src/**/*.metal
      - include/**/*.c
      - include/**/*.cpp
      - include/**/*.h
      - include/**/*.m
      - include/**/*.metal
      - ggml/**/*.c
      - ggml/**/*.cpp
      - ggml/**/*.h
      - ggml/**/*.m
      - ggml/**/*.metal
      - scripts/get-flags.mk
      - examples/common.h
      - examples/common.cpp
      - examples/common-whisper.h
      - examples/common-whisper.cpp
      - examples/stb_vorbis.c
      - examples/miniaudio.h
 jobs:
  ubuntu-22:
@ -60,6 +16,6 @@ jobs:
    steps:
      - uses: ruby/setup-ruby@v1
        with:
-          ruby-version: '3.1'
+          ruby-version: '3.2'
      - uses: actions/checkout@v4
      - run: rake test
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@ -200,23 +200,23 @@ jobs:
          cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
-  freeBSD-latest:
+#  freeBSD-latest:
-    runs-on: macos-13
+#    runs-on: macos-13
-
+#
-    steps:
+#    steps:
-      - name: Clone
+#      - name: Clone
-        uses: actions/checkout@v4
+#        uses: actions/checkout@v4
-
+#
-      - name: Build
+#      - name: Build
-        uses: cross-platform-actions/action@v0.27.0
+#        uses: cross-platform-actions/action@v0.27.0
-        with:
+#        with:
-          operating_system: freebsd
+#          operating_system: freebsd
-          version: '14.2'
+#          version: '14.2'
-          run: |
+#          run: |
-            sudo pkg update
+#            sudo pkg update
-            sudo pkg install -y gmake sdl2 cmake git
+#            sudo pkg install -y gmake sdl2 cmake git
-            cmake -B build
+#            cmake -B build
-            cmake --build build --config Release
+#            cmake --build build --config Release
  ubuntu-22-gcc:
    if: ${{ github.event_name == 'push' || github.event_name == 'pull_request' ||
--- a/README.md
+++ b/README.md
@ -2,15 +2,12 @@
 ![whisper.cpp](https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg)
-[![Actions Status](https://github.com/ggerganov/whisper.cpp/workflows/CI/badge.svg)](https://github.com/ggerganov/whisper.cpp/actions)
+[![Actions Status](https://github.com/ggml-org/whisper.cpp/workflows/CI/badge.svg)](https://github.com/ggml-org/whisper.cpp/actions)
 [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
 [![Conan Center](https://shields.io/conan/v/whisper-cpp)](https://conan.io/center/whisper-cpp)
 [![npm](https://img.shields.io/npm/v/whisper.cpp.svg)](https://www.npmjs.com/package/whisper.cpp/)
-> [!NOTE]
+Stable: [v1.7.5](https://github.com/ggml-org/whisper.cpp/releases/tag/v1.7.5) / [Roadmap](https://github.com/orgs/ggml-org/projects/4/)
 > New maintenance roadmap: https://github.com/ggerganov/whisper.cpp/discussions/2788
 Stable: [v1.7.5](https://github.com/ggerganov/whisper.cpp/releases/tag/v1.7.5) / [Roadmap | F.A.Q.](https://github.com/ggerganov/whisper.cpp/discussions/126)
 High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisper) automatic speech recognition (ASR) model:
@ -26,7 +23,7 @@ High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisp
 - [Efficient GPU support for NVIDIA](#nvidia-gpu-support)
 - [OpenVINO Support](#openvino-support)
 - [Ascend NPU Support](#ascend-npu-support)
- [C-style API](https://github.com/ggerganov/whisper.cpp/blob/master/include/whisper.h)
+- [C-style API](https://github.com/ggml-org/whisper.cpp/blob/master/include/whisper.h)
 Supported platforms:
@ -34,14 +31,14 @@ Supported platforms:
 - [x] [iOS](examples/whisper.objc)
 - [x] [Android](examples/whisper.android)
 - [x] [Java](bindings/java/README.md)
- [x] Linux / [FreeBSD](https://github.com/ggerganov/whisper.cpp/issues/56#issuecomment-1350920264)
+- [x] Linux / [FreeBSD](https://github.com/ggml-org/whisper.cpp/issues/56#issuecomment-1350920264)
 - [x] [WebAssembly](examples/whisper.wasm)
- [x] Windows ([MSVC](https://github.com/ggerganov/whisper.cpp/blob/master/.github/workflows/build.yml#L117-L144) and [MinGW](https://github.com/ggerganov/whisper.cpp/issues/168)]
+- [x] Windows ([MSVC](https://github.com/ggml-org/whisper.cpp/blob/master/.github/workflows/build.yml#L117-L144) and [MinGW](https://github.com/ggml-org/whisper.cpp/issues/168)]
- [x] [Raspberry Pi](https://github.com/ggerganov/whisper.cpp/discussions/166)
+- [x] [Raspberry Pi](https://github.com/ggml-org/whisper.cpp/discussions/166)
- [x] [Docker](https://github.com/ggerganov/whisper.cpp/pkgs/container/whisper.cpp)
+- [x] [Docker](https://github.com/ggml-org/whisper.cpp/pkgs/container/whisper.cpp)
 The entire high-level implementation of the model is contained in [whisper.h](include/whisper.h) and [whisper.cpp](src/whisper.cpp).
-The rest of the code is part of the [`ggml`](https://github.com/ggerganov/ggml) machine learning library.
+The rest of the code is part of the [`ggml`](https://github.com/ggml-org/ggml) machine learning library.
 Having such a lightweight implementation of the model allows to easily integrate it in different platforms and applications.
 As an example, here is a video of running the model on an iPhone 13 device - fully offline, on-device: [whisper.objc](examples/whisper.objc)
@ -54,14 +51,14 @@ https://user-images.githubusercontent.com/1991296/204038393-2f846eae-c255-4099-a
 On Apple Silicon, the inference runs fully on the GPU via Metal:
-https://github.com/ggerganov/whisper.cpp/assets/1991296/c82e8f86-60dc-49f2-b048-d2fdbd6b5225
+https://github.com/ggml-org/whisper.cpp/assets/1991296/c82e8f86-60dc-49f2-b048-d2fdbd6b5225
 ## Quick start
 First clone the repository:
 ```bash
-git clone https://github.com/ggerganov/whisper.cpp.git
+git clone https://github.com/ggml-org/whisper.cpp.git
 ```
 Navigate into the directory:
@ -152,6 +149,7 @@ standard cmake setup with:
 cmake -B build -DGGML_BLAS=1
 cmake --build build --config Release
 ./build/bin/whisper-cli [ .. etc .. ]
 ```
 ## Quantization
@ -225,7 +223,7 @@ speed-up - more than x3 faster compared with CPU-only execution. Here are the in
  The first run on a device is slow, since the ANE service compiles the Core ML model to some device-specific format.
  Next runs are faster.
-For more information about the Core ML implementation please refer to PR [#566](https://github.com/ggerganov/whisper.cpp/pull/566).
+For more information about the Core ML implementation please refer to PR [#566](https://github.com/ggml-org/whisper.cpp/pull/566).
 ## OpenVINO support
@ -310,7 +308,7 @@ This can result in significant speedup in encoder performance. Here are the inst
  The first time run on an OpenVINO device is slow, since the OpenVINO framework will compile the IR (Intermediate Representation) model to a device-specific 'blob'. This device-specific blob will get
  cached for the next run.
-For more information about the OpenVINO implementation please refer to PR [#1037](https://github.com/ggerganov/whisper.cpp/pull/1037).
+For more information about the OpenVINO implementation please refer to PR [#1037](https://github.com/ggml-org/whisper.cpp/pull/1037).
 ## NVIDIA GPU support
@ -324,6 +322,12 @@ cmake -B build -DGGML_CUDA=1
 cmake --build build -j --config Release
 ```
 or for newer NVIDIA GPU's (RTX 5000 series):
 ```
 cmake -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES="86"
 cmake --build build -j --config Release
 ```
 ## Vulkan GPU support
 Cross-vendor solution which allows you to accelerate workload on your GPU.
 First, make sure your graphics card driver provides support for Vulkan API.
@ -377,6 +381,37 @@ Run the inference examples as usual, for example:
 - If you have trouble with Ascend NPU device, please create a issue with **[CANN]** prefix/tag.
 - If you run successfully with your Ascend NPU device, please help update the table `Verified devices`.
 ## FFmpeg support (Linux only)
 If you want to support more audio formats (such as Opus and AAC), you can turn on the `WHISPER_FFMPEG` build flag to enable FFmpeg integration.
 First, you need to install required libraries:
 ```bash
 # Debian/Ubuntu
 sudo apt install libavcodec-dev libavformat-dev libavutil-dev
 # RHEL/Fedora
 sudo dnf install libavcodec-free-devel libavformat-free-devel libavutil-free-devel
 ```
 Then you can build the project as follows:
 ```bash
 cmake -B build -D WHISPER_FFMPEG=yes
 cmake --build build
 ```
 Run the following example to confirm it's working:
 ```bash
 # Convert an audio file to Opus format
 ffmpeg -i samples/jfk.wav jfk.opus
 # Transcribe the audio file
 ./build/bin/whisper-cli --model models/ggml-base.en.bin --file jfk.opus
 ```
 ## Docker
 ### Prerequisites
@ -388,8 +423,8 @@ Run the inference examples as usual, for example:
 We have two Docker images available for this project:
-1. `ghcr.io/ggerganov/whisper.cpp:main`: This image includes the main executable file as well as `curl` and `ffmpeg`. (platforms: `linux/amd64`, `linux/arm64`)
+1. `ghcr.io/ggml-org/whisper.cpp:main`: This image includes the main executable file as well as `curl` and `ffmpeg`. (platforms: `linux/amd64`, `linux/arm64`)
-2. `ghcr.io/ggerganov/whisper.cpp:main-cuda`: Same as `main` but compiled with CUDA support. (platforms: `linux/amd64`)
+2. `ghcr.io/ggml-org/whisper.cpp:main-cuda`: Same as `main` but compiled with CUDA support. (platforms: `linux/amd64`)
 ### Usage
@ -427,7 +462,7 @@ For detailed instructions on how to use Conan, please refer to the [Conan docume
 This is a naive example of performing real-time inference on audio from your microphone.
 The [stream](examples/stream) tool samples the audio every half a second and runs the transcription continuously.
-More info is available in [issue #10](https://github.com/ggerganov/whisper.cpp/issues/10). 
+More info is available in [issue #10](https://github.com/ggml-org/whisper.cpp/issues/10).
 You will need to have [sdl2](https://wiki.libsdl.org/SDL2/Installation) installed for it to work properly.
 ```bash
@ -516,7 +551,7 @@ main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 pr
 ## Speaker segmentation via tinydiarize (experimental)
-More information about this approach is available here: https://github.com/ggerganov/whisper.cpp/pull/1058
+More information about this approach is available here: https://github.com/ggml-org/whisper.cpp/pull/1058
 Sample usage:
@ -580,7 +615,7 @@ https://user-images.githubusercontent.com/1991296/199337538-b7b0c7a3-2753-4a88-a
 ## Video comparison of different models
-Use the [scripts/bench-wts.sh](https://github.com/ggerganov/whisper.cpp/blob/master/scripts/bench-wts.sh) script to generate a video in the following format:
+Use the [scripts/bench-wts.sh](https://github.com/ggml-org/whisper.cpp/blob/master/scripts/bench-wts.sh) script to generate a video in the following format:
 ```bash
 ./scripts/bench-wts.sh samples/jfk.wav
@ -597,7 +632,7 @@ In order to have an objective comparison of the performance of the inference acr
 use the [whisper-bench](examples/bench) tool. The tool simply runs the Encoder part of the model and prints how much time it
 took to execute it. The results are summarized in the following Github issue:
-[Benchmark results](https://github.com/ggerganov/whisper.cpp/issues/89)
+[Benchmark results](https://github.com/ggml-org/whisper.cpp/issues/89)
 Additionally a script to run whisper.cpp with different models and audio files is provided [bench.py](scripts/bench.py).
@ -624,25 +659,24 @@ You can download the converted models using the [models/download-ggml-model.sh](
 or manually from here:
 - https://huggingface.co/ggerganov/whisper.cpp
 - https://ggml.ggerganov.com
 For more details, see the conversion script [models/convert-pt-to-ggml.py](models/convert-pt-to-ggml.py) or [models/README.md](models/README.md).
-## [Bindings](https://github.com/ggerganov/whisper.cpp/discussions/categories/bindings)
+## [Bindings](https://github.com/ggml-org/whisper.cpp/discussions/categories/bindings)
- [x] Rust: [tazz4843/whisper-rs](https://github.com/tazz4843/whisper-rs) | [#310](https://github.com/ggerganov/whisper.cpp/discussions/310)
+- [x] Rust: [tazz4843/whisper-rs](https://github.com/tazz4843/whisper-rs) | [#310](https://github.com/ggml-org/whisper.cpp/discussions/310)
- [x] JavaScript: [bindings/javascript](bindings/javascript) | [#309](https://github.com/ggerganov/whisper.cpp/discussions/309)
+- [x] JavaScript: [bindings/javascript](bindings/javascript) | [#309](https://github.com/ggml-org/whisper.cpp/discussions/309)
  - React Native (iOS / Android): [whisper.rn](https://github.com/mybigday/whisper.rn)
- [x] Go: [bindings/go](bindings/go) | [#312](https://github.com/ggerganov/whisper.cpp/discussions/312)
+- [x] Go: [bindings/go](bindings/go) | [#312](https://github.com/ggml-org/whisper.cpp/discussions/312)
 - [x] Java:
  - [GiviMAD/whisper-jni](https://github.com/GiviMAD/whisper-jni)
- [x] Ruby: [bindings/ruby](bindings/ruby) | [#507](https://github.com/ggerganov/whisper.cpp/discussions/507)
+- [x] Ruby: [bindings/ruby](bindings/ruby) | [#507](https://github.com/ggml-org/whisper.cpp/discussions/507)
- [x] Objective-C / Swift: [ggerganov/whisper.spm](https://github.com/ggerganov/whisper.spm) | [#313](https://github.com/ggerganov/whisper.cpp/discussions/313)
+- [x] Objective-C / Swift: [ggml-org/whisper.spm](https://github.com/ggml-org/whisper.spm) | [#313](https://github.com/ggml-org/whisper.cpp/discussions/313)
  - [exPHAT/SwiftWhisper](https://github.com/exPHAT/SwiftWhisper)
- [x] .NET: | [#422](https://github.com/ggerganov/whisper.cpp/discussions/422)
+- [x] .NET: | [#422](https://github.com/ggml-org/whisper.cpp/discussions/422)
  - [sandrohanea/whisper.net](https://github.com/sandrohanea/whisper.net)
  - [NickDarvey/whisper](https://github.com/NickDarvey/whisper)
- [x] Python: | [#9](https://github.com/ggerganov/whisper.cpp/issues/9)
+- [x] Python: | [#9](https://github.com/ggml-org/whisper.cpp/issues/9)
  - [stlukey/whispercpp.py](https://github.com/stlukey/whispercpp.py) (Cython)
  - [AIWintermuteAI/whispercpp](https://github.com/AIWintermuteAI/whispercpp) (Updated fork of aarnphm/whispercpp)
  - [aarnphm/whispercpp](https://github.com/aarnphm/whispercpp) (Pybind11)
@ -650,6 +684,33 @@ For more details, see the conversion script [models/convert-pt-to-ggml.py](model
 - [x] R: [bnosac/audio.whisper](https://github.com/bnosac/audio.whisper)
 - [x] Unity: [macoron/whisper.unity](https://github.com/Macoron/whisper.unity)
 ## XCFramework
 The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS,
 and macOS. It can be used in Swift projects without the need to compile the
 library from source. For examples:
 ```swift
 // swift-tools-version: 5.10
 // The swift-tools-version declares the minimum version of Swift required to build this package.
 import PackageDescription
 let package = Package(
    name: "Whisper",
    targets: [
        .executableTarget(
            name: "Whisper",
            dependencies: [
                "WhisperFramework"
            ]),
        .binaryTarget(
            name: "WhisperFramework",
            url: "https://github.com/ggml-org/whisper.cpp/releases/download/v1.7.5/whisper-v1.7.5-xcframework.zip",
            checksum: "c7faeb328620d6012e130f3d705c51a6ea6c995605f2df50f6e1ad68c59c6c4a"
        )
    ]
 )
 ```
 ## Examples
 There are various examples of using the library for different projects in the [examples](examples) folder.
@ -668,13 +729,13 @@ Some of the examples are even ported to run in the browser using WebAssembly. Ch
 | [whisper.android](examples/whisper.android)         |                                       | Android mobile application using whisper.cpp                                                                                    |
 | [whisper.nvim](examples/whisper.nvim)               |                                       | Speech-to-text plugin for Neovim                                                                                                |
 | [generate-karaoke.sh](examples/generate-karaoke.sh) |                                       | Helper script to easily [generate a karaoke video](https://youtu.be/uj7hVta4blM) of raw audio capture                           |
-| [livestream.sh](examples/livestream.sh)             |                                       | [Livestream audio transcription](https://github.com/ggerganov/whisper.cpp/issues/185)                                           |
+| [livestream.sh](examples/livestream.sh)             |                                       | [Livestream audio transcription](https://github.com/ggml-org/whisper.cpp/issues/185)                                            |
 | [yt-wsp.sh](examples/yt-wsp.sh)                     |                                       | Download + transcribe and/or translate any VOD [(original)](https://gist.github.com/DaniruKun/96f763ec1a037cc92fe1a059b643b818) |
 | [wchess](examples/wchess)                           | [wchess.wasm](examples/wchess)        | Voice-controlled chess                                                                                                          |
-## [Discussions](https://github.com/ggerganov/whisper.cpp/discussions)
+## [Discussions](https://github.com/ggml-org/whisper.cpp/discussions)
 If you have any kind of feedback about this project feel free to use the Discussions section and open a new topic.
-You can use the [Show and tell](https://github.com/ggerganov/whisper.cpp/discussions/categories/show-and-tell) category
+You can use the [Show and tell](https://github.com/ggml-org/whisper.cpp/discussions/categories/show-and-tell) category
 to share your own projects that use `whisper.cpp`. If you have a question, make sure to check the
-[Frequently asked questions (#126)](https://github.com/ggerganov/whisper.cpp/discussions/126) discussion.
+[Frequently asked questions (#126)](https://github.com/ggml-org/whisper.cpp/discussions/126) discussion.
--- a/bindings/go/README.md
+++ b/bindings/go/README.md
@ -51,7 +51,7 @@ func main() {
 In order to build, you need to have the Go compiler installed. You can get it from [here](https://golang.org/dl/). Run the tests with:
 ```bash
-git clone https://github.com/ggerganov/whisper.cpp.git
+git clone https://github.com/ggml-org/whisper.cpp.git
 cd whisper.cpp/bindings/go
 make test
 ```
@ -98,7 +98,7 @@ The API Documentation:
 Getting help:
-  * Follow the discussion for the go bindings [here](https://github.com/ggerganov/whisper.cpp/discussions/312)
+  * Follow the discussion for the go bindings [here](https://github.com/ggml-org/whisper.cpp/discussions/312)
 ## License
--- a/bindings/go/doc.go
+++ b/bindings/go/doc.go
@ -1,5 +1,5 @@
 /*
-github.com/ggerganov/whisper.cpp/bindings/go
+github.com/ggml-org/whisper.cpp/bindings/go
 provides a speech-to-text service bindings for the Go programming language.
 */
 package whisper
--- a/bindings/java/README.md
+++ b/bindings/java/README.md
@ -52,7 +52,7 @@ public class Example {
 In order to build, you need to have the JDK 8 or higher installed. Run the tests with:
 ```bash
-git clone https://github.com/ggerganov/whisper.cpp.git
+git clone https://github.com/ggml-org/whisper.cpp.git
 cd whisper.cpp/bindings/java
 ./gradlew build
--- a/bindings/ruby/.gitignore
+++ b/bindings/ruby/.gitignore
@ -1,3 +1,6 @@
 LICENSE
 pkg/
 lib/whisper.*
 ext/sources/*
 !ext/sources/CMakeGraphVizOptions.cmake
 ext/mkmf.log
--- a/bindings/ruby/README.md
+++ b/bindings/ruby/README.md
@ -16,6 +16,18 @@ If bundler is not being used to manage dependencies, install the gem by executin
    $ gem install whispercpp
 You can pass build options for whisper.cpp, for instance:
    $ bundle config build.whispercpp --enable-ggml-cuda
 or,
    $ gem install whispercpp -- --enable-ggml-cuda
 See whisper.cpp's [README](https://github.com/ggml-org/whisper.cpp/blob/master/README.md) for available options. You need convert options present the README to Ruby-style options.  
 For boolean options like `GGML_CUDA`, the README says `-DGGML_CUDA=1`. You need strip `-D`, prepend `--enable-` for `1` or `ON` (`--disable-` for `0` or `OFF`) and make it kebab-case: `--enable-ggml-cuda`.  
 For options which require arguments like `CMAKE_CUDA_ARCHITECTURES`, the README says `-DCMAKE_CUDA_ARCHITECTURES="86"`. You need strip `-D`, prepend `--`, make it kebab-case, append `=` and append argument: `--cmake-cuda-architectures="86"`.
 Usage
 -----
@ -228,7 +240,7 @@ The second argument `samples` may be an array, an object with `length` and `each
 Development
 -----------
-    % git clone https://github.com/ggerganov/whisper.cpp.git
+    % git clone https://github.com/ggml-org/whisper.cpp.git
    % cd whisper.cpp/bindings/ruby
    % rake test
@ -241,5 +253,5 @@ License
 The same to [whisper.cpp][].
-[whisper.cpp]: https://github.com/ggerganov/whisper.cpp
+[whisper.cpp]: https://github.com/ggml-org/whisper.cpp
-[models]: https://github.com/ggerganov/whisper.cpp/tree/master/models
+[models]: https://github.com/ggml-org/whisper.cpp/tree/master/models
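The option-naming rule described in the Ruby README hunk above (strip `-D`, kebab-case the name, then use `--enable-`/`--disable-` for boolean values or `--name=value` otherwise) can be sketched as a small function. This is only an illustration of the rule as worded in the README; `cmake_to_ruby_option` is a hypothetical helper and is not part of the gem.

```ruby
# Hypothetical helper, not part of the whispercpp gem: it only mirrors the
# naming rule described in the README text.
BOOLEAN_ON  = %w[1 ON].freeze
BOOLEAN_OFF = %w[0 OFF].freeze

def cmake_to_ruby_option(definition)
  # "-DGGML_CUDA=1" -> name "GGML_CUDA", value "1"
  name, value = definition.delete_prefix("-D").split("=", 2)
  kebab = name.downcase.tr("_", "-")

  if BOOLEAN_ON.include?(value)
    "--enable-#{kebab}"
  elsif BOOLEAN_OFF.include?(value)
    "--disable-#{kebab}"
  else
    # Options that take an argument keep it verbatim after "=".
    "--#{kebab}=#{value}"
  end
end

puts cmake_to_ruby_option("-DGGML_CUDA=1")
puts cmake_to_ruby_option('-DCMAKE_CUDA_ARCHITECTURES="86"')
```

For the two README examples this should print `--enable-ggml-cuda` and `--cmake-cuda-architectures="86"`.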
--- a/bindings/ruby/Rakefile
+++ b/bindings/ruby/Rakefile
@ -3,11 +3,15 @@ require "bundler/gem_tasks"
 require "rake/testtask"
 require_relative "extsources"
 SOURCES_DIR = "ext/sources"
 SOURCES = FileList[]
 EXTSOURCES.each do |src|
  basename = src.pathmap("%f")
-  dest = basename == "LICENSE" ? basename : src.pathmap("%{../..,ext}p")
+  dest = basename == "LICENSE" ? basename
                               : src.pathmap("%{\\.\\./\\.\\.,#{SOURCES_DIR}}p")
                                    .pathmap("%{\\.\\./javascript,#{SOURCES_DIR}/bindings/javascript}p")
  dir = dest.pathmap("%d")
  file src
  directory dir
@ -18,7 +22,6 @@ EXTSOURCES.each do |src|
 end
 CLEAN.include SOURCES
 CLEAN.include FileList["ext/**/*.o", "ext/**/*.metal", "ext/**/*.tmp", "ext/whisper.{so,bundle,dll}"]
 SRC = FileList["ext/*.{c,cpp,h}"]
@ -36,6 +39,20 @@ file "ext/Makefile" => SRC + ["ext/extconf.rb"] + SOURCES do |t|
    ruby "extconf.rb"
  end
 end
 if File.exist? "ext/Makefile"
  task :make_clean do
    cd "ext" do
      sh "make", "clean"
    end
  end
  task clean: :make_clean
  task :make_distclean do
    cd "ext" do
      sh "make", "distclean"
    end
  end
  task clobber: :make_distclean
 end
 file SO_FILE => "ext/Makefile" do |t|
  chdir "ext" do
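The `pathmap` change in the Rakefile hunk above replaces the old `%{../..,ext}p` spec, whose unescaped dots match any character, with escaped patterns that rewrite source paths into `ext/sources`. A minimal sketch of that mapping follows, assuming Rake is installed (it supplies `String#pathmap`). `vendored_path` is a hypothetical name and the two input paths are illustrative.

```ruby
require "rake" # Rake adds String#pathmap

SOURCES_DIR = "ext/sources"

# Hypothetical wrapper around the two chained pathmap calls from the Rakefile.
# Each "%{pattern,replacement}p" spec applies a regexp substitution to the
# full path, so the dots are escaped to match literal "." characters.
def vendored_path(src)
  src.pathmap("%{\\.\\./\\.\\.,#{SOURCES_DIR}}p")
     .pathmap("%{\\.\\./javascript,#{SOURCES_DIR}/bindings/javascript}p")
end

puts vendored_path("../../src/whisper.cpp")
puts vendored_path("../javascript/package.json")
```

Under these assumptions the first path should map to `ext/sources/src/whisper.cpp` and the second to `ext/sources/bindings/javascript/package.json`.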
--- a/bindings/ruby/ext/cpu.mk
+++ b/bindings/ruby/ext/cpu.mk
@ -1,11 +0,0 @@
 ggml/src/ggml-cpu/ggml-cpu-cpp.o: \
 	ggml/src/ggml-cpu/ggml-cpu.cpp \
 	ggml/src/ggml-cpu/unary-ops.cpp \
 	ggml/src/ggml-cpu/binary-ops.cpp \
 	ggml/include/ggml-backend.h \
 	ggml/include/ggml.h \
 	ggml/include/ggml-alloc.h \
 	ggml/src/ggml-backend-impl.h \
 	ggml/include/ggml-cpu.h \
 	ggml/src/ggml-impl.h
 	$(CXX) $(CXXFLAGS)   -c $< -o $@
--- a/bindings/ruby/ext/dependencies.rb
+++ b/bindings/ruby/ext/dependencies.rb
@ -0,0 +1,61 @@
 require "tsort"
 class Dependencies
  def initialize(cmake, options)
    @cmake = cmake
    @options = options
    generate_dot
    @libs = parse_dot
  end
  def to_s
    @libs.join(" ")
  end
  private
  def dot_path
    File.join(__dir__, "build", "whisper.cpp.dot")
  end
  def generate_dot
    system @cmake, "-S", "sources", "-B", "build", "--graphviz", dot_path, "-D", "BUILD_SHARED_LIBS=OFF", @options.to_s, exception: true
  end
  def parse_dot
    static_lib_shape = nil
    nodes = {}
    depends = Hash.new {|h, k| h[k] = []}
    class << depends
      include TSort
      alias tsort_each_node each_key
      def tsort_each_child(node, &block)
        fetch(node, []).each(&block)
      end
    end
    File.open(dot_path).each_line do |line|
      case line
      when /\[\s*label\s*=\s*"Static Library"\s*,\s*shape\s*=\s*(?<shape>\w+)\s*\]/
        static_lib_shape = $~[:shape]
      when /\A\s*"(?<node>\w+)"\s*\[\s*label\s*=\s*"(?<label>\S+)"\s*,\s*shape\s*=\s*(?<shape>\w+)\s*\]\s*;\s*\z/
        node = $~[:node]
        label = $~[:label]
        shape = $~[:shape]
        nodes[node] = [label, shape]
      when /\A\s*"(?<depender>\w+)"\s*->\s*"(?<dependee>\w+)"/
        depender = $~[:depender]
        dependee = $~[:dependee]
        depends[depender] ||= []
        depends[depender] << dependee
      end
    end
    depends.tsort.filter_map {|node|
      label, shape = nodes[node]
      shape == static_lib_shape ? label : nil
    }.collect {|lib| "lib#{lib}.a"}
      .reverse
  end
 end
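`Dependencies#parse_dot` above relies on `TSort` to turn the CMake graphviz graph into a link order: a topological sort emits each library after the libraries it depends on, and reversing that list puts dependers first, which is the order static linkers generally expect. The sketch below shows the same idea on a small hand-written graph. The `DependencyGraph` class and the edge list are illustrative assumptions, not output from CMake.

```ruby
require "tsort"

# Hypothetical stand-in for the graph parsed from the .dot file:
# each key depends on the targets listed in its value.
class DependencyGraph
  include TSort

  def initialize(edges)
    @edges = edges
  end

  def tsort_each_node(&block)
    @edges.each_key(&block)
  end

  def tsort_each_child(node, &block)
    @edges.fetch(node, []).each(&block)
  end
end

edges = {
  "whisper"   => ["ggml"],
  "ggml"      => ["ggml-base", "ggml-cpu"],
  "ggml-cpu"  => ["ggml-base"],
  "ggml-base" => []
}

# tsort lists dependencies before their dependers, so reverse for linking.
link_order = DependencyGraph.new(edges).tsort.reverse.map { |lib| "lib#{lib}.a" }
puts link_order.join(" ")
```

With this example graph the result should be `libwhisper.a libggml.a libggml-cpu.a libggml-base.a`.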
--- a/bindings/ruby/ext/extconf.rb
+++ b/bindings/ruby/ext/extconf.rb
@ -1,210 +1,22 @@
-require 'mkmf'
+require "mkmf"
 require_relative "options"
 require_relative "dependencies"
-# need to use c++ compiler flags
+cmake = find_executable("cmake") || abort
-$CXXFLAGS << ' -std=c++17'
+options = Options.new
 have_library("gomp") rescue nil
 libs = Dependencies.new(cmake, options)
-$LDFLAGS << ' -lstdc++'
+$INCFLAGS << " -Isources/include -Isources/ggml/include -Isources/examples"
 $LOCAL_LIBS << " #{libs}"
 $cleanfiles << " build #{libs}"
-# Set to true when building binary gems
+create_makefile "whisper" do |conf|
-if enable_config('static-stdlib', false)
+  conf << <<~EOF
-  $LDFLAGS << ' -static-libgcc -static-libstdc++'
+    $(TARGET_SO): #{libs}
-end
+    #{libs}: cmake-targets
-
+    cmake-targets:
-if enable_config('march-tune-native', false)
+    #{"\t"}#{cmake} -S sources -B build -D BUILD_SHARED_LIBS=OFF -D CMAKE_ARCHIVE_OUTPUT_DIRECTORY=#{__dir__} -D CMAKE_POSITION_INDEPENDENT_CODE=ON #{options}
-  $CFLAGS << ' -march=native -mtune=native'
+    #{"\t"}#{cmake} --build build --config Release --target common whisper
-  $CXXFLAGS << ' -march=native -mtune=native'
+  EOF
 end
 if ENV['WHISPER_METAL']
  $GGML_METAL ||= true
  $DEPRECATE_WARNING ||= true
 end
 $UNAME_S = `uname -s`.chomp
 $UNAME_P = `uname -p`.chomp
 $UNAME_M = `uname -m`.chomp
 if $UNAME_S == 'Darwin'
  unless ENV['GGML_NO_METAL']
    $GGML_METAL ||= true
  end
  $GGML_NO_OPENMP ||= true
 end
 if $GGML_METAL
  $GGML_METAL_EMBED_LIBRARY = true
 end
 $MK_CPPFLAGS = '-Iggml/include -Iggml/src -Iggml/src/ggml-cpu -Iinclude -Isrc -Iexamples -DGGML_USE_CPU'
 $MK_CFLAGS   = '-std=c11   -fPIC'
 $MK_CXXFLAGS = '-std=c++17 -fPIC'
 $MK_NVCCFLAGS = '-std=c++17'
 $MK_LDFLAGS = ''
 $OBJ_GGML = []
 $OBJ_WHISPER = []
 $OBJ_COMMON = []
 $OBJ_SDL = []
 $MK_CPPFLAGS << ' -D_XOPEN_SOURCE=600'
 if $UNAME_S == 'Linux'
  $MK_CPPFLAGS << ' -D_GNU_SOURCE'
 end
 if $UNAME_S == 'Darwin'
  $MK_CPPFLAGS << ' -D_DARWIN_C_SOURCE'
 end
 if ENV['WHISPER_DEBUG']
  $MK_CFLAGS    << ' -O0 -g'
  $MK_CXXFLAGS  << ' -O0 -g'
  $MK_LDFLAGS   << ' -g'
  $MK_NVCCFLAGS << ' -O0 -g'
 else
  $MK_CPPFLAGS   << ' -DNDEBUG'
  $MK_CFLAGS     << ' -O3'
  $MK_CXXFLAGS   << ' -O3'
  $MK_NVCCFLAGS  << ' -O3'
 end
 $WARN_FLAGS =
  ' -Wall' <<
  ' -Wextra' <<
  ' -Wpedantic' <<
  ' -Wcast-qual' <<
  ' -Wno-unused-function'
 $MK_CFLAGS <<
  $WARN_FLAGS <<
  ' -Wshadow' <<
  ' -Wstrict-prototypes' <<
  ' -Wpointer-arith' <<
  ' -Wmissing-prototypes' <<
  ' -Werror=implicit-int' <<
  ' -Werror=implicit-function-declaration'
 $MK_CXXFLAGS <<
  $WARN_FLAGS <<
  ' -Wmissing-declarations' <<
  ' -Wmissing-noreturn'
 unless `#{cc_command} #{$LDFLAGS} -Wl,-v 2>&1`.chomp.include? 'dyld-1015.7'
  $MK_CPPFLAGS << ' -DHAVE_BUGGY_APPLE_LINKER'
 end
 if %w[Linux Darwin FreeBSD NetBSD OpenBSD Haiku].include? $UNAME_S
  $MK_CFLAGS   << ' -pthread'
  $MK_CXXFLAGS << ' -pthread'
 end
 unless $_WIN32
  $DSO_EXT = '.so'
 else
  $DSO_EXT = '.dll'
 end
 unless ENV['RISCV']
  if %w[x86_64 i686 amd64].include? $UNAME_M
    $HOST_CXXFLAGS ||= ''
    $MK_CFLAGS     << ' -march=native -mtune=native'
    $HOST_CXXFLAGS << ' -march=native -mtune=native'
  end
 else
  $MK_CFLAGS   << ' -march=rv64gcv -mabi=lp64d'
  $MK_CXXFLAGS << ' -march=rv64gcv -mabi=lp64d'
 end
 unless ENV['GGML_NO_ACCELERATE']
  if $UNAME_S == 'Darwin'
    $MK_CPPFLAGS << ' -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DGGML_BLAS_USE_ACCELERATE'
    $MK_CPPFLAGS << ' -DACCELERATE_NEW_LAPACK'
    $MK_CPPFLAGS << ' -DACCELERATE_LAPACK_ILP64'
    $MK_LDFLAGS  << ' -framework Accelerate'
    $OBJ_GGML    << 'ggml/src/ggml-blas/ggml-blas.o'
  end
 end
 if ENV['GGML_OPENBLAS']
  $MK_CPPFLAGS << " -DGGML_USE_BLAS #{`pkg-config --cflags-only-I openblas`.chomp}"
  $MK_CFLAGS   << " #{`pkg-config --cflags-only-other openblas)`.chomp}"
  $MK_LDFLAGS  << " #{`pkg-config --libs openblas`}"
  $OBJ_GGML    << 'ggml/src/ggml-blas/ggml-blas.o'
 end
 if ENV['GGML_OPENBLAS64']
  $MK_CPPFLAGS << " -DGGML_USE_BLAS #{`pkg-config --cflags-only-I openblas64`.chomp}"
  $MK_CFLAGS   << " #{`pkg-config --cflags-only-other openblas64`.chomp}"
  $MK_LDFLAGS  << " #{`pkg-config --libs openblas64`.chomp}"
  $OBJ_GGML    << 'ggml/src/ggml-blas/ggml-blas.o'
 end
 if $GGML_METAL
  $MK_CPPFLAGS << ' -DGGML_USE_METAL'
  $MK_LDFLAGS  << ' -framework Foundation -framework Metal -framework MetalKit'
  $OBJ_GGML    << 'ggml/src/ggml-metal/ggml-metal.o'
  if ENV['GGML_METAL_NDEBUG']
    $MK_CPPFLAGS << ' -DGGML_METAL_NDEBUG'
  end
  if $GGML_METAL_EMBED_LIBRARY
    $MK_CPPFLAGS << ' -DGGML_METAL_EMBED_LIBRARY'
    $OBJ_GGML    << 'ggml/src/ggml-metal/ggml-metal-embed.o'
  end
 end
 $OBJ_GGML <<
  'ggml/src/ggml.o' <<
  'ggml/src/ggml-alloc.o' <<
  'ggml/src/ggml-backend.o' <<
  'ggml/src/ggml-backend-reg.o' <<
  'ggml/src/ggml-opt.o' <<
  'ggml/src/ggml-quants.o' <<
  'ggml/src/ggml-threading.o' <<
  'ggml/src/ggml-cpu/ggml-cpu.o' <<
  'ggml/src/ggml-cpu/ggml-cpu-cpp.o' <<
  'ggml/src/ggml-cpu/ggml-cpu-aarch64.o' <<
  'ggml/src/ggml-cpu/ggml-cpu-hbm.o' <<
  'ggml/src/ggml-cpu/ggml-cpu-quants.o' <<
  'ggml/src/ggml-cpu/ggml-cpu-traits.o' <<
  'ggml/src/ggml-cpu/unary-ops.o' <<
  'ggml/src/ggml-cpu/binary-ops.o'
 $OBJ_WHISPER <<
  'src/whisper.o' <<
  'examples/common.o' <<
  'examples/common-whisper.o'
 $objs = $OBJ_GGML + $OBJ_WHISPER + $OBJ_COMMON + $OBJ_SDL
 $objs <<
  "ruby_whisper.o" <<
  "ruby_whisper_context.o" <<
  "ruby_whisper_transcribe.o" <<
  "ruby_whisper_params.o" <<
  "ruby_whisper_error.o" <<
  "ruby_whisper_segment.o" <<
  "ruby_whisper_model.o"
 $CPPFLAGS  = "#{$MK_CPPFLAGS} #{$CPPFLAGS}"
 $CFLAGS    = "#{$CPPFLAGS} #{$MK_CFLAGS} #{$GF_CFLAGS} #{$CFLAGS}"
 $BASE_CXXFLAGS = "#{$MK_CXXFLAGS} #{$CXXFLAGS}"
 $CXXFLAGS  = "#{$BASE_CXXFLAGS} #{$HOST_CXXFLAGS} #{$GF_CXXFLAGS} #{$CPPFLAGS}"
 $NVCCFLAGS = "#{$MK_NVCCFLAGS} #{$NVCCFLAGS}"
 $LDFLAGS   = "#{$MK_LDFLAGS} #{$LDFLAGS}"
 create_makefile('whisper')
 File.open 'Makefile', 'a' do |file|
  file.puts 'include scripts/get-flags.mk'
  file.puts 'include cpu.mk'
  if $GGML_METAL
    file.puts 'include metal.mk'
    if $GGML_METAL_EMBED_LIBRARY
      file.puts 'include metal-embed.mk'
    end
  end
 end
--- a/bindings/ruby/ext/metal-embed.mk
+++ b/bindings/ruby/ext/metal-embed.mk
@ -1,17 +0,0 @@
 ggml/src/ggml-metal/ggml-metal-embed.o: \
 	ggml/src/ggml-metal/ggml-metal.metal \
 	ggml/src/ggml-metal/ggml-metal-impl.h \
 	ggml/src/ggml-common.h
 	@echo "Embedding Metal library"
 	@sed -e '/__embed_ggml-common.h__/r      ggml/src/ggml-common.h'                -e '/__embed_ggml-common.h__/d'      < ggml/src/ggml-metal/ggml-metal.metal           > ggml/src/ggml-metal/ggml-metal-embed.metal.tmp
 	@sed -e '/#include "ggml-metal-impl.h"/r ggml/src/ggml-metal/ggml-metal-impl.h' -e '/#include "ggml-metal-impl.h"/d' < ggml/src/ggml-metal/ggml-metal-embed.metal.tmp > ggml/src/ggml-metal/ggml-metal-embed.metal
 	$(eval TEMP_ASSEMBLY=$(shell mktemp -d))
 	@echo ".section __DATA, __ggml_metallib"                       >  $(TEMP_ASSEMBLY)/ggml-metal-embed.s
 	@echo ".globl _ggml_metallib_start"                            >> $(TEMP_ASSEMBLY)/ggml-metal-embed.s
 	@echo "_ggml_metallib_start:"                                  >> $(TEMP_ASSEMBLY)/ggml-metal-embed.s
 	@echo ".incbin \"ggml/src/ggml-metal/ggml-metal-embed.metal\"" >> $(TEMP_ASSEMBLY)/ggml-metal-embed.s
 	@echo ".globl _ggml_metallib_end"                              >> $(TEMP_ASSEMBLY)/ggml-metal-embed.s
 	@echo "_ggml_metallib_end:"                                    >> $(TEMP_ASSEMBLY)/ggml-metal-embed.s
 	$(CC) $(CFLAGS) -c $(TEMP_ASSEMBLY)/ggml-metal-embed.s -o $@
 	@rm -f ${TEMP_ASSEMBLY}/ggml-metal-embed.s
 	@rmdir ${TEMP_ASSEMBLY}
--- a/bindings/ruby/ext/metal.mk
+++ b/bindings/ruby/ext/metal.mk
@ -1,6 +0,0 @@
 ggml/src/ggml-metal/ggml-metal.o: \
 	ggml/src/ggml-metal/ggml-metal.m \
 	ggml/src/ggml-metal/ggml-metal-impl.h \
 	ggml/include/ggml-metal.h \
 	ggml/include/ggml.h
 	$(CC) $(CFLAGS) -c $< -o $@
--- a/bindings/ruby/ext/options.rb
+++ b/bindings/ruby/ext/options.rb
@ -0,0 +1,219 @@
 class Options
  def initialize
    @options = {}
    @pending_options = []
    @ignored_options = []
    configure
  end
  def help
    @options
      .collect_concat {|name, (type, value)|
        option = option_name(name)
        if type == :bool
          ["--enable-#{option}", "--disable-#{option}"]
        else
          "--#{option}=#{type.upcase}"
        end
      }
      .join($/)
  end
  def to_s
    @options
      .reject {|name, (type, value)| value.nil?}
      .collect {|name, (type, value)| "-D #{name}=#{value == true ? "ON" : value == false ? "OFF" : value.shellescape}"}
      .join(" ")
  end
  def cmake_options
    return @cmake_options if @cmake_options
    output = nil
    Dir.chdir __dir__ do
      output = `cmake -S sources -B build -L`
    end
    started = false
    @cmake_options = output.lines.filter_map {|line|
      if line.chomp == "-- Cache values"
        started = true
        next
      end
      next unless started
      option, value = line.chomp.split("=", 2)
      name, type = option.split(":", 2)
      [name, type, value]
    }
  end
  def missing_options
    cmake_options.collect {|name, type, value| name} -
      @options.keys - @pending_options - @ignored_options
  end
  def extra_options
    @options.keys + @pending_options - @ignored_options -
      cmake_options.collect {|name, type, value| name}
  end
  private
  def configure
    filepath "ACCELERATE_FRAMEWORK"
    ignored "BUILD_SHARED_LIBS"
    ignored "BUILD_TESTING"
    ignored "CMAKE_BUILD_TYPE"
    ignored "CMAKE_INSTALL_PREFIX"
    string "CMAKE_OSX_ARCHITECTURES"
    ignored "CMAKE_OSX_DEPLOYMENT_TARGET"
    string "CMAKE_OSX_SYSROOT"
    filepath "FOUNDATION_LIBRARY"
    bool "GGML_ACCELERATE"
    bool "GGML_ALL_WARNINGS_3RD_PARTY"
    bool "GGML_AMX_BF16"
    bool "GGML_AMX_INT8"
    bool "GGML_AMX_TILE"
    bool "GGML_AVX"
    bool "GGML_AVX2"
    bool "GGML_AVX512"
    bool "GGML_AVX512_BF16"
    bool "GGML_AVX512_VBMI"
    bool "GGML_AVX512_VNNI"
    bool "GGML_AVX_VNNI"
    ignored "GGML_BACKEND_DL"
    ignored "GGML_BIN_INSTALL_DIR"
    bool "GGML_BLAS"
    string "GGML_BLAS_VENDOR"
    bool "GGML_BMI2"
    ignored "GGML_BUILD_EXAMPLES"
    ignored "GGML_BUILD_TESTS"
    filepath "GGML_CCACHE_FOUND"
    bool "GGML_CPU"
    bool "GGML_CPU_AARCH64"
    ignored "GGML_CPU_ALL_VARIANTS"
    string "GGML_CPU_ARM_ARCH"
    bool "GGML_CPU_HBM"
    bool "GGML_CPU_KLEIDIAI"
    string "GGML_CPU_POWERPC_CPUTYPE"
    bool "GGML_CUDA"
    string "GGML_CUDA_COMPRESSION_MODE"
    bool "GGML_CUDA_F16"
    bool "GGML_CUDA_FA"
    bool "GGML_CUDA_FA_ALL_QUANTS"
    bool "GGML_CUDA_FORCE_CUBLAS"
    bool "GGML_CUDA_FORCE_MMQ"
    ignored "GGML_CUDA_GRAPHS"
    bool "GGML_CUDA_NO_PEER_COPY"
    bool "GGML_CUDA_NO_VMM"
    string "GGML_CUDA_PEER_MAX_BATCH_SIZE"
    bool "GGML_F16C"
    bool "GGML_FMA"
    bool "GGML_GPROF"
    bool "GGML_HIP"
    bool "GGML_HIP_GRAPHS"
    bool "GGML_HIP_NO_VMM"
    bool "GGML_HIP_ROCWMMA_FATTN"
    bool "GGML_HIP_UMA"
    ignored "GGML_INCLUDE_INSTALL_DIR"
    bool "GGML_KOMPUTE"
    bool "GGML_LASX"
    ignored "GGML_LIB_INSTALL_DIR"
    ignored "GGML_LLAMAFILE"
    bool "GGML_LSX"
    bool "GGML_LTO"
    bool "GGML_METAL"
    bool "GGML_METAL_EMBED_LIBRARY"
    string "GGML_METAL_MACOSX_VERSION_MIN"
    bool "GGML_METAL_NDEBUG"
    bool "GGML_METAL_SHADER_DEBUG"
    string "GGML_METAL_STD"
    bool "GGML_METAL_USE_BF16"
    bool "GGML_MUSA"
    bool "GGML_NATIVE"
    bool "GGML_OPENCL"
    bool "GGML_OPENCL_EMBED_KERNELS"
    bool "GGML_OPENCL_PROFILING"
    string "GGML_OPENCL_TARGET_VERSION"
    bool "GGML_OPENCL_USE_ADRENO_KERNELS"
    bool "GGML_OPENMP"
    bool "GGML_RPC"
    bool "GGML_RVV"
    bool "GGML_RV_ZFH"
    pending "GGML_SCCACHE_FOUND"
    string "GGML_SCHED_MAX_COPIES"
    ignored "GGML_STATIC"
    bool "GGML_SYCL"
    string "GGML_SYCL_DEVICE_ARCH"
    bool "GGML_SYCL_F16"
    bool "GGML_SYCL_GRAPH"
    string "GGML_SYCL_TARGET"
    bool "GGML_VULKAN"
    bool "GGML_VULKAN_CHECK_RESULTS"
    bool "GGML_VULKAN_DEBUG"
    bool "GGML_VULKAN_MEMORY_DEBUG"
    bool "GGML_VULKAN_PERF"
    ignored "GGML_VULKAN_RUN_TESTS"
    filepath "GGML_VULKAN_SHADERS_GEN_TOOLCHAIN"
    bool "GGML_VULKAN_SHADER_DEBUG_INFO"
    pending "GGML_VULKAN_VALIDATE"
    bool "GGML_VXE"
    filepath "GIT_EXE"
    filepath "MATH_LIBRARY"
    filepath "METALKIT_FRAMEWORK"
    filepath "METAL_FRAMEWORK"
    bool "WHISPER_ALL_WARNINGS"
    bool "WHISPER_ALL_WARNINGS_3RD_PARTY"
    ignored "WHISPER_BIN_INSTALL_DIR"
    ignored "WHISPER_BUILD_EXAMPLES"
    ignored "WHISPER_BUILD_SERVER"
    ignored "WHISPER_BUILD_TESTS"
    bool "WHISPER_CCACHE"
    bool "WHISPER_COREML"
    bool "WHISPER_COREML_ALLOW_FALLBACK"
    ignored "WHISPER_CURL"
    bool "WHISPER_FATAL_WARNINGS"
    ignored "WHISPER_FFMPEG"
    ignored "WHISPER_INCLUDE_INSTALL_DIR"
    ignored "WHISPER_LIB_INSTALL_DIR"
    bool "WHISPER_OPENVINO"
    bool "WHISPER_SANITIZE_ADDRESS"
    bool "WHISPER_SANITIZE_THREAD"
    bool "WHISPER_SANITIZE_UNDEFINED"
    ignored "WHISPER_SDL2"
    pending "WHISPER_USE_SYSTEM_GGML"
  end
  def option_name(name)
    name.downcase.gsub("_", "-")
  end
  def bool(name)
    option = option_name(name)
    value = enable_config(option)
    @options[name] = [:bool, value]
  end
  def string(name, type=:string)
    option = "--#{option_name(name)}"
    value = arg_config(option)
    raise "String expected for #{option}" if value == true || value&.empty?
    @options[name] = [type, value]
  end
  def path(name)
    string(name, :path)
  end
  def filepath(name)
    string(name, :filepath)
  end
  def pending(name)
    @pending_options << name
  end
  def ignored(name)
    @ignored_options << name
  end
 end
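
The `Options` class above reads CMake cache entries from `cmake -L` output and renders the configured values as `-D NAME=VALUE` arguments. The following is a minimal standalone sketch of those two steps, not the class itself: `parse_cache_values` and `cmake_flags` are hypothetical helper names, the sample output is made up for illustration, and the `include?("=")` guard is an added assumption that is not present in the original.

```ruby
require "shellwords"

# Parse `cmake -L`-style output into [name, type, value] triples.
# Only lines after the "-- Cache values" marker are considered.
# (Hypothetical helper; the "=" guard is an addition for robustness.)
def parse_cache_values(output)
  started = false
  output.lines.filter_map do |line|
    line = line.chomp
    if line == "-- Cache values"
      started = true
      next
    end
    next unless started && line.include?("=")
    option, value = line.split("=", 2)
    name, type = option.split(":", 2)
    [name, type, value]
  end
end

# Render a {name => value} hash as `-D NAME=VALUE` arguments.
# Booleans map to ON/OFF, unset (nil) options are skipped,
# and other values are shell-escaped.
def cmake_flags(options)
  options
    .reject { |_name, value| value.nil? }
    .map do |name, value|
      rendered = case value
                 when true then "ON"
                 when false then "OFF"
                 else value.to_s.shellescape
                 end
      "-D #{name}=#{rendered}"
    end
    .join(" ")
end

# Made-up sample output, for illustration only.
sample = <<~OUT
  -- Configuring done
  -- Cache values
  GGML_METAL:BOOL=ON
  GGML_BLAS_VENDOR:STRING=Generic
OUT

p parse_cache_values(sample)
puts cmake_flags("GGML_METAL" => true, "GGML_CUDA" => nil, "GGML_NATIVE" => false)
```

Comparing the parsed names against the declared ones is what `missing_options` and `extra_options` rely on, so a new upstream CMake option shows up as a test failure rather than being silently ignored.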
--- a/bindings/ruby/ext/ruby_whisper_params.c
+++ b/bindings/ruby/ext/ruby_whisper_params.c
@ -918,7 +918,7 @@ ruby_whisper_params_initialize(int argc, VALUE *argv, VALUE self)
    return self;
  }
-  rb_get_kwargs(kw_hash, &param_names, 0, RUBY_WHISPER_PARAMS_PARAM_NAMES_COUNT, &values);
+  rb_get_kwargs(kw_hash, param_names, 0, RUBY_WHISPER_PARAMS_PARAM_NAMES_COUNT, values);
  Data_Get_Struct(self, ruby_whisper_params, rwp);
  for (i = 0; i < RUBY_WHISPER_PARAMS_PARAM_NAMES_COUNT; i++) {
--- a/bindings/ruby/ext/sources/CMakeGraphVizOptions.cmake
+++ b/bindings/ruby/ext/sources/CMakeGraphVizOptions.cmake
@ -0,0 +1,8 @@
 set(GRAPHVIZ_EXECUTABLES FALSE)
 set(GRAPHVIZ_STATIC_LIBS TRUE)
 set(GRAPHVIZ_SHARED_LIBS FALSE)
 set(GRAPHVIZ_MODULE_LIBS FALSE)
 set(GRAPHVIZ_INTERFACE_LIBS FALSE)
 set(GRAPHVIZ_OBJECT_LIBS FALSE)
 set(GRAPHVIZ_UNKNOWN_LIBS FALSE)
 set(GRAPHVIZ_GENERATE_DEPENDERS FALSE)
--- a/bindings/ruby/extsources.rb
+++ b/bindings/ruby/extsources.rb
@ -1,6 +1,34 @@
-require "yaml"
+ignored_dirs = %w[
  .devops
  examples/wchess/wchess.wasm
  examples/whisper.android
  examples/whisper.android.java
  examples/whisper.objc
  examples/whisper.swiftui
  grammars
  models
  samples
  scripts
 ]
 ignored_files = %w[
  AUTHORS
  Makefile
  README.md
  README_sycl.md
  .gitignore
  .gitmodules
  whisper.nvim
  twitch.sh
  yt-wsp.sh
 ]
-sources = `git ls-files -z ../..`.split("\x0")
+EXTSOURCES =
-paths = YAML.load_file("../../.github/workflows/bindings-ruby.yml")[true]["push"]["paths"]
+  `git ls-files -z ../..`.split("\x0")
-paths.delete "bindings/ruby/**"
+    .select {|file|
-EXTSOURCES = (Dir.glob(paths, base: "../..").collect {|path| "../../#{path}"} << "../../LICENSE") & sources
+      basename = File.basename(file)
      ignored_dirs.all? {|dir| !file.start_with?("../../#{dir}")} &&
        !ignored_files.include?(basename) &&
        (file.start_with?("../..") || file.start_with?("../javascript")) &&
        (!file.start_with?("../../.github/") || basename == "bindings-ruby.yml")
    }
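
To make the new filter easier to reason about, here is a small sketch that applies the same four conditions to a hand-written file list instead of `git ls-files`. The shortened ignore lists, the `packaged?` helper name, and the sample paths are assumptions for illustration only.

```ruby
# Shortened ignore lists (assumed subset of the real ones).
IGNORED_DIRS = %w[models samples]
IGNORED_FILES = %w[README.md Makefile]

# Hypothetical predicate mirroring the select block in extsources.rb.
def packaged?(file)
  basename = File.basename(file)
  IGNORED_DIRS.none? { |dir| file.start_with?("../../#{dir}") } &&
    !IGNORED_FILES.include?(basename) &&
    (file.start_with?("../..") || file.start_with?("../javascript")) &&
    (!file.start_with?("../../.github/") || basename == "bindings-ruby.yml")
end

# Illustrative paths, relative to bindings/ruby.
files = %w[
  ../../src/whisper.cpp
  ../../models/download.sh
  ../../README.md
  ../../.github/workflows/build.yml
  ../../.github/workflows/bindings-ruby.yml
  ../javascript/package.json
  ext/extconf.rb
]

files.each { |file| puts "#{packaged?(file) ? 'keep' : 'skip'} #{file}" }
```

With these inputs, only the source file, the Ruby workflow file, and the JavaScript binding file are kept.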
--- a/bindings/ruby/lib/whisper/model/uri.rb
+++ b/bindings/ruby/lib/whisper/model/uri.rb
@ -34,7 +34,7 @@ module Whisper
               when /darwin/
                 Pathname(Dir.home)/"Library/Caches"
               else
-                 ENV.key?("XDG_CACHE_HOME") ? ENV["XDG_CACHE_HOME"] : Pathname(Dir.home)/".cache"
+                 ENV.key?("XDG_CACHE_HOME") ? Pathname(ENV["XDG_CACHE_HOME"]) : Pathname(Dir.home)/".cache"
               end
        base/"whisper.cpp"
      end
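
The reason for wrapping `ENV["XDG_CACHE_HOME"]` in `Pathname` is that the following `base/"whisper.cpp"` relies on `Pathname#/`, which `String` does not define. A minimal sketch of the fixed branch, using a hypothetical `cache_base` helper that takes the environment and home directory as arguments so it can be tried without touching the real ones:

```ruby
require "pathname"

# Hypothetical helper mirroring the non-macOS branch:
# prefer XDG_CACHE_HOME, otherwise fall back to ~/.cache.
def cache_base(env = ENV, home = Dir.home)
  base = env.key?("XDG_CACHE_HOME") ? Pathname(env["XDG_CACHE_HOME"]) : Pathname(home) / ".cache"
  base / "whisper.cpp"
end

puts cache_base({ "XDG_CACHE_HOME" => "/tmp/cache" }, "/home/user")
puts cache_base({}, "/home/user")

# Without the Pathname wrapper, joining a plain String fails:
begin
  "/tmp/cache" / "whisper.cpp"
rescue NoMethodError => e
  puts "String cannot be joined with /: #{e.class}"
end
```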
--- a/bindings/ruby/sig/whisper.rbs
+++ b/bindings/ruby/sig/whisper.rbs
@ -23,9 +23,20 @@ module Whisper
  def self.log_set: (log_callback, Object? user_data) -> log_callback
  class Context
-    def self.new: (string | _ToPath | ::URI::HTTP) -> instance
+    def self.new: (path | ::URI::HTTP) -> instance
    # Transcribes a single file.
    # Can yield results to a block.
    #
    #   params = Whisper::Params.new
    #   params.duration = 60_000
    #   whisper.transcribe "path/to/audio.wav", params do |text|
    #     puts text
    #   end
    #
    def transcribe: (string, Params) -> self
                  | (string, Params) { (String) -> void } -> self
    def model_n_vocab: () -> Integer
    def model_n_audio_ctx: () -> Integer
    def model_n_audio_state: () -> Integer
@ -34,19 +45,72 @@ module Whisper
    def model_n_mels: () -> Integer
    def model_ftype: () -> Integer
    def model_type: () -> String
    # Yields each Whisper::Segment:
    #
    #   whisper.transcribe("path/to/audio.wav", params)
    #   whisper.each_segment do |segment|
    #     puts segment.text
    #   end
    #
    # Returns an Enumerator if no block given:
    #
    #   whisper.transcribe("path/to/audio.wav", params)
    #   enum = whisper.each_segment
    #   enum.to_a # => [#<Whisper::Segment>, ...]
    #
    def each_segment: { (Segment) -> void } -> void
                    | () -> Enumerator[Segment]
    def model: () -> Model
    def full_get_segment: (Integer nth) -> Segment
    def full_n_segments: () -> Integer
    # Language ID, which can be converted to string by Whisper.lang_str and Whisper.lang_str_full.
    #
    def full_lang_id: () -> Integer
    # Start time of a segment indexed by +segment_index+ in centiseconds (units of 10 milliseconds).
    #
    #   full_get_segment_t0(3) # => 1668 (16680 ms)
    #
    def full_get_segment_t0: (Integer) -> Integer
    # End time of a segment indexed by +segment_index+ in centiseconds (units of 10 milliseconds).
    #
    #   full_get_segment_t1(3) # => 1668 (16680 ms)
    #
    def full_get_segment_t1: (Integer) -> Integer
    # Whether the next segment indexed by +segment_index+ is predicted as a speaker turn.
    #
    #   full_get_segment_speaker_turn_next(3) # => true
    #
    def full_get_segment_speaker_turn_next: (Integer) -> (true | false)
    # Text of a segment indexed by +segment_index+.
    #
    #   full_get_segment_text(3) # => "ask not what your country can do for you, ..."
    #
    def full_get_segment_text: (Integer) -> String
    def full_get_segment_no_speech_prob: (Integer) -> Float
    # Run the entire model: PCM -> log mel spectrogram -> encoder -> decoder -> text
    # Not thread safe for same context
    # Uses the specified decoding strategy to obtain the text.
    #
    # The second argument +samples+ must be an array of samples, respond to :length, or be a MemoryView of an array of float. It must be 32 bit float PCM audio data.
    #
    def full: (Params, Array[Float] samples, ?Integer n_samples) -> self
            | (Params, _Samples, ?Integer n_samples) -> self
    # Split the input audio in chunks and process each chunk separately using whisper_full_with_state()
    # Result is stored in the default state of the context
    # Not thread safe if executed in parallel on the same context.
    # It seems this approach can offer some speedup in some cases.
    # However, the transcription accuracy can be worse at the beginning and end of each chunk.
    #
    def full_parallel: (Params, Array[Float], ?Integer n_samples) -> self
                     | (Params, _Samples, ?Integer n_samples) -> self
                     | (Params, _Samples, ?Integer? n_samples, Integer n_processors) -> self
@ -85,68 +149,202 @@ module Whisper
      ?abort_callback: abort_callback,
      ?abort_callback_user_data: Object
    ) -> instance
    # params.language = "auto" | "en", etc...
    #
    def language=: (String) -> String # TODO: Enumerate lang names
    def language: () -> String
    def translate=: (boolish) -> boolish
    def translate: () -> (true | false)
    def no_context=: (boolish) -> boolish
    # If true, does not use past transcription (if any) as initial prompt for the decoder.
    #
    def no_context: () -> (true | false)
    def single_segment=: (boolish) -> boolish
    # If true, forces single segment output (useful for streaming).
    #
    def single_segment: () -> (true | false)
    def print_special=: (boolish) -> boolish
    # If true, prints special tokens (e.g. <SOT>, <EOT>, <BEG>, etc.).
    #
    def print_special: () -> (true | false)
    def print_progress=: (boolish) -> boolish
    # If true, prints progress information.
    #
    def print_progress: () -> (true | false)
    def print_realtime=: (boolish) -> boolish
    # If true, prints results from within whisper.cpp. (avoid it, use callback instead)
    #
    def print_realtime: () -> (true | false)
    # If true, prints timestamps for each text segment when printing realtime.
    #
    def print_timestamps=: (boolish) -> boolish
    def print_timestamps: () -> (true | false)
    def suppress_blank=: (boolish) -> boolish
    # If true, suppresses blank outputs.
    #
    def suppress_blank: () -> (true | false)
    def suppress_nst=: (boolish) -> boolish
    # If true, suppresses non-speech-tokens.
    #
    def suppress_nst: () -> (true | false)
    def token_timestamps=: (boolish) -> boolish
    # If true, enables token-level timestamps.
    #
    def token_timestamps: () -> (true | false)
    def split_on_word=: (boolish) -> boolish
    # If true, split on word rather than on token (when used with max_len).
    #
    def split_on_word: () -> (true | false)
    def initial_prompt=: (_ToS) -> _ToS
    # Tokens to provide to the whisper decoder as initial prompt
    # these are prepended to any existing text context from a previous call
    # use whisper_tokenize() to convert text to tokens.
    # Maximum of whisper_n_text_ctx()/2 tokens are used (typically 224).
    #
    def initial_prompt: () -> (String | nil)
    def diarize=: (boolish) -> boolish
    # If true, enables diarization.
    #
    def diarize: () -> (true | false)
    def offset=: (Integer) -> Integer
    # Start offset in ms.
    #
    def offset: () -> Integer
    def duration=: (Integer) -> Integer
    # Audio duration to process in ms.
    #
    def duration: () -> Integer
    def max_text_tokens=: (Integer) -> Integer
    # Max tokens to use from past text as prompt for the decoder.
    #
    def max_text_tokens: () -> Integer
    def temperature=: (Float) -> Float
    def temperature: () -> Float
    def max_initial_ts=: (Float) -> Float
    # See https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/decoding.py#L97
    #
    def max_initial_ts: () -> Float
    def length_penalty=: (Float) -> Float
    def length_penalty: () -> Float
    def temperature_inc=: (Float) -> Float
    def temperature_inc: () -> Float
    def entropy_thold=: (Float) -> Float
    # Similar to OpenAI's "compression_ratio_threshold"
    #
    def entropy_thold: () -> Float
    def logprob_thold=: (Float) -> Float
    def logprob_thold: () -> Float
    def no_speech_thold=: (Float) -> Float
    def no_speech_thold: () -> Float
    # Sets new segment callback, called for every newly generated text segment.
    #
    #   params.new_segment_callback = ->(context, _, n_new, user_data) {
    #     # ...
    #   }
    #
    def new_segment_callback=: (new_segment_callback) -> new_segment_callback
    def new_segment_callback: () -> (new_segment_callback | nil)
    # Sets user data passed to the last argument of new segment callback.
    #
    def new_segment_callback_user_data=: (Object) -> Object
    def new_segment_callback_user_data: () -> Object
    # Sets progress callback, called on each progress update.
    #
    #   params.progress_callback = ->(context, _, progress, user_data) {
    #     # ...
    #   }
    #
    # +progress+ is an Integer between 0 and 100.
    #
    def progress_callback=: (progress_callback) -> progress_callback
    def progress_callback: () -> (progress_callback | nil)
    # Sets user data passed to the last argument of progress callback.
    #
    def progress_callback_user_data=: (Object) -> Object
    def progress_callback_user_data: () -> Object
    # Sets abort callback, called to check if the process should be aborted.
    #
    #   params.abort_callback = ->(user_data) {
    #     # ...
    #   }
    #
    #
    def abort_callback=: (abort_callback) -> abort_callback
    def abort_callback: () -> (abort_callback | nil)
    # Sets user data passed to the last argument of abort callback.
    #
    def abort_callback_user_data=: (Object) -> Object
    def abort_callback_user_data: () -> Object
    # Hook called on new segment. Yields each Whisper::Segment.
    #
    #   whisper.on_new_segment do |segment|
    #     # ...
    #   end
    #
    def on_new_segment: { (Segment) -> void } -> void
    # Hook called on progress update. Yields each progress Integer between 0 and 100.
    #
    def on_progress: { (Integer progress) -> void } -> void
    # Calls the block to determine whether to abort. Return +true+ when you want to abort.
    #
    #   params.abort_on do
    #     if some_condition
    #       true # abort
    #     else
    #       false # continue
    #     end
    #   end
    #
    def abort_on: { (Object user_data) -> boolish } -> void
  end
@ -167,16 +365,24 @@ module Whisper
    def type: () -> String
    class URI
-      def self.new: (string | ::URI::HTTP) -> self
+      def self.new: (string | ::URI::HTTP) -> instance
      def to_path: -> String
      def clear_cache: -> void
    end
  end
  class Segment
    # Start time in milliseconds.
    #
    def start_time: () -> Integer
    # End time in milliseconds.
    #
    def end_time: () -> Integer
    # Whether the next segment is predicted as a speaker turn.
    def speaker_next_turn?: () -> (true | false)
    def text: () -> String
    def no_speech_prob: () -> Float
  end
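
The signatures above use two time units: `full_get_segment_t0` and `full_get_segment_t1` are documented in centiseconds, while `Segment#start_time` and `Segment#end_time` are documented in milliseconds. A small sketch of converting between them and formatting the result; both helper names are hypothetical and not part of the gem:

```ruby
# Convert a centisecond timestamp (as documented for full_get_segment_t0/t1)
# to milliseconds. Hypothetical helper.
def centiseconds_to_ms(centiseconds)
  centiseconds * 10
end

# Format milliseconds as HH:MM:SS.mmm. Hypothetical helper.
def format_timestamp(ms)
  seconds, millis = ms.divmod(1000)
  minutes, seconds = seconds.divmod(60)
  hours, minutes = minutes.divmod(60)
  format("%02d:%02d:%02d.%03d", hours, minutes, seconds, millis)
end

ms = centiseconds_to_ms(1668)
puts ms                    # 16680
puts format_timestamp(ms)  # 00:00:16.680
```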
--- a/bindings/ruby/tests/helper.rb
+++ b/bindings/ruby/tests/helper.rb
@ -21,4 +21,15 @@ class TestBase < Test::Unit::TestCase
  def whisper
    self.class.whisper
  end
  module BuildOptions
    load "ext/options.rb", self
    Options.include self
    def enable_config(name)
    end
    def arg_config(name)
    end
  end
 end
--- a/bindings/ruby/tests/test_package.rb
+++ b/bindings/ruby/tests/test_package.rb
@ -21,11 +21,26 @@ class TestPackage < TestBase
      match_data = `rake -Tbuild`.match(/(whispercpp-(.+)\.gem)/)
      filename = match_data[1]
      version = match_data[2]
      basename = "whisper.#{RbConfig::CONFIG["DLEXT"]}"
      Dir.mktmpdir do |dir|
        system "gem", "install", "--install-dir", dir.shellescape, "--no-document", "pkg/#{filename.shellescape}", exception: true
-        assert_path_exist File.join(dir, "gems/whispercpp-#{version}/lib", basename)
+        assert_installed dir, version
      end
    end
    private
    def assert_installed(dir, version)
      assert_path_exist File.join(dir, "gems/whispercpp-#{version}/lib", "whisper.#{RbConfig::CONFIG["DLEXT"]}")
      assert_path_exist File.join(dir, "gems/whispercpp-#{version}/LICENSE")
      assert_path_not_exist File.join(dir, "gems/whispercpp-#{version}/ext/build")
    end
  end
  def test_build_options
    options = BuildOptions::Options.new
    assert_empty options.missing_options
    unless ENV["CI"]
      assert_empty options.extra_options
    end
  end
 end
--- a/bindings/ruby/whispercpp.gemspec
+++ b/bindings/ruby/whispercpp.gemspec
@ -3,8 +3,8 @@ require_relative "extsources"
 Gem::Specification.new do |s|
  s.name    = "whispercpp"
  s.authors = ["Georgi Gerganov", "Todd A. Fisher"]
-  s.version = '1.3.1'
+  s.version = '1.3.2'
-  s.date    = '2024-12-19'
+  s.date    = '2025-04-17'
  s.description = %q{High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model via Ruby}
  s.email   = 'todd.fisher@gmail.com'
  s.extra_rdoc_files = ['LICENSE', 'README.md']
@ -15,7 +15,8 @@ Gem::Specification.new do |s|
                if s.extra_rdoc_files.include?(basename)
                  basename
                else
-                  file.sub("../..", "ext")
+                  file.sub("../..", "ext/sources")
                      .sub("../javascript", "ext/sources/bindings/javascript")
                end
              }
@ -26,7 +27,7 @@ Gem::Specification.new do |s|
  s.required_ruby_version = '>= 3.1.0'
  #### Documentation and testing.
-  s.homepage = 'https://github.com/ggerganov/whisper.cpp'
+  s.homepage = 'https://github.com/ggml-org/whisper.cpp'
  s.rdoc_options = ['--main', 'README.md']
--- a/build-xcframework.sh
+++ b/build-xcframework.sh
@ -41,6 +41,11 @@ COMMON_CMAKE_ARGS=(
    -DGGML_OPENMP=${GGML_OPENMP}
 )
 XCODE_VERSION=$(xcodebuild -version 2>/dev/null | head -n1 | awk '{ print $2 }')
 MAJOR_VERSION=$(echo $XCODE_VERSION | cut -d. -f1)
 MINOR_VERSION=$(echo $XCODE_VERSION | cut -d. -f2)
 echo "Detected Xcode version: $XCODE_VERSION"
 check_required_tool() {
    local tool=$1
    local install_message=$2
@ -335,21 +340,28 @@ combine_static_libraries() {
    # Platform-specific post-processing for device builds
    if [[ "$is_simulator" == "false" ]]; then
-        if command -v vtool &>/dev/null; then
+        if command -v xcrun vtool &>/dev/null; then
            case "$platform" in
                "ios")
                    echo "Marking binary as a framework binary for iOS..."
-                    vtool -set-build-version ios ${IOS_MIN_OS_VERSION} ${IOS_MIN_OS_VERSION} -replace \
+                    xcrun vtool -set-build-version ios ${IOS_MIN_OS_VERSION} ${IOS_MIN_OS_VERSION} -replace \
                        -output "${base_dir}/${output_lib}" "${base_dir}/${output_lib}"
                    ;;
                "visionos")
                    echo "Marking binary as a framework binary for visionOS..."
-                    vtool -set-build-version xros ${VISIONOS_MIN_OS_VERSION} ${VISIONOS_MIN_OS_VERSION} -replace \
+                    if [[ "$MAJOR_VERSION" -gt 16 ]] || [[ "$MAJOR_VERSION" -eq 16 && "$MINOR_VERSION" -gt 2 ]]; then
                        echo "Xcode version greater than 16.2, using visionOS."
                        VISION_OS_BUILD_VERSION="visionos"
                    else
                        echo "Xcode version less than or equal to 16.2, using xros."
                        VISION_OS_BUILD_VERSION="xros"
                    fi
                    xcrun vtool -set-build-version ${VISION_OS_BUILD_VERSION} ${VISIONOS_MIN_OS_VERSION} ${VISIONOS_MIN_OS_VERSION} -replace \
                        -output "${base_dir}/${output_lib}" "${base_dir}/${output_lib}"
                    ;;
                "tvos")
                    echo "Marking binary as a framework binary for tvOS..."
-                    vtool -set-build-version tvos ${TVOS_MIN_OS_VERSION} ${TVOS_MIN_OS_VERSION} -replace \
+                    xcrun vtool -set-build-version tvos ${TVOS_MIN_OS_VERSION} ${TVOS_MIN_OS_VERSION} -replace \
                        -output "${base_dir}/${output_lib}" "${base_dir}/${output_lib}"
                    ;;
            esac
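
The visionOS branch above picks the platform name passed to `vtool` based on the detected Xcode version. The same decision, restated as a Ruby sketch for readability (the script itself remains shell; `vision_os_build_version` is a hypothetical name, and only the major and minor components are compared, as in the script):

```ruby
# Choose the vtool platform name for visionOS from an Xcode version string.
# Mirrors the shell condition: major > 16, or major == 16 and minor > 2.
def vision_os_build_version(xcode_version)
  major, minor = xcode_version.split(".").map(&:to_i)
  minor ||= 0 # a bare "16" is treated as 16.0 (assumption)
  if major > 16 || (major == 16 && minor > 2)
    "visionos"
  else
    "xros"
  end
end

%w[15.4 16.2 16.3 17.0].each do |version|
  puts "#{version} -> #{vision_os_build_version(version)}"
end
```

Because patch components are ignored, a version such as 16.2.1 is treated the same as 16.2.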
--- a/examples/addon.node/test/whisper.spec.js
+++ b/examples/addon.node/test/whisper.spec.js
@ -19,6 +19,12 @@ const whisperParamsMock = {
  no_timestamps: false,
  audio_ctx: 0,
  max_len: 0,
  prompt: "",
  print_progress: false,
  progress_callback: (progress) => {
    console.log(`Progress: ${progress}`);
  },
  max_context: -1
 };
 describe("Run whisper.node", () => {
--- a/examples/addon.node/addon.cpp
+++ b/examples/addon.node/addon.cpp
@ -368,6 +368,12 @@ Napi::Value whisper(const Napi::CallbackInfo& info) {
  bool comma_in_time = whisper_params.Get("comma_in_time").As<Napi::Boolean>();
  int32_t max_len = whisper_params.Get("max_len").As<Napi::Number>();
  // Add support for max_context
  int32_t max_context = -1;
  if (whisper_params.Has("max_context") && whisper_params.Get("max_context").IsNumber()) {
    max_context = whisper_params.Get("max_context").As<Napi::Number>();
  }
  // support prompt
  std::string prompt = "";
  if (whisper_params.Has("prompt") && whisper_params.Get("prompt").IsString()) {
@ -407,6 +413,7 @@ Napi::Value whisper(const Napi::CallbackInfo& info) {
  params.pcmf32 = pcmf32_vec;
  params.comma_in_time = comma_in_time;
  params.max_len = max_len;
  params.max_context = max_context;
  params.print_progress = print_progress;
  params.prompt = prompt;
--- a/examples/bench/README.md
+++ b/examples/bench/README.md
@ -4,7 +4,7 @@ A very basic tool for benchmarking the inference performance on your device. The
 the transformer on some random audio data and records the execution time. This way we can have an objective comparison
 of the performance of the model for various setups.
-Benchmark results are tracked in the following Github issue: https://github.com/ggerganov/whisper.cpp/issues/89
+Benchmark results are tracked in the following Github issue: https://github.com/ggml-org/whisper.cpp/issues/89
 ```bash
 # run the bench tool on the small.en model using 4 threads
@ -40,7 +40,7 @@ system_info: n_threads = 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WA
 If you wish, you can submit these results here:
-  https://github.com/ggerganov/whisper.cpp/issues/89
+  https://github.com/ggml-org/whisper.cpp/issues/89
 Please include the following information:
--- a/examples/command/command.cpp
+++ b/examples/command/command.cpp
@ -3,7 +3,7 @@
 // Speak short text commands to the microphone.
 // This program will detect your voice commands and convert them to text.
 //
-// ref: https://github.com/ggerganov/whisper.cpp/issues/171
+// ref: https://github.com/ggml-org/whisper.cpp/issues/171
 //
 #include "common-sdl.h"
--- a/examples/ffmpeg-transcode.cpp
+++ b/examples/ffmpeg-transcode.cpp
@ -249,6 +249,20 @@ static int decode_audio(struct audio_buffer *audio_buf, s16 **data, int *size)
 	/* prepare resampler */
 	swr = swr_alloc();
 #if LIBAVCODEC_VERSION_MAJOR > 60
 	AVChannelLayout in_ch_layout = codec->ch_layout;
 	AVChannelLayout out_ch_layout = AV_CHANNEL_LAYOUT_MONO;
 	/* Set the source audio layout as-is */
 	av_opt_set_chlayout(swr, "in_chlayout", &in_ch_layout, 0);
 	av_opt_set_int(swr, "in_sample_rate", codec->sample_rate, 0);
 	av_opt_set_sample_fmt(swr, "in_sample_fmt", codec->sample_fmt, 0);
 	/* Convert it into 16 kHz mono */
 	av_opt_set_chlayout(swr, "out_chlayout", &out_ch_layout, 0);
 	av_opt_set_int(swr, "out_sample_rate", WAVE_SAMPLE_RATE, 0);
 	av_opt_set_sample_fmt(swr, "out_sample_fmt", AV_SAMPLE_FMT_S16, 0);
 #else
 	av_opt_set_int(swr, "in_channel_count", codec->channels, 0);
 	av_opt_set_int(swr, "out_channel_count", 1, 0);
 	av_opt_set_int(swr, "in_channel_layout", codec->channel_layout, 0);
@ -257,6 +271,7 @@ static int decode_audio(struct audio_buffer *audio_buf, s16 **data, int *size)
 	av_opt_set_int(swr, "out_sample_rate", WAVE_SAMPLE_RATE, 0);
 	av_opt_set_sample_fmt(swr, "in_sample_fmt", codec->sample_fmt, 0);
 	av_opt_set_sample_fmt(swr, "out_sample_fmt", AV_SAMPLE_FMT_S16, 0);
 #endif
 	swr_init(swr);
 	if (!swr_is_initialized(swr)) {
--- a/examples/livestream.sh
+++ b/examples/livestream.sh
@ -2,7 +2,7 @@
 #
 # Transcribe audio livestream by feeding ffmpeg output to whisper.cpp at regular intervals
 # Idea by @semiformal-net
-# ref: https://github.com/ggerganov/whisper.cpp/issues/185
+# ref: https://github.com/ggml-org/whisper.cpp/issues/185
 #
 set -eo pipefail
--- a/examples/server.py
+++ b/examples/server.py
@ -1,39 +1,115 @@
 import http.server
 import socketserver
 import os
 import sys
 from pathlib import Path
 import urllib.parse
 SCRIPT_DIR = Path(__file__).parent.absolute()
 DIRECTORY = os.path.join(SCRIPT_DIR, "../build-em/bin")
 DIRECTORY = os.path.abspath(DIRECTORY)
 # The context root we want for all applications
 CONTEXT_ROOT = "/whisper.cpp"
 class CustomHTTPRequestHandler(http.server.SimpleHTTPRequestHandler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, directory=DIRECTORY, **kwargs)
    def do_GET(self):
-        # If requesting a worker file from any subdirectory
-        if '.worker.js' in self.path:
+        # Redirect root to the context root
+        if self.path == '/':
            self.send_response(302)
            self.send_header('Location', CONTEXT_ROOT + '/')
            self.end_headers()
            return
        # Handle requests under the context root
        if self.path.startswith(CONTEXT_ROOT):
            # Remove the context root prefix to get the actual path
            actual_path = self.path[len(CONTEXT_ROOT):]
            if not actual_path:
                self.send_response(302)
                self.send_header('Location', CONTEXT_ROOT + '/')
                self.end_headers()
                return
            if '.worker.js' in actual_path:
                worker_file = os.path.basename(actual_path)
                worker_path = os.path.join(DIRECTORY, worker_file)
                if os.path.exists(worker_path):
                    print(f"Found worker file: {worker_path}")
                    self.path = '/' + worker_file
                else:
                    print(f"Worker file not found: {worker_path}")
            elif actual_path == '/':
                self.path = '/whisper.wasm/index.html'
            elif actual_path.startswith('/bench.wasm/') or actual_path.startswith('/command.wasm/') or actual_path.startswith('/stream.wasm/'):
                # Keep the path as is, just remove the context root
                self.path = actual_path
            # For all other paths under the context root
            else:
                # Check if this is a request to a file in whisper.wasm
                potential_file = os.path.join(DIRECTORY, 'whisper.wasm', actual_path.lstrip('/'))
                if os.path.exists(potential_file) and not os.path.isdir(potential_file):
                    self.path = '/whisper.wasm' + actual_path
                else:
                    # Try to resolve the file from the base directory
                    potential_file = os.path.join(DIRECTORY, actual_path.lstrip('/'))
                    if os.path.exists(potential_file):
                        self.path = actual_path
         # For direct requests to worker files (without the context root, as
         # these are in the build-em/bin directory)
        elif '.worker.js' in self.path:
            worker_file = os.path.basename(self.path)
            worker_path = os.path.join(DIRECTORY, worker_file)
            if os.path.exists(worker_path):
                self.path = '/' + worker_file
        # Handle coi-serviceworker.js separately
        if 'coi-serviceworker.js' in self.path:
            worker_file = "coi-serviceworker.js"
            worker_path = os.path.join(SCRIPT_DIR, worker_file)
            if os.path.exists(worker_path):
                self.send_response(200)
                self.send_header('Content-type', 'application/javascript')
                self.end_headers()
                with open(worker_path, 'rb') as file:
                    self.wfile.write(file.read())
                return
            else:
                print(f"Warning: Could not find {worker_path}")
        return super().do_GET()
    def end_headers(self):
        # Add required headers for SharedArrayBuffer
        self.send_header("Cross-Origin-Opener-Policy", "same-origin")
        self.send_header("Cross-Origin-Embedder-Policy", "require-corp")
-        self.send_header("Access-Control-Allow-Origin", "*");
+        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()
 PORT = 8000
-with socketserver.TCPServer(("", PORT), CustomHTTPRequestHandler) as httpd:
-    print(f"Serving directory '{DIRECTORY}' at http://localhost:{PORT}")
-    try:
-        httpd.serve_forever()
-    except KeyboardInterrupt:
-        print("\nServer stopped.")
+# Enable address reuse
+class CustomServer(socketserver.TCPServer):
+    allow_reuse_address = True
+
+try:
+    with CustomServer(("", PORT), CustomHTTPRequestHandler) as httpd:
        print(f"Serving directory '{DIRECTORY}' at http://localhost:{PORT}")
        print(f"Application context root: http://localhost:{PORT}{CONTEXT_ROOT}/")
        try:
            httpd.serve_forever()
        except KeyboardInterrupt:
            print("\nServer stopped.")
            # Force complete exit
            sys.exit(0)
 except OSError as e:
    print(f"Error: {e}")
    sys.exit(1)
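Note on the examples/server.py change above: the rewritten script relies on two ideas. It strips a fixed context root from the request path before resolving files, and it subclasses `socketserver.TCPServer` so the port can be rebound right after a restart. The following is a minimal sketch of both, not code from the patch. The names `strip_context_root` and `ReusableTCPServer` are hypothetical and chosen only for illustration; `CONTEXT_ROOT` mirrors the value used in the script.

```python
import socketserver

# Same value the script uses for its context root.
CONTEXT_ROOT = "/whisper.cpp"


def strip_context_root(path, context_root=CONTEXT_ROOT):
    """Return the part of `path` after the context root.

    Hypothetical helper (not in the patch). Returns None when the path
    is outside the context root, and an empty string when the path is
    exactly the context root (the script answers that case with a 302
    redirect to CONTEXT_ROOT + '/').
    """
    if not path.startswith(context_root):
        return None
    return path[len(context_root):]


class ReusableTCPServer(socketserver.TCPServer):
    # TCPServer.server_bind() sets SO_REUSEADDR on the listening socket
    # when this class attribute is True, so a quick restart is less
    # likely to fail with "Address already in use".
    allow_reuse_address = True
```

With this helper, `/whisper.cpp/stream.wasm/index.html` maps to `/stream.wasm/index.html`, which is the path the handler then serves from the build directory. Like the script, the prefix check is a plain `startswith`, so a path such as `/whisper.cppx` would also be treated as being under the context root.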
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -79,6 +79,7 @@ struct whisper_params {
    bool use_gpu         = true;
    bool flash_attn      = false;
    bool suppress_nst    = false;
    bool no_context      = false;
    std::string language        = "en";
    std::string prompt          = "";
@@ -140,6 +141,7 @@ void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & para
    fprintf(stderr, "  --convert,                     [%-7s] Convert audio to WAV, requires ffmpeg on the server\n", sparams.ffmpeg_converter ? "true" : "false");
    fprintf(stderr, "  -sns,      --suppress-nst      [%-7s] suppress non-speech tokens\n", params.suppress_nst ? "true" : "false");
    fprintf(stderr, "  -nth N,    --no-speech-thold N [%-7.2f] no speech threshold\n",   params.no_speech_thold);
    fprintf(stderr, "  -nc,       --no-context        [%-7s] do not use previous audio context\n", params.no_context ? "true" : "false");
    fprintf(stderr, "\n");
 }
@@ -186,6 +188,7 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params, serve
        else if (arg == "-fa"   || arg == "--flash-attn")      { params.flash_attn      = true; }
        else if (arg == "-sns"  || arg == "--suppress-nst")    { params.suppress_nst    = true; }
        else if (arg == "-nth"  || arg == "--no-speech-thold") { params.no_speech_thold = std::stof(argv[++i]); }
        else if (arg == "-nc"   || arg == "--no-context")      { params.no_context      = true; }
        // server params
        else if (                  arg == "--port")            { sparams.port        = std::stoi(argv[++i]); }
@@ -506,6 +509,10 @@ void get_req_parameters(const Request & req, whisper_params & params)
    {
        params.suppress_nst = parse_str_to_bool(req.get_file_value("suppress_nst").content);
    }
    if (req.has_file("no_context"))
    {
        params.no_context = parse_str_to_bool(req.get_file_value("no_context").content);
    }
 }
 }  // namespace
@@ -818,6 +825,7 @@ int main(int argc, char ** argv) {
            wparams.no_timestamps    = params.no_timestamps;
            wparams.token_timestamps = !params.no_timestamps && params.response_format == vjson_format;
            wparams.no_context       = params.no_context;
            wparams.suppress_nst     = params.suppress_nst;
--- a/examples/twitch.sh
+++ b/examples/twitch.sh
@@ -2,7 +2,7 @@
 #
 # Transcribe twitch.tv livestream by feeding audio input to whisper.cpp at regular intervals
 # Thanks to @keyehzy
-# ref: https://github.com/ggerganov/whisper.cpp/issues/209
+# ref: https://github.com/ggml-org/whisper.cpp/issues/209
 #
 # The script currently depends on the third-party tool "streamlink"
 # On Mac OS, you can install it via "brew install streamlink"
--- a/examples/whisper.android.java/app/src/main/jni/whisper/CMakeLists.txt
+++ b/examples/whisper.android.java/app/src/main/jni/whisper/CMakeLists.txt
@@ -14,6 +14,8 @@ set(SOURCE_FILES
    ${WHISPER_LIB_DIR}/ggml/src/ggml-cpu/ggml-cpu.cpp
    ${WHISPER_LIB_DIR}/ggml/src/ggml-cpu/unary-ops.cpp
    ${WHISPER_LIB_DIR}/ggml/src/ggml-cpu/binary-ops.cpp
    ${WHISPER_LIB_DIR}/ggml/src/ggml-cpu/vec.cpp
    ${WHISPER_LIB_DIR}/ggml/src/ggml-cpu/ops.cpp
    ${WHISPER_LIB_DIR}/ggml/src/ggml-alloc.c
    ${WHISPER_LIB_DIR}/ggml/src/ggml-backend.cpp
    ${WHISPER_LIB_DIR}/ggml/src/ggml-backend-reg.cpp
--- a/examples/whisper.android/lib/src/main/jni/whisper/CMakeLists.txt
+++ b/examples/whisper.android/lib/src/main/jni/whisper/CMakeLists.txt
@@ -34,6 +34,8 @@ if (NOT GGML_HOME)
        ${WHISPER_LIB_DIR}/ggml/src/ggml-cpu/ggml-cpu-traits.cpp
        ${WHISPER_LIB_DIR}/ggml/src/ggml-cpu/unary-ops.cpp
        ${WHISPER_LIB_DIR}/ggml/src/ggml-cpu/binary-ops.cpp
        ${WHISPER_LIB_DIR}/ggml/src/ggml-cpu/vec.cpp
        ${WHISPER_LIB_DIR}/ggml/src/ggml-cpu/ops.cpp
        )
 endif()
--- a/examples/whisper.nvim/whisper.nvim
+++ b/examples/whisper.nvim/whisper.nvim
@@ -5,7 +5,7 @@
 # This simple script is called by Neovim to capture audio from the microphone and transcribe it with Whisper.
 # In order for this to work, you need to clone the whisper.cpp repo and build the 'stream' tool
 #
-#   git clone https://github.com/ggerganov/whisper.cpp
+#   git clone https://github.com/ggml-org/whisper.cpp
 #   cd whisper.cpp
 #   make stream
 #
@@ -31,7 +31,7 @@
 model="base.en"
 # export the path to the whisper.cpp repo in the WHISPER_CPP_HOME env variable
-# https://github.com/ggerganov/whisper.cpp
+# https://github.com/ggml-org/whisper.cpp
 cd "${WHISPER_CPP_HOME}"
 if [ ! -f ./stream ] ; then
--- a/examples/whisper.wasm/CMakeLists.txt
+++ b/examples/whisper.wasm/CMakeLists.txt
@@ -36,7 +36,7 @@ set_target_properties(${TARGET} PROPERTIES LINK_FLAGS " \
    -s MAXIMUM_MEMORY=2000MB \
    -s ALLOW_MEMORY_GROWTH=1 \
    -s FORCE_FILESYSTEM=1 \
-    -s EXPORTED_RUNTIME_METHODS=\"['print', 'printErr', 'ccall', 'cwrap']\" \
+    -s EXPORTED_RUNTIME_METHODS=\"['print', 'printErr', 'ccall', 'cwrap', 'HEAPU8']\" \
    ${EXTRA_FLAGS} \
    ")
--- a/examples/whisper.wasm/README.md
+++ b/examples/whisper.wasm/README.md
@@ -30,7 +30,7 @@ Link: https://ggerganov.github.io/whisper.cpp/
 ```bash (v3.1.2)
 # build using Emscripten
-git clone https://github.com/ggerganov/whisper.cpp
+git clone https://github.com/ggml-org/whisper.cpp
 cd whisper.cpp
 mkdir build-em && cd build-em
 emcmake cmake ..
--- a/examples/whisper.wasm/emscripten.cpp
+++ b/examples/whisper.wasm/emscripten.cpp
@@ -65,13 +65,14 @@ EMSCRIPTEN_BINDINGS(whisper) {
        }
        struct whisper_full_params params = whisper_full_default_params(whisper_sampling_strategy::WHISPER_SAMPLING_GREEDY);
        bool is_multilingual = whisper_is_multilingual(g_contexts[index]);
        params.print_realtime   = true;
        params.print_progress   = false;
        params.print_timestamps = true;
        params.print_special    = false;
        params.translate        = translate;
-        params.language         = whisper_is_multilingual(g_contexts[index]) ? lang.c_str() : "en";
+        params.language         = is_multilingual ? strdup(lang.c_str()) : "en";
        params.n_threads        = std::min(nthreads, std::min(16, mpow2(std::thread::hardware_concurrency())));
        params.offset_ms        = 0;
@@ -102,10 +103,13 @@ EMSCRIPTEN_BINDINGS(whisper) {
        // run the worker
        {
-            g_worker = std::thread([index, params, pcmf32 = std::move(pcmf32)]() {
+            g_worker = std::thread([index, params, pcmf32 = std::move(pcmf32), is_multilingual]() {
                whisper_reset_timings(g_contexts[index]);
                whisper_full(g_contexts[index], params, pcmf32.data(), pcmf32.size());
                whisper_print_timings(g_contexts[index]);
                if (is_multilingual) {
                    free((void*)params.language);
                }
            });
        }
--- a/examples/yt-wsp.sh
+++ b/examples/yt-wsp.sh
@@ -25,12 +25,12 @@
 # SOFTWARE.
 # Small shell script to more easily automatically download and transcribe live stream VODs.
-# This uses YT-DLP, ffmpeg and the CPP version of Whisper: https://github.com/ggerganov/whisper.cpp
+# This uses YT-DLP, ffmpeg and the CPP version of Whisper: https://github.com/ggml-org/whisper.cpp
 # Use `./examples/yt-wsp.sh help` to print help info.
 #
 # Sample usage:
 #
-#   git clone https://github.com/ggerganov/whisper.cpp
+#   git clone https://github.com/ggml-org/whisper.cpp
 #   cd whisper.cpp
 #   make
 #   ./examples/yt-wsp.sh https://www.youtube.com/watch?v=1234567890
@@ -44,7 +44,7 @@ SCRIPT_DIR="${SCRIPT_PATH%/*}"
 ################################################################################
 # Documentation on downloading models can be found in the whisper.cpp repo:
-# https://github.com/ggerganov/whisper.cpp/#usage
+# https://github.com/ggml-org/whisper.cpp/#usage
 #
 # note: unless a multilingual model is specified, WHISPER_LANG will be ignored
 # and the video will be transcribed as if the audio were in the English language
@@ -103,10 +103,10 @@ check_requirements() {
    fi;
    if ! command -v "${WHISPER_EXECUTABLE}" &>/dev/null; then
-        echo "The C++ implementation of Whisper is required: https://github.com/ggerganov/whisper.cpp"
+        echo "The C++ implementation of Whisper is required: https://github.com/ggml-org/whisper.cpp"
        echo "Sample usage:";
        echo "";
-        echo "  git clone https://github.com/ggerganov/whisper.cpp";
+        echo "  git clone https://github.com/ggml-org/whisper.cpp";
        echo "  cd whisper.cpp";
        echo "  make";
        echo "  ./examples/yt-wsp.sh https://www.youtube.com/watch?v=1234567890";
--- a/ggml/src/ggml-cpu/CMakeLists.txt
+++ b/ggml/src/ggml-cpu/CMakeLists.txt
@@ -28,6 +28,11 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
        ggml-cpu/binary-ops.cpp
        ggml-cpu/unary-ops.h
        ggml-cpu/unary-ops.cpp
        ggml-cpu/simd-mappings.h
        ggml-cpu/vec.h
        ggml-cpu/vec.cpp
        ggml-cpu/ops.h
        ggml-cpu/ops.cpp
        )
    target_compile_features(${GGML_CPU_NAME} PRIVATE c_std_11 cxx_std_17)
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
--- a/ggml/src/ggml-cpu/ops.h
+++ b/ggml/src/ggml-cpu/ops.h
@@ -0,0 +1,128 @@
 #pragma once
 #include "ggml.h"
 //
 // cache line
 //
 #if defined(__cpp_lib_hardware_interference_size)
 #define CACHE_LINE_SIZE std::hardware_destructive_interference_size
 #else
 #if defined(__POWER9_VECTOR__)
 #define CACHE_LINE_SIZE 128
 #elif defined(__VXE__) || defined(__VXE2__)
 #define CACHE_LINE_SIZE 256
 #else
 #define CACHE_LINE_SIZE 64
 #endif
 #endif
 static const size_t CACHE_LINE_SIZE_F32 = CACHE_LINE_SIZE/sizeof(float);
 #ifdef __cplusplus
 extern "C" {
 #endif
 void ggml_compute_forward_dup(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_add(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_add1(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_acc(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_sum(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_sum_rows(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_mean(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_argmax(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_count_equal(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_repeat(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_repeat_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_concat(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_silu_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_norm(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_rms_norm(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_rms_norm_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_group_norm(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_l2_norm(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_out_prod(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_scale(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_set(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_cpy(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_cont(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_reshape(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_view(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_permute(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_transpose(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_get_rows(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_get_rows_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_diag(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_diag_mask_inf(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_diag_mask_zero(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_soft_max(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_soft_max_ext_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_rope(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_rope_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_clamp(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_conv_transpose_1d(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_im2col(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_im2col_back_f32(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_conv_transpose_2d(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_pool_1d(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_pool_2d(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_pool_2d_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_upscale(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_pad(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_pad_reflect_1d(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_arange(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_timestep_embedding(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_argsort(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_leaky_relu(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_flash_attn_ext(
    const struct ggml_compute_params * params,
    const struct ggml_tensor * q,
    const struct ggml_tensor * k,
    const struct ggml_tensor * v,
    const struct ggml_tensor * mask,
    struct ggml_tensor * dst);
 void ggml_compute_forward_flash_attn_back(
        const struct ggml_compute_params * params,
        const bool masked,
        struct ggml_tensor * dst);
 void ggml_compute_forward_ssm_conv(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_ssm_scan(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_win_part(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_win_unpart(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_unary(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_get_rel_pos(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_add_rel_pos(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_rwkv_wkv6(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_rwkv_wkv7(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_gla(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_map_unary(
    const struct ggml_compute_params * params,
    struct ggml_tensor * dst,
    const ggml_unary_op_f32_t fun);
 void ggml_compute_forward_map_binary(
    const struct ggml_compute_params * params,
    struct ggml_tensor * dst,
    const ggml_binary_op_f32_t fun);
 void ggml_compute_forward_map_custom1_f32(
    const struct ggml_compute_params * params,
    struct ggml_tensor * dst,
    const ggml_custom1_op_f32_t fun);
 void ggml_compute_forward_map_custom2_f32(
    const struct ggml_compute_params * params,
    struct ggml_tensor * dst,
    const ggml_custom2_op_f32_t fun);
 void ggml_compute_forward_map_custom3_f32(
    const struct ggml_compute_params * params,
    struct ggml_tensor * dst,
    const ggml_custom3_op_f32_t fun);
 void ggml_compute_forward_map_custom1(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_map_custom2(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_map_custom3(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_cross_entropy_loss(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_cross_entropy_loss_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 void ggml_compute_forward_opt_step_adamw(const struct ggml_compute_params * params, struct ggml_tensor * dst);
 #ifdef __cplusplus
 }
 #endif
--- a/ggml/src/ggml-cpu/simd-mappings.h
+++ b/ggml/src/ggml-cpu/simd-mappings.h
@@ -0,0 +1,884 @@
 #pragma once
 #include "ggml-cpu-impl.h"
 //
 // simd mappings
 //
 // we define a common set of C macros which map to specific intrinsics based on the current architecture
 // we then implement the fundamental computation operations below using only these macros
 // adding support for new architectures requires to define the corresponding SIMD macros
 //
 // GGML_F32_STEP / GGML_F16_STEP
 //   number of elements to process in a single step
 //
 // GGML_F32_EPR / GGML_F16_EPR
 //   number of elements to fit in a single register
 //
 #if defined(__ARM_NEON) && defined(__ARM_FEATURE_FMA)
 #define GGML_SIMD
 // F32 NEON
 #define GGML_F32_STEP 16
 #define GGML_F32_EPR  4
 #define GGML_F32x4              float32x4_t
 #define GGML_F32x4_ZERO         vdupq_n_f32(0.0f)
 #define GGML_F32x4_SET1(x)      vdupq_n_f32(x)
 #define GGML_F32x4_LOAD         vld1q_f32
 #define GGML_F32x4_STORE        vst1q_f32
 #define GGML_F32x4_FMA(a, b, c) vfmaq_f32(a, b, c)
 #define GGML_F32x4_ADD          vaddq_f32
 #define GGML_F32x4_MUL          vmulq_f32
 #define GGML_F32x4_REDUCE_ONE(x) vaddvq_f32(x)
 #define GGML_F32x4_REDUCE(res, x)                       \
 {                                                       \
    int offset = GGML_F32_ARR >> 1;                     \
    for (int i = 0; i < offset; ++i) {                  \
        (x)[i] = vaddq_f32((x)[i], (x)[offset+i]);      \
    }                                                   \
    offset >>= 1;                                       \
    for (int i = 0; i < offset; ++i) {                  \
        (x)[i] = vaddq_f32((x)[i], (x)[offset+i]);      \
    }                                                   \
    offset >>= 1;                                       \
    for (int i = 0; i < offset; ++i) {                  \
        (x)[i] = vaddq_f32((x)[i], (x)[offset+i]);      \
    }                                                   \
    (res) = (ggml_float) GGML_F32x4_REDUCE_ONE((x)[0]); \
 }
 #define GGML_F32_VEC        GGML_F32x4
 #define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
 #define GGML_F32_VEC_SET1   GGML_F32x4_SET1
 #define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
 #define GGML_F32_VEC_STORE  GGML_F32x4_STORE
 #define GGML_F32_VEC_FMA    GGML_F32x4_FMA
 #define GGML_F32_VEC_ADD    GGML_F32x4_ADD
 #define GGML_F32_VEC_MUL    GGML_F32x4_MUL
 #define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE
 // F16 NEON
 #if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    #define GGML_F16_STEP 32
    #define GGML_F16_EPR  8
    #define GGML_F16x8              float16x8_t
    #define GGML_F16x8_ZERO         vdupq_n_f16(0.0f)
    #define GGML_F16x8_SET1(x)      vdupq_n_f16(x)
    #define GGML_F16x8_LOAD(x)      vld1q_f16((const ggml_fp16_internal_t *)(x))
    #define GGML_F16x8_STORE        vst1q_f16
    #define GGML_F16x8_FMA(a, b, c) vfmaq_f16(a, b, c)
    #define GGML_F16x8_ADD          vaddq_f16
    #define GGML_F16x8_MUL          vmulq_f16
    #define GGML_F16x8_REDUCE(res, x)                               \
    do {                                                            \
        int offset = GGML_F16_ARR >> 1;                             \
        for (int i = 0; i < offset; ++i) {                          \
            (x)[i] = vaddq_f16((x)[i], (x)[offset+i]);              \
        }                                                           \
        offset >>= 1;                                               \
        for (int i = 0; i < offset; ++i) {                          \
            (x)[i] = vaddq_f16((x)[i], (x)[offset+i]);              \
        }                                                           \
        offset >>= 1;                                               \
        for (int i = 0; i < offset; ++i) {                          \
            (x)[i] = vaddq_f16((x)[i], (x)[offset+i]);              \
        }                                                           \
        const float32x4_t t0 = vcvt_f32_f16(vget_low_f16 ((x)[0])); \
        const float32x4_t t1 = vcvt_f32_f16(vget_high_f16((x)[0])); \
        (res) = (ggml_float) vaddvq_f32(vaddq_f32(t0, t1));         \
    } while (0)
    #define GGML_F16_VEC                GGML_F16x8
    #define GGML_F16_VEC_ZERO           GGML_F16x8_ZERO
    #define GGML_F16_VEC_SET1           GGML_F16x8_SET1
    #define GGML_F16_VEC_LOAD(p, i)     GGML_F16x8_LOAD(p)
    #define GGML_F16_VEC_STORE(p, r, i) GGML_F16x8_STORE((ggml_fp16_internal_t *)(p), (r)[i])
    #define GGML_F16_VEC_FMA            GGML_F16x8_FMA
    #define GGML_F16_VEC_ADD            GGML_F16x8_ADD
    #define GGML_F16_VEC_MUL            GGML_F16x8_MUL
    #define GGML_F16_VEC_REDUCE         GGML_F16x8_REDUCE
 #else
    // if FP16 vector arithmetic is not supported, we use FP32 instead
    // and take advantage of the vcvt_ functions to convert to/from FP16
    #define GGML_F16_STEP 16
    #define GGML_F16_EPR  4
    #define GGML_F32Cx4              float32x4_t
    #define GGML_F32Cx4_ZERO         vdupq_n_f32(0.0f)
    #define GGML_F32Cx4_SET1(x)      vdupq_n_f32(x)
    #define GGML_F32Cx4_LOAD(x)      vcvt_f32_f16(vld1_f16((const ggml_fp16_internal_t *)(x)))
    #define GGML_F32Cx4_STORE(x, y)  vst1_f16(x, vcvt_f16_f32(y))
    #define GGML_F32Cx4_FMA(a, b, c) vfmaq_f32(a, b, c)
    #define GGML_F32Cx4_ADD          vaddq_f32
    #define GGML_F32Cx4_MUL          vmulq_f32
    #define GGML_F32Cx4_REDUCE       GGML_F32x4_REDUCE
    #define GGML_F16_VEC                GGML_F32Cx4
    #define GGML_F16_VEC_ZERO           GGML_F32Cx4_ZERO
    #define GGML_F16_VEC_SET1           GGML_F32Cx4_SET1
    #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx4_LOAD(p)
    #define GGML_F16_VEC_STORE(p, r, i) GGML_F32Cx4_STORE((ggml_fp16_internal_t *)(p), r[i])
    #define GGML_F16_VEC_FMA            GGML_F32Cx4_FMA
    #define GGML_F16_VEC_ADD            GGML_F32Cx4_ADD
    #define GGML_F16_VEC_MUL            GGML_F32Cx4_MUL
    #define GGML_F16_VEC_REDUCE         GGML_F32Cx4_REDUCE
 #endif
 #elif defined(__AVX512F__)
 #define GGML_SIMD
 // F32 AVX512
 #define GGML_F32_STEP 64
 #define GGML_F32_EPR  16
 #define GGML_F32x16         __m512
 #define GGML_F32x16_ZERO    _mm512_setzero_ps()
 #define GGML_F32x16_SET1(x) _mm512_set1_ps(x)
 #define GGML_F32x16_LOAD    _mm512_loadu_ps
 #define GGML_F32x16_STORE   _mm512_storeu_ps
 // _mm512_fmadd_ps is defined in AVX512F so no guard is required
 #define GGML_F32x16_FMA(a, b, c) _mm512_fmadd_ps(b, c, a)
 #define GGML_F32x16_ADD     _mm512_add_ps
 #define GGML_F32x16_MUL     _mm512_mul_ps
 #define GGML_F32x16_REDUCE(res, x)                                    \
 do {                                                                  \
    int offset = GGML_F32_ARR >> 1;                                   \
    for (int i = 0; i < offset; ++i) {                                \
        x[i] = _mm512_add_ps(x[i], x[offset+i]);                      \
    }                                                                 \
    offset >>= 1;                                                     \
    for (int i = 0; i < offset; ++i) {                                \
        x[i] = _mm512_add_ps(x[i], x[offset+i]);                      \
    }                                                                 \
    offset >>= 1;                                                     \
    for (int i = 0; i < offset; ++i) {                                \
        x[i] = _mm512_add_ps(x[i], x[offset+i]);                      \
    }                                                                 \
    res = (ggml_float) _mm512_reduce_add_ps(x[0]);                    \
 } while (0)
 // TODO: is this optimal ?
 #define GGML_F32_VEC        GGML_F32x16
 #define GGML_F32_VEC_ZERO   GGML_F32x16_ZERO
 #define GGML_F32_VEC_SET1   GGML_F32x16_SET1
 #define GGML_F32_VEC_LOAD   GGML_F32x16_LOAD
 #define GGML_F32_VEC_STORE  GGML_F32x16_STORE
 #define GGML_F32_VEC_FMA    GGML_F32x16_FMA
 #define GGML_F32_VEC_ADD    GGML_F32x16_ADD
 #define GGML_F32_VEC_MUL    GGML_F32x16_MUL
 #define GGML_F32_VEC_REDUCE GGML_F32x16_REDUCE
 // F16 AVX512
 // F16 AVX
 #define GGML_F16_STEP 64
 #define GGML_F16_EPR  16
 // AVX512 has FP16 extension (AVX512_FP16) but I don't have it on my machine so I use FP32 instead
 #define GGML_F32Cx16             __m512
 #define GGML_F32Cx16_ZERO        _mm512_setzero_ps()
 #define GGML_F32Cx16_SET1(x)     _mm512_set1_ps(x)
 // unlike  _mm256_cvt intrinsics that require F16C, _mm512_cvt is defined in AVX512F
 // so F16C guard isn't required
 #define GGML_F32Cx16_LOAD(x)     _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)(x)))
 #define GGML_F32Cx16_STORE(x, y) _mm256_storeu_si256((__m256i *)(x), _mm512_cvtps_ph(y, 0))
 #define GGML_F32Cx16_FMA(a, b, c) _mm512_fmadd_ps(b, c, a)
 #define GGML_F32Cx16_ADD         _mm512_add_ps
 #define GGML_F32Cx16_MUL         _mm512_mul_ps
 #define GGML_F32Cx16_REDUCE(res, x)                               \
 do {                                                              \
    int offset = GGML_F32_ARR >> 1;                               \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = _mm512_add_ps(x[i], x[offset+i]);                  \
    }                                                             \
    offset >>= 1;                                                 \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = _mm512_add_ps(x[i], x[offset+i]);                  \
    }                                                             \
    offset >>= 1;                                                 \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = _mm512_add_ps(x[i], x[offset+i]);                  \
    }                                                             \
    res = (ggml_float) _mm512_reduce_add_ps(x[0]);                \
 } while (0)
 #define GGML_F16_VEC                GGML_F32Cx16
 #define GGML_F16_VEC_ZERO           GGML_F32Cx16_ZERO
 #define GGML_F16_VEC_SET1           GGML_F32Cx16_SET1
 #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx16_LOAD(p)
 #define GGML_F16_VEC_STORE(p, r, i) GGML_F32Cx16_STORE(p, r[i])
 #define GGML_F16_VEC_FMA            GGML_F32Cx16_FMA
 #define GGML_F16_VEC_ADD            GGML_F32Cx16_ADD
 #define GGML_F16_VEC_MUL            GGML_F32Cx16_MUL
 #define GGML_F16_VEC_REDUCE         GGML_F32Cx16_REDUCE
 #elif defined(__AVX__)
 #define GGML_SIMD
 // F32 AVX
 #define GGML_F32_STEP 32
 #define GGML_F32_EPR  8
 #define GGML_F32x8         __m256
 #define GGML_F32x8_ZERO    _mm256_setzero_ps()
 #define GGML_F32x8_SET1(x) _mm256_set1_ps(x)
 #define GGML_F32x8_LOAD    _mm256_loadu_ps
 #define GGML_F32x8_STORE   _mm256_storeu_ps
 #if defined(__FMA__)
    #define GGML_F32x8_FMA(a, b, c) _mm256_fmadd_ps(b, c, a)
 #else
    #define GGML_F32x8_FMA(a, b, c) _mm256_add_ps(_mm256_mul_ps(b, c), a)
 #endif
 #define GGML_F32x8_ADD     _mm256_add_ps
 #define GGML_F32x8_MUL     _mm256_mul_ps
 #define GGML_F32x8_REDUCE(res, x)                                 \
 do {                                                              \
    int offset = GGML_F32_ARR >> 1;                               \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = _mm256_add_ps(x[i], x[offset+i]);                  \
    }                                                             \
    offset >>= 1;                                                 \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = _mm256_add_ps(x[i], x[offset+i]);                  \
    }                                                             \
    offset >>= 1;                                                 \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = _mm256_add_ps(x[i], x[offset+i]);                  \
    }                                                             \
    const __m128 t0 = _mm_add_ps(_mm256_castps256_ps128(x[0]),    \
                                 _mm256_extractf128_ps(x[0], 1)); \
    const __m128 t1 = _mm_hadd_ps(t0, t0);                        \
    res = (ggml_float) _mm_cvtss_f32(_mm_hadd_ps(t1, t1));        \
 } while (0)
 // TODO: is this optimal ?
 #define GGML_F32_VEC        GGML_F32x8
 #define GGML_F32_VEC_ZERO   GGML_F32x8_ZERO
 #define GGML_F32_VEC_SET1   GGML_F32x8_SET1
 #define GGML_F32_VEC_LOAD   GGML_F32x8_LOAD
 #define GGML_F32_VEC_STORE  GGML_F32x8_STORE
 #define GGML_F32_VEC_FMA    GGML_F32x8_FMA
 #define GGML_F32_VEC_ADD    GGML_F32x8_ADD
 #define GGML_F32_VEC_MUL    GGML_F32x8_MUL
 #define GGML_F32_VEC_REDUCE GGML_F32x8_REDUCE
 // F16 AVX
 #define GGML_F16_STEP 32
 #define GGML_F16_EPR  8
 // F16 arithmetic is not supported by AVX, so we use F32 instead
 #define GGML_F32Cx8             __m256
 #define GGML_F32Cx8_ZERO        _mm256_setzero_ps()
 #define GGML_F32Cx8_SET1(x)     _mm256_set1_ps(x)
 #if defined(__F16C__)
 // the  _mm256_cvt intrinsics require F16C
 #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x)))
 #define GGML_F32Cx8_STORE(x, y) _mm_storeu_si128((__m128i *)(x), _mm256_cvtps_ph(y, 0))
 #else
 static inline __m256 __avx_f32cx8_load(const ggml_fp16_t * x) {
    float tmp[8];
    for (int i = 0; i < 8; i++) {
        tmp[i] = GGML_FP16_TO_FP32(x[i]);
    }
    return _mm256_loadu_ps(tmp);
 }
 static inline void __avx_f32cx8_store(ggml_fp16_t *x, __m256 y) {
    float arr[8];
    _mm256_storeu_ps(arr, y);
    for (int i = 0; i < 8; i++)
        x[i] = GGML_FP32_TO_FP16(arr[i]);
 }
 #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
 #define GGML_F32Cx8_STORE(x, y) __avx_f32cx8_store(x, y)
 #endif
 #define GGML_F32Cx8_FMA         GGML_F32x8_FMA
 #define GGML_F32Cx8_ADD         _mm256_add_ps
 #define GGML_F32Cx8_MUL         _mm256_mul_ps
 #define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE
 #define GGML_F16_VEC                GGML_F32Cx8
 #define GGML_F16_VEC_ZERO           GGML_F32Cx8_ZERO
 #define GGML_F16_VEC_SET1           GGML_F32Cx8_SET1
 #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
 #define GGML_F16_VEC_STORE(p, r, i) GGML_F32Cx8_STORE(p, r[i])
 #define GGML_F16_VEC_FMA            GGML_F32Cx8_FMA
 #define GGML_F16_VEC_ADD            GGML_F32Cx8_ADD
 #define GGML_F16_VEC_MUL            GGML_F32Cx8_MUL
 #define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE
 #elif defined(__POWER9_VECTOR__)
 #define GGML_SIMD
 // F32 POWER9
 #define GGML_F32_STEP 32
 #define GGML_F32_EPR  4
 #define GGML_F32x4              vector float
 #define GGML_F32x4_ZERO         0.0f
 #define GGML_F32x4_SET1         vec_splats
 #define GGML_F32x4_LOAD(p)      vec_xl(0, p)
 #define GGML_F32x4_STORE(p, r)  vec_xst(r, 0, p)
 #define GGML_F32x4_FMA(a, b, c) vec_madd(b, c, a)
 #define GGML_F32x4_ADD          vec_add
 #define GGML_F32x4_MUL          vec_mul
 #define GGML_F32x4_REDUCE(res, x)              \
 {                                              \
    int offset = GGML_F32_ARR >> 1;            \
    for (int i = 0; i < offset; ++i) {         \
        x[i] = vec_add(x[i], x[offset+i]);     \
    }                                          \
    offset >>= 1;                              \
    for (int i = 0; i < offset; ++i) {         \
        x[i] = vec_add(x[i], x[offset+i]);     \
    }                                          \
    offset >>= 1;                              \
    for (int i = 0; i < offset; ++i) {         \
        x[i] = vec_add(x[i], x[offset+i]);     \
    }                                          \
    res = vec_extract(x[0], 0) +               \
          vec_extract(x[0], 1) +               \
          vec_extract(x[0], 2) +               \
          vec_extract(x[0], 3);                \
 }
 #define GGML_F32_VEC        GGML_F32x4
 #define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
 #define GGML_F32_VEC_SET1   GGML_F32x4_SET1
 #define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
 #define GGML_F32_VEC_STORE  GGML_F32x4_STORE
 #define GGML_F32_VEC_FMA    GGML_F32x4_FMA
 #define GGML_F32_VEC_ADD    GGML_F32x4_ADD
 #define GGML_F32_VEC_MUL    GGML_F32x4_MUL
 #define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE
 // F16 POWER9
 #define GGML_F16_STEP       GGML_F32_STEP
 #define GGML_F16_EPR        GGML_F32_EPR
 #define GGML_F16_VEC        GGML_F32x4
 #define GGML_F16_VEC_ZERO   GGML_F32x4_ZERO
 #define GGML_F16_VEC_SET1   GGML_F32x4_SET1
 #define GGML_F16_VEC_FMA    GGML_F32x4_FMA
 #define GGML_F16_VEC_ADD    GGML_F32x4_ADD
 #define GGML_F16_VEC_MUL    GGML_F32x4_MUL
 #define GGML_F16_VEC_REDUCE GGML_F32x4_REDUCE
 // Use vec_xl, not vec_ld, in case the load address is not aligned.
 #define GGML_F16_VEC_LOAD(p, i) (i & 0x1) ?                   \
  vec_extract_fp32_from_shorth(vec_xl(0, p - GGML_F16_EPR)) : \
  vec_extract_fp32_from_shortl(vec_xl(0, p))
 #define GGML_ENDIAN_BYTE(i) ((unsigned char *)&(uint16_t){1})[i]
 #define GGML_F16_VEC_STORE(p, r, i)                             \
  if (i & 0x1)                                                  \
    vec_xst(vec_pack_to_short_fp32(r[i - GGML_ENDIAN_BYTE(1)],  \
                                   r[i - GGML_ENDIAN_BYTE(0)]), \
            0, p - GGML_F16_EPR)
 #elif defined(__wasm_simd128__)
 #define GGML_SIMD
 // F32 WASM
 #define GGML_F32_STEP 16
 #define GGML_F32_EPR  4
 #define GGML_F32x4              v128_t
 #define GGML_F32x4_ZERO         wasm_f32x4_splat(0.0f)
 #define GGML_F32x4_SET1(x)      wasm_f32x4_splat(x)
 #define GGML_F32x4_LOAD         wasm_v128_load
 #define GGML_F32x4_STORE        wasm_v128_store
 #define GGML_F32x4_FMA(a, b, c) wasm_f32x4_add(wasm_f32x4_mul(b, c), a)
 #define GGML_F32x4_ADD          wasm_f32x4_add
 #define GGML_F32x4_MUL          wasm_f32x4_mul
 #define GGML_F32x4_REDUCE(res, x)                  \
 {                                                  \
    int offset = GGML_F32_ARR >> 1;                \
    for (int i = 0; i < offset; ++i) {             \
        x[i] = wasm_f32x4_add(x[i], x[offset+i]);  \
    }                                              \
    offset >>= 1;                                  \
    for (int i = 0; i < offset; ++i) {             \
        x[i] = wasm_f32x4_add(x[i], x[offset+i]);  \
    }                                              \
    offset >>= 1;                                  \
    for (int i = 0; i < offset; ++i) {             \
        x[i] = wasm_f32x4_add(x[i], x[offset+i]);  \
    }                                              \
    res = wasm_f32x4_extract_lane(x[0], 0) +       \
          wasm_f32x4_extract_lane(x[0], 1) +       \
          wasm_f32x4_extract_lane(x[0], 2) +       \
          wasm_f32x4_extract_lane(x[0], 3);        \
 }
 #define GGML_F32_VEC        GGML_F32x4
 #define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
 #define GGML_F32_VEC_SET1   GGML_F32x4_SET1
 #define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
 #define GGML_F32_VEC_STORE  GGML_F32x4_STORE
 #define GGML_F32_VEC_FMA    GGML_F32x4_FMA
 #define GGML_F32_VEC_ADD    GGML_F32x4_ADD
 #define GGML_F32_VEC_MUL    GGML_F32x4_MUL
 #define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE
 // F16 WASM
 #define GGML_F16_STEP 16
 #define GGML_F16_EPR  4
 inline static v128_t __wasm_f16x4_load(const ggml_fp16_t * p) {
    float tmp[4];
    tmp[0] = GGML_FP16_TO_FP32(p[0]);
    tmp[1] = GGML_FP16_TO_FP32(p[1]);
    tmp[2] = GGML_FP16_TO_FP32(p[2]);
    tmp[3] = GGML_FP16_TO_FP32(p[3]);
    return wasm_v128_load(tmp);
 }
 inline static void __wasm_f16x4_store(ggml_fp16_t * p, v128_t x) {
    float tmp[4];
    wasm_v128_store(tmp, x);
    p[0] = GGML_FP32_TO_FP16(tmp[0]);
    p[1] = GGML_FP32_TO_FP16(tmp[1]);
    p[2] = GGML_FP32_TO_FP16(tmp[2]);
    p[3] = GGML_FP32_TO_FP16(tmp[3]);
 }
 #define GGML_F16x4             v128_t
 #define GGML_F16x4_ZERO        wasm_f32x4_splat(0.0f)
 #define GGML_F16x4_SET1(x)     wasm_f32x4_splat(x)
 #define GGML_F16x4_LOAD(x)     __wasm_f16x4_load(x)
 #define GGML_F16x4_STORE(x, y) __wasm_f16x4_store(x, y)
 #define GGML_F16x4_FMA         GGML_F32x4_FMA
 #define GGML_F16x4_ADD         wasm_f32x4_add
 #define GGML_F16x4_MUL         wasm_f32x4_mul
 #define GGML_F16x4_REDUCE(res, x)                           \
 {                                                           \
    int offset = GGML_F16_ARR >> 1;                         \
    for (int i = 0; i < offset; ++i) {                      \
        x[i] = wasm_f32x4_add(x[i], x[offset+i]);           \
    }                                                       \
    offset >>= 1;                                           \
    for (int i = 0; i < offset; ++i) {                      \
        x[i] = wasm_f32x4_add(x[i], x[offset+i]);           \
    }                                                       \
    offset >>= 1;                                           \
    for (int i = 0; i < offset; ++i) {                      \
        x[i] = wasm_f32x4_add(x[i], x[offset+i]);           \
    }                                                       \
    res = (ggml_float) (wasm_f32x4_extract_lane(x[0], 0) +  \
          wasm_f32x4_extract_lane(x[0], 1) +                \
          wasm_f32x4_extract_lane(x[0], 2) +                \
          wasm_f32x4_extract_lane(x[0], 3));                \
 }
 #define GGML_F16_VEC                GGML_F16x4
 #define GGML_F16_VEC_ZERO           GGML_F16x4_ZERO
 #define GGML_F16_VEC_SET1           GGML_F16x4_SET1
 #define GGML_F16_VEC_LOAD(p, i)     GGML_F16x4_LOAD(p)
 #define GGML_F16_VEC_STORE(p, r, i) GGML_F16x4_STORE(p, r[i])
 #define GGML_F16_VEC_FMA            GGML_F16x4_FMA
 #define GGML_F16_VEC_ADD            GGML_F16x4_ADD
 #define GGML_F16_VEC_MUL            GGML_F16x4_MUL
 #define GGML_F16_VEC_REDUCE         GGML_F16x4_REDUCE
 #elif defined(__SSE3__)
 #define GGML_SIMD
 // F32 SSE
 #define GGML_F32_STEP 32
 #define GGML_F32_EPR  4
 #define GGML_F32x4         __m128
 #define GGML_F32x4_ZERO    _mm_setzero_ps()
 #define GGML_F32x4_SET1(x) _mm_set1_ps(x)
 #define GGML_F32x4_LOAD    _mm_loadu_ps
 #define GGML_F32x4_STORE   _mm_storeu_ps
 #if defined(__FMA__)
    // TODO: Does this work?
    #define GGML_F32x4_FMA(a, b, c) _mm_fmadd_ps(b, c, a)
 #else
    #define GGML_F32x4_FMA(a, b, c) _mm_add_ps(_mm_mul_ps(b, c), a)
 #endif
 #define GGML_F32x4_ADD     _mm_add_ps
 #define GGML_F32x4_MUL     _mm_mul_ps
 #define GGML_F32x4_REDUCE(res, x)                                 \
 {                                                                 \
    int offset = GGML_F32_ARR >> 1;                               \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = _mm_add_ps(x[i], x[offset+i]);                     \
    }                                                             \
    offset >>= 1;                                                 \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = _mm_add_ps(x[i], x[offset+i]);                     \
    }                                                             \
    offset >>= 1;                                                 \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = _mm_add_ps(x[i], x[offset+i]);                     \
    }                                                             \
    const __m128 t0 = _mm_hadd_ps(x[0], x[0]);                    \
    res = (ggml_float) _mm_cvtss_f32(_mm_hadd_ps(t0, t0));        \
 }
 // TODO: is this optimal ?
 #define GGML_F32_VEC        GGML_F32x4
 #define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
 #define GGML_F32_VEC_SET1   GGML_F32x4_SET1
 #define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
 #define GGML_F32_VEC_STORE  GGML_F32x4_STORE
 #define GGML_F32_VEC_FMA    GGML_F32x4_FMA
 #define GGML_F32_VEC_ADD    GGML_F32x4_ADD
 #define GGML_F32_VEC_MUL    GGML_F32x4_MUL
 #define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE
 // F16 SSE
 #define GGML_F16_STEP 32
 #define GGML_F16_EPR  4
 static inline __m128 __sse_f16x4_load(const ggml_fp16_t * x) {
    float tmp[4];
    tmp[0] = GGML_FP16_TO_FP32(x[0]);
    tmp[1] = GGML_FP16_TO_FP32(x[1]);
    tmp[2] = GGML_FP16_TO_FP32(x[2]);
    tmp[3] = GGML_FP16_TO_FP32(x[3]);
    return _mm_loadu_ps(tmp);
 }
 static inline void __sse_f16x4_store(ggml_fp16_t * x, __m128 y) {
    float arr[4];
    _mm_storeu_ps(arr, y);
    x[0] = GGML_FP32_TO_FP16(arr[0]);
    x[1] = GGML_FP32_TO_FP16(arr[1]);
    x[2] = GGML_FP32_TO_FP16(arr[2]);
    x[3] = GGML_FP32_TO_FP16(arr[3]);
 }
 #define GGML_F32Cx4             __m128
 #define GGML_F32Cx4_ZERO        _mm_setzero_ps()
 #define GGML_F32Cx4_SET1(x)     _mm_set1_ps(x)
 #define GGML_F32Cx4_LOAD(x)     __sse_f16x4_load(x)
 #define GGML_F32Cx4_STORE(x, y) __sse_f16x4_store(x, y)
 #define GGML_F32Cx4_FMA         GGML_F32x4_FMA
 #define GGML_F32Cx4_ADD         _mm_add_ps
 #define GGML_F32Cx4_MUL         _mm_mul_ps
 #define GGML_F32Cx4_REDUCE      GGML_F32x4_REDUCE
 #define GGML_F16_VEC                 GGML_F32Cx4
 #define GGML_F16_VEC_ZERO            GGML_F32Cx4_ZERO
 #define GGML_F16_VEC_SET1            GGML_F32Cx4_SET1
 #define GGML_F16_VEC_LOAD(p, i)      GGML_F32Cx4_LOAD(p)
 #define GGML_F16_VEC_STORE(p, r, i)  GGML_F32Cx4_STORE(p, r[i])
 #define GGML_F16_VEC_FMA             GGML_F32Cx4_FMA
 #define GGML_F16_VEC_ADD             GGML_F32Cx4_ADD
 #define GGML_F16_VEC_MUL             GGML_F32Cx4_MUL
 #define GGML_F16_VEC_REDUCE          GGML_F32Cx4_REDUCE
 #elif defined(__loongarch_asx)
 #define GGML_SIMD
 // F32 LASX
 #define GGML_F32_STEP 32
 #define GGML_F32_EPR  8
 #define GGML_F32x8         __m256
 #define GGML_F32x8_ZERO    (__m256)__lasx_xvldi(0)
 #define GGML_F32x8_SET1(x) (__m256)__lasx_xvreplfr2vr_s((x))
 #define GGML_F32x8_LOAD(x) (__m256)__lasx_xvld((x), 0)
 #define GGML_F32x8_STORE(x,y)   __lasx_xvst((y), (x), 0)
 #define GGML_F32x8_FMA(a, b, c) __lasx_xvfmadd_s(b, c, a)
 #define GGML_F32x8_ADD     __lasx_xvfadd_s
 #define GGML_F32x8_MUL     __lasx_xvfmul_s
 #define GGML_F32x8_REDUCE(res, x)                                 \
 do {                                                              \
    int offset = GGML_F32_ARR >> 1;                               \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = __lasx_xvfadd_s(x[i], x[offset+i]);                  \
    }                                                             \
    offset >>= 1;                                                 \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = __lasx_xvfadd_s(x[i], x[offset+i]);                  \
    }                                                             \
    offset >>= 1;                                                 \
    for (int i = 0; i < offset; ++i) {                            \
        x[i] = __lasx_xvfadd_s(x[i], x[offset+i]);                  \
    }                                                             \
    float *tmp_p = (float *)&x[0];                                \
    res = tmp_p[0] + tmp_p[1] + tmp_p[2] + tmp_p[3] +             \
          tmp_p[4] + tmp_p[5] + tmp_p[6] + tmp_p[7];              \
 } while (0)
 // TODO: is this optimal ?
 #define GGML_F32_VEC        GGML_F32x8
 #define GGML_F32_VEC_ZERO   GGML_F32x8_ZERO
 #define GGML_F32_VEC_SET1   GGML_F32x8_SET1
 #define GGML_F32_VEC_LOAD   GGML_F32x8_LOAD
 #define GGML_F32_VEC_STORE  GGML_F32x8_STORE
 #define GGML_F32_VEC_FMA    GGML_F32x8_FMA
 #define GGML_F32_VEC_ADD    GGML_F32x8_ADD
 #define GGML_F32_VEC_MUL    GGML_F32x8_MUL
 #define GGML_F32_VEC_REDUCE GGML_F32x8_REDUCE
 // F16 LASX
 #define GGML_F16_STEP 32
 #define GGML_F16_EPR  8
 // F16 arithmetic is not supported by LASX, so we use F32 instead
 #define GGML_F32Cx8          __m256
 #define GGML_F32Cx8_ZERO    (__m256)__lasx_xvldi(0)
 #define GGML_F32Cx8_SET1(x) (__m256)__lasx_xvreplgr2vr_w((x))
 static inline __m256 __lasx_f32cx8_load(const ggml_fp16_t * x) {
    __m256i a;
    memcpy(&a, x, sizeof(ggml_fp16_t) * 8);
    a = __lasx_xvpermi_d(a, 0 | (1 << 4));
    return __lasx_xvfcvtl_s_h(a);
 }
 static inline void __lasx_f32cx8_store(ggml_fp16_t * x, __m256 y) {
    __m256i a = __lasx_xvfcvt_h_s(y, y);
    a = __lasx_xvpermi_d(a, 0 | (2 << 2));
    memcpy(x, &a, sizeof(ggml_fp16_t) * 8);
 }
 #define GGML_F32Cx8_LOAD(x)     __lasx_f32cx8_load(x)
 #define GGML_F32Cx8_STORE(x, y) __lasx_f32cx8_store(x, y)
 #define GGML_F32Cx8_FMA         GGML_F32x8_FMA
 #define GGML_F32Cx8_ADD         __lasx_xvfadd_s
 #define GGML_F32Cx8_MUL         __lasx_xvfmul_s
 #define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE
 #define GGML_F16_VEC                GGML_F32Cx8
 #define GGML_F16_VEC_ZERO           GGML_F32Cx8_ZERO
 #define GGML_F16_VEC_SET1           GGML_F32Cx8_SET1
 #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
 #define GGML_F16_VEC_STORE(p, r, i) GGML_F32Cx8_STORE(p, r[i])
 #define GGML_F16_VEC_FMA            GGML_F32Cx8_FMA
 #define GGML_F16_VEC_ADD            GGML_F32Cx8_ADD
 #define GGML_F16_VEC_MUL            GGML_F32Cx8_MUL
 #define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE
 #elif defined(__loongarch_sx)
 #define GGML_SIMD
 // F32 LSX
 #define GGML_F32_STEP 32
 #define GGML_F32_EPR  4
 #define GGML_F32x4         __m128
 #define GGML_F32x4_ZERO    __lsx_vldi(0)
 #define GGML_F32x4_SET1(x) __lsx_vinsgr2vr_w(__lsx_vldi(0),(x), 0)
 #define GGML_F32x4_LOAD(x) __lsx_vld((x), 0)
 #define GGML_F32x4_STORE(x, y)  __lsx_vst((y), (x), 0)
 #define GGML_F32x4_FMA(a, b, c) __lsx_vfmadd_s(b, c, a)
 #define GGML_F32x4_ADD     __lsx_vfadd_s
 #define GGML_F32x4_MUL     __lsx_vfmul_s
 #define GGML_F32x4_REDUCE(res, x)                                                     \
 {                                                                                     \
    int offset = GGML_F32_ARR >> 1;                                                   \
    for (int i = 0; i < offset; ++i) {                                                \
        x[i] = __lsx_vfadd_s(x[i], x[offset + i]);                                    \
    }                                                                                 \
    offset >>= 1;                                                                     \
    for (int i = 0; i < offset; ++i) {                                                \
        x[i] = __lsx_vfadd_s(x[i], x[offset + i]);                                    \
    }                                                                                 \
    offset >>= 1;                                                                     \
    for (int i = 0; i < offset; ++i) {                                                \
        x[i] = __lsx_vfadd_s(x[i], x[offset + i]);                                    \
    }                                                                                 \
    __m128i tmp     = __lsx_vsrli_d((__m128i) x[0], 32);                              \
    tmp             = (__m128i) __lsx_vfadd_s((__m128) tmp, x[0]);                    \
    tmp             = __lsx_vpickev_w(__lsx_vldi(0), tmp);                            \
    const __m128 t0 = __lsx_vshuf4i_w(tmp, 0x88);                                     \
    tmp             = __lsx_vsrli_d((__m128i) t0, 32);                                \
    tmp             = (__m128i) __lsx_vfadd_s((__m128) tmp, t0);                      \
    tmp             = __lsx_vpickev_w(__lsx_vldi(0), tmp);                            \
    res             = (ggml_float) __lsx_vpickve2gr_w(__lsx_vshuf4i_w(tmp, 0x88), 0); \
 }
 #define GGML_F32_VEC        GGML_F32x4
 #define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
 #define GGML_F32_VEC_SET1   GGML_F32x4_SET1
 #define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
 #define GGML_F32_VEC_STORE  GGML_F32x4_STORE
 #define GGML_F32_VEC_FMA    GGML_F32x4_FMA
 #define GGML_F32_VEC_ADD    GGML_F32x4_ADD
 #define GGML_F32_VEC_MUL    GGML_F32x4_MUL
 #define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE
 // F16 LSX
 #define GGML_F16_STEP 32
 #define GGML_F16_EPR  4
 static inline __m128 __lsx_f16x4_load(const ggml_fp16_t * x) {
    float tmp[4];
    tmp[0] = GGML_FP16_TO_FP32(x[0]);
    tmp[1] = GGML_FP16_TO_FP32(x[1]);
    tmp[2] = GGML_FP16_TO_FP32(x[2]);
    tmp[3] = GGML_FP16_TO_FP32(x[3]);
    return __lsx_vld(tmp, 0);
 }
 static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
    float arr[4];
    __lsx_vst(y, arr, 0);
    x[0] = GGML_FP32_TO_FP16(arr[0]);
    x[1] = GGML_FP32_TO_FP16(arr[1]);
    x[2] = GGML_FP32_TO_FP16(arr[2]);
    x[3] = GGML_FP32_TO_FP16(arr[3]);
 }
 #define GGML_F32Cx4             __m128
 #define GGML_F32Cx4_ZERO        __lsx_vldi(0)
 #define GGML_F32Cx4_SET1(x)     __lsx_vinsgr2vr_w(__lsx_vldi(0),(x), 0)
 #define GGML_F32Cx4_LOAD(x)     __lsx_f16x4_load(x)
 #define GGML_F32Cx4_STORE(x, y) __lsx_f16x4_store(x, y)
 #define GGML_F32Cx4_FMA         GGML_F32x4_FMA
 #define GGML_F32Cx4_ADD         __lsx_vfadd_s
 #define GGML_F32Cx4_MUL         __lsx_vfmul_s
 #define GGML_F32Cx4_REDUCE      GGML_F32x4_REDUCE
 #define GGML_F16_VEC                 GGML_F32Cx4
 #define GGML_F16_VEC_ZERO            GGML_F32Cx4_ZERO
 #define GGML_F16_VEC_SET1            GGML_F32Cx4_SET1
 #define GGML_F16_VEC_LOAD(p, i)      GGML_F32Cx4_LOAD(p)
 #define GGML_F16_VEC_STORE(p, r, i)  GGML_F32Cx4_STORE(p, r[i])
 #define GGML_F16_VEC_FMA             GGML_F32Cx4_FMA
 #define GGML_F16_VEC_ADD             GGML_F32Cx4_ADD
 #define GGML_F16_VEC_MUL             GGML_F32Cx4_MUL
 #define GGML_F16_VEC_REDUCE          GGML_F32Cx4_REDUCE
 #elif defined(__VXE__) || defined(__VXE2__)
 #define GGML_SIMD
 // F32 s390x
 #define GGML_F32_STEP 32
 #define GGML_F32_EPR  4
 #define GGML_F32x4              __vector float
 #define GGML_F32x4_ZERO         vec_splats(0.0f)
 #define GGML_F32x4_SET1         vec_splats
 #define GGML_F32x4_LOAD(p)      vec_xl(0, p)
 #define GGML_F32x4_STORE(p, r)  vec_xst(r, 0, p)
 #define GGML_F32x4_FMA(a, b, c) vec_madd(b, c, a)
 #define GGML_F32x4_ADD          vec_add
 #define GGML_F32x4_MUL          vec_mul
 #define GGML_F32x4_REDUCE(res, x)                   \
 {                                                   \
    int offset = GGML_F32_ARR >> 1;                 \
    for (int i = 0; i < offset; ++i) {              \
        x[i] = vec_add(x[i], x[offset + i]);        \
    }                                               \
    offset >>= 1;                                   \
    for (int i = 0; i < offset; ++i) {              \
        x[i] = vec_add(x[i], x[offset + i]);        \
    }                                               \
    offset >>= 1;                                   \
    for (int i = 0; i < offset; ++i) {              \
        x[i] = vec_add(x[i], x[offset + i]);        \
    }                                               \
    res = vec_extract(x[0], 0) +                    \
          vec_extract(x[0], 1) +                    \
          vec_extract(x[0], 2) +                    \
          vec_extract(x[0], 3);                     \
 }
 #define GGML_F32_VEC        GGML_F32x4
 #define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
 #define GGML_F32_VEC_SET1   GGML_F32x4_SET1
 #define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
 #define GGML_F32_VEC_STORE  GGML_F32x4_STORE
 #define GGML_F32_VEC_FMA    GGML_F32x4_FMA
 #define GGML_F32_VEC_ADD    GGML_F32x4_ADD
 #define GGML_F32_VEC_MUL    GGML_F32x4_MUL
 #define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE
 // F16 s390x
 #define GGML_F16_STEP GGML_F32_STEP
 #define GGML_F16_EPR  GGML_F32_EPR
 static inline __vector float __lzs_f16cx4_load(const ggml_fp16_t * x) {
    float tmp[4];
    for (int i = 0; i < 4; i++) {
        tmp[i] = GGML_FP16_TO_FP32(x[i]);
    }
    return vec_xl(0, tmp);
 }
 static inline void __lzs_f16cx4_store(ggml_fp16_t * x, __vector float y) {
    float arr[4];
    vec_xst(y, 0, arr);
    for (int i = 0; i < 4; i++) {
        x[i] = GGML_FP32_TO_FP16(arr[i]);
    }
 }
 #define GGML_F16_VEC                GGML_F32x4
 #define GGML_F16_VEC_ZERO           GGML_F32x4_ZERO
 #define GGML_F16_VEC_SET1           GGML_F32x4_SET1
 #define GGML_F16_VEC_LOAD(p, i)     __lzs_f16cx4_load(p)
 #define GGML_F16_VEC_STORE(p, r, i) __lzs_f16cx4_store(p, r[i])
 #define GGML_F16_VEC_FMA            GGML_F32x4_FMA
 #define GGML_F16_VEC_ADD            GGML_F32x4_ADD
 #define GGML_F16_VEC_MUL            GGML_F32x4_MUL
 #define GGML_F16_VEC_REDUCE         GGML_F32x4_REDUCE
 #endif
 // GGML_F32_ARR / GGML_F16_ARR
 //   number of registers to use per step
 #ifdef GGML_SIMD
 #define GGML_F32_ARR (GGML_F32_STEP/GGML_F32_EPR)
 #define GGML_F16_ARR (GGML_F16_STEP/GGML_F16_EPR)
 #endif
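The relationship between the STEP, EPR and ARR constants above and the three halving passes in the REDUCE macros can be sketched in plain C. This is a scalar model under stated assumptions, not the ggml implementation: `STEP`, `EPR`, `ARR`, `vec_t` and `dot_sketch` are hypothetical names introduced only for illustration, and each "register" is modelled as a small float array instead of a real SIMD type.

```c
#include <assert.h>

/* Hypothetical stand-ins for GGML_F32_STEP / GGML_F32_EPR / GGML_F32_ARR. */
#define STEP 32                /* elements consumed per outer iteration (power of two) */
#define EPR  8                 /* elements per "register"                              */
#define ARR  (STEP / EPR)      /* independent accumulators in flight: 4                */

/* Scalar model of one SIMD register (assumption: not a real vector type). */
typedef struct { float lane[EPR]; } vec_t;

float dot_sketch(int n, const float *x, const float *y) {
    assert(n >= 0);

    vec_t sum[ARR] = {{{0}}};
    const int np = n & ~(STEP - 1);    /* largest multiple of STEP that is <= n */

    /* blocked main loop: ARR accumulators, EPR lanes each */
    for (int i = 0; i < np; i += STEP) {
        for (int j = 0; j < ARR; j++) {
            for (int k = 0; k < EPR; k++) {
                sum[j].lane[k] += x[i + j*EPR + k] * y[i + j*EPR + k];
            }
        }
    }

    /* pairwise tree reduction: each pass halves the number of live accumulators */
    for (int offset = ARR >> 1; offset > 0; offset >>= 1) {
        for (int j = 0; j < offset; j++) {
            for (int k = 0; k < EPR; k++) {
                sum[j].lane[k] += sum[offset + j].lane[k];
            }
        }
    }

    /* horizontal add of the single remaining accumulator */
    float res = 0.0f;
    for (int k = 0; k < EPR; k++) {
        res += sum[0].lane[k];
    }

    /* leftovers that did not fill a whole STEP */
    for (int i = np; i < n; i++) {
        res += x[i] * y[i];
    }
    return res;
}
```

With 70 elements of 1.0f and 2.0f, 64 elements go through the blocked loop and 6 through the leftover loop, so the sketch returns 140. The macros in the patch unroll the reduction into three fixed passes instead of a loop; when ARR is 4 the third pass has offset 0 and does nothing.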
--- a/ggml/src/ggml-cpu/vec.cpp
+++ b/ggml/src/ggml-cpu/vec.cpp
@@ -0,0 +1,258 @@
 #include "vec.h"
 #include <cassert>
 #if defined(_MSC_VER)
 // disable "possible loss of data" to avoid hundreds of casts
 // we should just be careful :)
 #pragma warning(disable: 4244 4267)
 #endif
 // precomputed gelu table for f16 (128 KB)
 ggml_fp16_t ggml_table_gelu_f16[1 << 16];
 // precomputed quick gelu table for f16 (128 KB)
 ggml_fp16_t ggml_table_gelu_quick_f16[1 << 16];
 void ggml_vec_dot_f32(int n, float * GGML_RESTRICT s, size_t bs, const float * GGML_RESTRICT x, size_t bx, const float * GGML_RESTRICT y, size_t by, int nrc) {
    assert(nrc == 1);
    GGML_UNUSED(nrc);
    GGML_UNUSED(bx);
    GGML_UNUSED(by);
    GGML_UNUSED(bs);
 #if defined(GGML_SIMD)
    float sumf = 0.0f;
    const int np = (n & ~(GGML_F32_STEP - 1));
    GGML_F32_VEC sum[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };
    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];
    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ax[j] = GGML_F32_VEC_LOAD(x + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
            sum[j] = GGML_F32_VEC_FMA(sum[j], ax[j], ay[j]);
        }
    }
    // reduce the GGML_F32_ARR partial sums in sum[] into the scalar sumf
    GGML_F32_VEC_REDUCE(sumf, sum);
    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }
 #else
    // scalar
    ggml_float sumf = 0.0;
    for (int i = 0; i < n; ++i) {
        sumf += (ggml_float)(x[i]*y[i]);
    }
 #endif
    *s = sumf;
 }
 void ggml_vec_dot_bf16(int n, float * GGML_RESTRICT s, size_t bs, ggml_bf16_t * GGML_RESTRICT x, size_t bx, ggml_bf16_t * GGML_RESTRICT y, size_t by, int nrc) {
    assert(nrc == 1);
    GGML_UNUSED(nrc);
    GGML_UNUSED(bx);
    GGML_UNUSED(by);
    GGML_UNUSED(bs);
    int i = 0;
    ggml_float sumf = 0;
 #if defined(__AVX512BF16__)
    __m512 c1 = _mm512_setzero_ps();
    __m512 c2 = _mm512_setzero_ps();
    for (; i + 64 <= n; i += 64) {
        c1 = _mm512_dpbf16_ps(c1, m512bh(_mm512_loadu_si512((x + i))),
                             m512bh(_mm512_loadu_si512((y + i))));
        c2 = _mm512_dpbf16_ps(c2, m512bh(_mm512_loadu_si512((x + i + 32))),
                             m512bh(_mm512_loadu_si512((y + i + 32))));
    }
    sumf += (ggml_float)_mm512_reduce_add_ps(c1);
    sumf += (ggml_float)_mm512_reduce_add_ps(c2);
 #elif defined(__AVX512F__)
 #define LOAD(p) _mm512_castsi512_ps(_mm512_slli_epi32(_mm512_cvtepu16_epi32(_mm256_loadu_si256((const __m256i *)(p))), 16))
    __m512 c1 = _mm512_setzero_ps();
    __m512 c2 = _mm512_setzero_ps();
    for (; i + 32 <= n; i += 32) {
        c1 = _mm512_add_ps(_mm512_mul_ps(LOAD(x + i), LOAD(y + i)), c1);
        c2 = _mm512_add_ps(_mm512_mul_ps(LOAD(x + i + 16), LOAD(y + i + 16)), c2);
    }
    sumf += (ggml_float)_mm512_reduce_add_ps(c1);
    sumf += (ggml_float)_mm512_reduce_add_ps(c2);
 #undef LOAD
 #elif defined(__AVX2__) || defined(__AVX__)
 #if defined(__AVX2__)
 #define LOAD(p) _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_cvtepu16_epi32(_mm_loadu_si128((const __m128i *)(p))), 16))
 #else
 #define LOAD(p) _mm256_castsi256_ps(_mm256_insertf128_si256(_mm256_castsi128_si256(_mm_slli_epi32(_mm_cvtepu16_epi32(_mm_loadu_si128((const __m128i *)(p))), 16)), (_mm_slli_epi32(_mm_cvtepu16_epi32(_mm_bsrli_si128(_mm_loadu_si128((const __m128i *)(p)), 8)), 16)), 1))
 #endif
    __m256 c1 = _mm256_setzero_ps();
    __m256 c2 = _mm256_setzero_ps();
    __m256 c3 = _mm256_setzero_ps();
    __m256 c4 = _mm256_setzero_ps();
    for (; i + 32 <= n; i += 32) {
        c1 = _mm256_add_ps(_mm256_mul_ps(LOAD(x + i), LOAD(y + i)), c1);
        c2 = _mm256_add_ps(_mm256_mul_ps(LOAD(x + i + 8), LOAD(y + i + 8)), c2);
        c3 = _mm256_add_ps(_mm256_mul_ps(LOAD(x + i + 16), LOAD(y + i + 16)), c3);
        c4 = _mm256_add_ps(_mm256_mul_ps(LOAD(x + i + 24), LOAD(y + i + 24)), c4);
    }
    __m128 g;
    c1 = _mm256_add_ps(_mm256_add_ps(c1, c3),
                       _mm256_add_ps(c2, c4));
    g = _mm_add_ps(_mm256_extractf128_ps(c1, 1),
                   _mm256_castps256_ps128(c1));
    g = _mm_add_ps(g, _mm_movehl_ps(g, g));
    g = _mm_add_ss(g, _mm_movehdup_ps(g));
    sumf += (ggml_float)_mm_cvtss_f32(g);
 #undef LOAD
 #endif
    for (; i < n; ++i) {
        sumf += (ggml_float)(GGML_BF16_TO_FP32(x[i]) *
                             GGML_BF16_TO_FP32(y[i]));
    }
    *s = sumf;
 }
 void ggml_vec_dot_f16(int n, float * GGML_RESTRICT s, size_t bs, ggml_fp16_t * GGML_RESTRICT x, size_t bx, ggml_fp16_t * GGML_RESTRICT y, size_t by, int nrc) {
    assert(nrc == 1);
    GGML_UNUSED(nrc);
    GGML_UNUSED(bx);
    GGML_UNUSED(by);
    GGML_UNUSED(bs);
    ggml_float sumf = 0.0;
 #if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F16_STEP - 1));
    GGML_F16_VEC sum[GGML_F16_ARR] = { GGML_F16_VEC_ZERO };
    GGML_F16_VEC ax[GGML_F16_ARR];
    GGML_F16_VEC ay[GGML_F16_ARR];
    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
            ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
            sum[j] = GGML_F16_VEC_FMA(sum[j], ax[j], ay[j]);
        }
    }
    // reduce sum0..sum3 to sum0
    GGML_F16_VEC_REDUCE(sumf, sum);
    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += (ggml_float)(GGML_FP16_TO_FP32(x[i])*GGML_FP16_TO_FP32(y[i]));
    }
 #else
    for (int i = 0; i < n; ++i) {
        sumf += (ggml_float)(GGML_FP16_TO_FP32(x[i])*GGML_FP16_TO_FP32(y[i]));
    }
 #endif
    *s = sumf;
 }
 void ggml_vec_silu_f32(const int n, float * y, const float * x) {
    int i = 0;
 #if defined(__AVX512F__) && defined(__AVX512DQ__)
    for (; i + 15 < n; i += 16) {
        _mm512_storeu_ps(y + i, ggml_v_silu(_mm512_loadu_ps(x + i)));
    }
 #elif defined(__AVX2__) && defined(__FMA__)
    for (; i + 7 < n; i += 8) {
        _mm256_storeu_ps(y + i, ggml_v_silu(_mm256_loadu_ps(x + i)));
    }
 #elif defined(__SSE2__)
    for (; i + 3 < n; i += 4) {
        _mm_storeu_ps(y + i, ggml_v_silu(_mm_loadu_ps(x + i)));
    }
 #elif defined(__ARM_NEON) && defined(__aarch64__)
    for (; i + 3 < n; i += 4) {
        vst1q_f32(y + i, ggml_v_silu(vld1q_f32(x + i)));
    }
 #endif
    for (; i < n; ++i) {
        y[i] = ggml_silu_f32(x[i]);
    }
 }
 ggml_float ggml_vec_soft_max_f32(const int n, float * y, const float * x, float max) {
    int i = 0;
    ggml_float sum = 0;
 #if defined(__AVX512F__) && defined(__AVX512DQ__)
    for (; i + 15 < n; i += 16) {
        __m512 val = ggml_v_expf(_mm512_sub_ps(_mm512_loadu_ps(x + i),
                                               _mm512_set1_ps(max)));
        _mm512_storeu_ps(y + i, val);
        sum += (ggml_float)_mm512_reduce_add_ps(val);
    }
 #elif defined(__AVX2__) && defined(__FMA__)
    for (; i + 7 < n; i += 8) {
        __m256 val = ggml_v_expf(_mm256_sub_ps(_mm256_loadu_ps(x + i),
                                               _mm256_set1_ps(max)));
        _mm256_storeu_ps(y + i, val);
        __m128 val2 = _mm_add_ps(_mm256_extractf128_ps(val, 1),
                                 _mm256_castps256_ps128(val));
        val2 = _mm_add_ps(val2, _mm_movehl_ps(val2, val2));
        val2 = _mm_add_ss(val2, _mm_movehdup_ps(val2));
        sum += (ggml_float)_mm_cvtss_f32(val2);
    }
 #elif defined(__SSE2__)
    for (; i + 3 < n; i += 4) {
        __m128 val = ggml_v_expf(_mm_sub_ps(_mm_loadu_ps(x + i),
                                            _mm_set1_ps(max)));
        _mm_storeu_ps(y + i, val);
 #if defined(__AVX__) || defined(__AVX2__) || defined(__AVX512F__)
        val = _mm_add_ps(val, _mm_movehl_ps(val, val));
        val = _mm_add_ss(val, _mm_movehdup_ps(val));
 #else
        __m128 tmp = _mm_shuffle_ps(val, val, _MM_SHUFFLE(2, 3, 0, 1));
        val = _mm_add_ps(val, tmp);
        tmp = _mm_movehl_ps(tmp, val);
        val = _mm_add_ss(val, tmp);
 #endif
        sum += (ggml_float)_mm_cvtss_f32(val);
    }
 #elif defined(__ARM_NEON) && defined(__aarch64__)
    for (; i + 3 < n; i += 4) {
        float32x4_t val = ggml_v_expf(vsubq_f32(vld1q_f32(x + i),
                                                vdupq_n_f32(max)));
        vst1q_f32(y + i, val);
        sum += (ggml_float)vaddvq_f32(val);
    }
 #endif
    for (; i < n; ++i) {
        float val = expf(x[i] - max);
        sum += (ggml_float)val;
        y[i] = val;
    }
    return sum;
 }
 ggml_float ggml_vec_log_soft_max_f32(const int n, float * y, const float * x, float max) {
    // log(soft_max) = log(soft_max_i / soft_max_sum) = log(soft_max_i) - log(soft_max_sum) = (logit_i - max) - log(soft_max_sum)
    int i = 0;
    ggml_float sum = 0;
    for (; i < n; ++i) {
        float val = x[i] - max;
        y[i] = val;
        sum += (ggml_float)expf(val);
    }
    return (ggml_float)logf(sum);
 }
--- a/ggml/src/ggml-cpu/vec.h
+++ b/ggml/src/ggml-cpu/vec.h
@@ -0,0 +1,802 @@
 // Vectorized functions for fundamental operations
 #pragma once
 #include "ggml-impl.h"
 #include "simd-mappings.h"
 #include "ggml.h"
 #if defined(GGML_USE_ACCELERATE)
 #include <Accelerate/Accelerate.h>
 #endif
 // floating point type used to accumulate sums
 typedef double ggml_float;
 #define GGML_GELU_FP16
 #define GGML_GELU_QUICK_FP16
 #define GGML_SOFT_MAX_UNROLL 4
 #define GGML_VEC_DOT_UNROLL  2
 #define GGML_VEC_MAD_UNROLL  32
 #ifdef __cplusplus
 extern "C" {
 #endif
 //
 // global data
 //
 // precomputed gelu table for f16 (128 KB)
 extern ggml_fp16_t ggml_table_gelu_f16[1 << 16];
 // precomputed quick gelu table for f16 (128 KB)
 extern ggml_fp16_t ggml_table_gelu_quick_f16[1 << 16];
 //
 // fundamental operations
 //
 void ggml_vec_dot_f32(int n, float * GGML_RESTRICT s, size_t bs, const float * GGML_RESTRICT x, size_t bx, const float * GGML_RESTRICT y, size_t by, int nrc);
 void ggml_vec_dot_bf16(int n, float * GGML_RESTRICT s, size_t bs, ggml_bf16_t * GGML_RESTRICT x, size_t bx, ggml_bf16_t * GGML_RESTRICT y, size_t by, int nrc);
 void ggml_vec_dot_f16(int n, float * GGML_RESTRICT s, size_t bs, ggml_fp16_t * GGML_RESTRICT x, size_t bx, ggml_fp16_t * GGML_RESTRICT y, size_t by, int nrc);
 void ggml_vec_silu_f32(const int n, float * y, const float * x);
 ggml_float ggml_vec_soft_max_f32(const int n, float * y, const float * x, float max);
 ggml_float ggml_vec_log_soft_max_f32(const int n, float * y, const float * x, float max);
 inline static void ggml_vec_set_i8(const int n, int8_t * x, const int8_t v) { for (int i = 0; i < n; ++i) x[i] = v; }
 inline static void ggml_vec_set_i16(const int n, int16_t * x, const int16_t v) { for (int i = 0; i < n; ++i) x[i] = v; }
 inline static void ggml_vec_set_i32(const int n, int32_t * x, const int32_t   v) { for (int i = 0; i < n; ++i) x[i] = v;    }
 inline static void ggml_vec_cpy_i32(const int n, int32_t * y, const int32_t * x) { for (int i = 0; i < n; ++i) y[i] = x[i]; }
 inline static void ggml_vec_set_f16(const int n, ggml_fp16_t * x, const ggml_fp16_t v) { for (int i = 0; i < n; ++i) x[i] = v; }
 inline static void ggml_vec_set_bf16(const int n, ggml_bf16_t * x, const ggml_bf16_t v) { for (int i = 0; i < n; ++i) x[i] = v; }
 inline static void ggml_vec_add_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i]  = x[i] + y[i]; }
 inline static void ggml_vec_add_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
    for (int i = 0; i < n; ++i) {
        z[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(x[i]) + GGML_FP16_TO_FP32(y[i]));
    }
 }
 inline static void ggml_vec_add1_f32(const int n, float * z, const float * x, const float   v) { for (int i = 0; i < n; ++i) z[i]  = x[i] + v;    }
 inline static void ggml_vec_acc_f32 (const int n, float * y, const float * x)                  { for (int i = 0; i < n; ++i) y[i] += x[i];        }
 inline static void ggml_vec_acc1_f32(const int n, float * y, const float   v)                  { for (int i = 0; i < n; ++i) y[i] += v;           }
 inline static void ggml_vec_sub_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i]  = x[i] - y[i]; }
 inline static void ggml_vec_sub_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
    for (int i = 0; i < n; ++i) {
        z[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(x[i]) - GGML_FP16_TO_FP32(y[i]));
    }
 }
 inline static void ggml_vec_set_f32 (const int n, float * x, const float   v)                  { for (int i = 0; i < n; ++i) x[i]  = v;           }
 inline static void ggml_vec_cpy_f32 (const int n, float * y, const float * x)                  { for (int i = 0; i < n; ++i) y[i]  = x[i];        }
 inline static void ggml_vec_neg_f32 (const int n, float * y, const float * x)                  { for (int i = 0; i < n; ++i) y[i]  = -x[i];       }
 inline static void ggml_vec_neg_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(-GGML_FP16_TO_FP32(x[i]));
    }
 }
 inline static void ggml_vec_mul_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i]  = x[i]*y[i];   }
 inline static void ggml_vec_mul_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
    for (int i = 0; i < n; ++i) {
        z[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(x[i]) * GGML_FP16_TO_FP32(y[i]));
    }
 }
 inline static void ggml_vec_div_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i]  = x[i]/y[i];   }
 inline static void ggml_vec_div_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
    for (int i = 0; i < n; ++i) {
        z[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(x[i]) / GGML_FP16_TO_FP32(y[i]));
    }
 }
 // compute GGML_VEC_DOT_UNROLL dot products at once
 // xs - x row stride in bytes
 inline static void ggml_vec_dot_f16_unroll(const int n, const int xs, float * GGML_RESTRICT s, void * GGML_RESTRICT xv, ggml_fp16_t * GGML_RESTRICT y) {
    ggml_float sumf[GGML_VEC_DOT_UNROLL] = { 0.0 };
    ggml_fp16_t * GGML_RESTRICT x[GGML_VEC_DOT_UNROLL];
    for (int i = 0; i < GGML_VEC_DOT_UNROLL; ++i) {
        x[i] = (ggml_fp16_t *) ((char *) xv + i*xs);
    }
 #if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F16_STEP - 1));
    GGML_F16_VEC sum[GGML_VEC_DOT_UNROLL][GGML_F16_ARR] = { { GGML_F16_VEC_ZERO } };
    GGML_F16_VEC ax[GGML_F16_ARR];
    GGML_F16_VEC ay[GGML_F16_ARR];
    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
            for (int k = 0; k < GGML_VEC_DOT_UNROLL; ++k) {
                ax[j] = GGML_F16_VEC_LOAD(x[k] + i + j*GGML_F16_EPR, j);
                sum[k][j] = GGML_F16_VEC_FMA(sum[k][j], ax[j], ay[j]);
            }
        }
    }
    // reduce the vector accumulators of each row k into the scalar sumf[k]
    for (int k = 0; k < GGML_VEC_DOT_UNROLL; ++k) {
        GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
    }
    // leftovers
    for (int i = np; i < n; ++i) {
        for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
            sumf[j] += (ggml_float)(GGML_FP16_TO_FP32(x[j][i])*GGML_FP16_TO_FP32(y[i]));
        }
    }
 #else
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
            sumf[j] += (ggml_float)(GGML_FP16_TO_FP32(x[j][i])*GGML_FP16_TO_FP32(y[i]));
        }
    }
 #endif
    for (int i = 0; i < GGML_VEC_DOT_UNROLL; ++i) {
        s[i] = (float)sumf[i];
    }
 }
 inline static void ggml_vec_mad_f32(const int n, float * GGML_RESTRICT y, const float * GGML_RESTRICT x, const float v) {
 #if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F32_STEP - 1));
    GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);
    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];
    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ax[j] = GGML_F32_VEC_LOAD(x + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_FMA(ay[j], ax[j], vx);
            GGML_F32_VEC_STORE(y + i + j*GGML_F32_EPR, ay[j]);
        }
    }
    // leftovers
    for (int i = np; i < n; ++i) {
        y[i] += x[i]*v;
    }
 #else
    // scalar
    for (int i = 0; i < n; ++i) {
        y[i] += x[i]*v;
    }
 #endif
 }
 inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * GGML_RESTRICT y, const ggml_fp16_t * GGML_RESTRICT x, const float v) {
 #if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F16_STEP - 1));
    GGML_F16_VEC vx = GGML_F16_VEC_SET1(v);
    GGML_F16_VEC ax[GGML_F16_ARR];
    GGML_F16_VEC ay[GGML_F16_ARR];
    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
            ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
            ay[j] = GGML_F16_VEC_FMA(ay[j], ax[j], vx);
            GGML_F16_VEC_STORE(y + i + j*GGML_F16_EPR, ay, j);
        }
    }
    // leftovers
    for (int i = np; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(y[i]) + GGML_FP16_TO_FP32(x[i])*v);
    }
 #else
    // scalar
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(y[i]) + GGML_FP16_TO_FP32(x[i])*v);
    }
 #endif
 }
 // xs and vs are byte strides of x and v
 inline static void ggml_vec_mad_f32_unroll(const int n, const int xs, const int vs, float * GGML_RESTRICT y, const float * GGML_RESTRICT xv, const float * GGML_RESTRICT vv) {
    const float * GGML_RESTRICT x[GGML_VEC_MAD_UNROLL];
    const float * GGML_RESTRICT v[GGML_VEC_MAD_UNROLL];
    for (int i = 0; i < GGML_VEC_MAD_UNROLL; ++i) {
        x[i] = (const float *) ((const char *) xv + i*xs);
        v[i] = (const float *) ((const char *) vv + i*vs);
    }
 #if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F32_STEP - 1));
    GGML_F32_VEC vx[GGML_VEC_MAD_UNROLL];
    for (int k = 0; k < GGML_VEC_MAD_UNROLL; ++k) {
        vx[k] = GGML_F32_VEC_SET1(v[k][0]);
    }
    GGML_F32_VEC ax[GGML_VEC_MAD_UNROLL][GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];
    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
            for (int k = 0; k < GGML_VEC_MAD_UNROLL; ++k) {
                ax[k][j] = GGML_F32_VEC_LOAD(x[k] + i + j*GGML_F32_EPR);
                ay[j] = GGML_F32_VEC_FMA(ay[j], ax[k][j], vx[k]);
            }
            GGML_F32_VEC_STORE(y + i + j*GGML_F32_EPR, ay[j]);
        }
    }
    // leftovers
    for (int k = 0; k < GGML_VEC_MAD_UNROLL; ++k) {
        for (int i = np; i < n; ++i) {
            y[i] += x[k][i]*v[k][0];
        }
    }
 #else
    // scalar
    for (int k = 0; k < GGML_VEC_MAD_UNROLL; ++k) {
        for (int i = 0; i < n; ++i) {
            y[i] += x[k][i]*v[k][0];
        }
    }
 #endif
 }
 //inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) { for (int i = 0; i < n; ++i) y[i] *= v;          }
 inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) {
 #if defined(GGML_USE_ACCELERATE)
    vDSP_vsmul(y, 1, &v, y, 1, n);
 #elif defined(GGML_SIMD)
    const int np = (n & ~(GGML_F32_STEP - 1));
    GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);
    GGML_F32_VEC ay[GGML_F32_ARR];
    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_MUL(ay[j], vx);
            GGML_F32_VEC_STORE(y + i + j*GGML_F32_EPR, ay[j]);
        }
    }
    // leftovers
    for (int i = np; i < n; ++i) {
        y[i] *= v;
    }
 #else
    // scalar
    for (int i = 0; i < n; ++i) {
        y[i] *= v;
    }
 #endif
 }
 inline static void ggml_vec_scale_f16(const int n, ggml_fp16_t * y, const float v) {
 #if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F16_STEP - 1));
    GGML_F16_VEC vx = GGML_F16_VEC_SET1(v);
    GGML_F16_VEC ay[GGML_F16_ARR];
    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
            ay[j] = GGML_F16_VEC_MUL(ay[j], vx);
            GGML_F16_VEC_STORE(y + i + j*GGML_F16_EPR, ay, j);
        }
    }
    // leftovers
    for (int i = np; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(y[i])*v);
    }
 #else
    // scalar
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(y[i])*v);
    }
 #endif
 }
 inline static void ggml_vec_norm_f32 (const int n, float * s, const float * x) { ggml_vec_dot_f32(n, s, 0, x, 0, x, 0, 1); *s = sqrtf(*s);   }
 inline static void ggml_vec_sqr_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i];   }
 inline static void ggml_vec_sqr_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        float v = GGML_FP16_TO_FP32(x[i]);
        y[i] = GGML_FP32_TO_FP16(v*v);
    }
 }
 inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrtf(x[i]); }
 inline static void ggml_vec_sqrt_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(sqrtf(GGML_FP16_TO_FP32(x[i])));
    }
 }
 inline static void ggml_vec_log_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = logf(x[i]);  }
 inline static void ggml_vec_log_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(logf(GGML_FP16_TO_FP32(x[i])));
    }
 }
 inline static void ggml_vec_sin_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sinf(x[i]);  }
 inline static void ggml_vec_sin_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(sinf(GGML_FP16_TO_FP32(x[i])));
    }
 }
 inline static void ggml_vec_cos_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = cosf(x[i]);  }
 inline static void ggml_vec_cos_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(cosf(GGML_FP16_TO_FP32(x[i])));
    }
 }
 inline static void ggml_vec_abs_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fabsf(x[i]); }
 inline static void ggml_vec_abs_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(fabsf(GGML_FP16_TO_FP32(x[i])));
    }
 }
 inline static void ggml_vec_sgn_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : ((x[i] < 0.f) ? -1.f : 0.f); }
 inline static void ggml_vec_sgn_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        float v = GGML_FP16_TO_FP32(x[i]);
        y[i] = GGML_FP32_TO_FP16((v > 0.f) ? 1.f : ((v < 0.f) ? -1.f : 0.f));
    }
 }
 inline static void ggml_vec_step_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : 0.f; }
 inline static void ggml_vec_step_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16((GGML_FP16_TO_FP32(x[i]) > 0.f) ? 1.f : 0.f);
    }
 }
 inline static void ggml_vec_tanh_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = tanhf(x[i]);  }
 inline static void ggml_vec_tanh_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(tanhf(GGML_FP16_TO_FP32(x[i])));
    }
 }
 inline static void ggml_vec_elu_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : expm1f(x[i]); }
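 inline static void ggml_vec_elu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        // match the f32 variant: identity for positive inputs, expm1 otherwise
        float v = GGML_FP16_TO_FP32(x[i]);
        y[i] = GGML_FP32_TO_FP16((v > 0.f) ? v : expm1f(v));
    }
 }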
 inline static void ggml_vec_relu_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : 0.f; }
 inline static void ggml_vec_relu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        float v = GGML_FP16_TO_FP32(x[i]);
        y[i] = GGML_FP32_TO_FP16((v > 0.f) ? v : 0.f);
    }
 }
 inline static void ggml_vec_leaky_relu_f32 (const int n, float * y, const float * x, const float ns) { for (int i = 0; i < n; ++i) y[i] = ((x[i] > 0.f) ? x[i] : 0.f) + ns * ((x[i] < 0.0f) ? x[i] : 0.f); }
 inline static void ggml_vec_leaky_relu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x, const float ns) {
    for (int i = 0; i < n; ++i) {
        float v = GGML_FP16_TO_FP32(x[i]);
        y[i] = GGML_FP32_TO_FP16(((v > 0.f) ? v : 0.f) + ns * ((v < 0.0f) ? v : 0.f));
    }
 }
 inline static void ggml_vec_sigmoid_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = 1.f / (1.f + expf(-x[i])); }
 inline static void ggml_vec_sigmoid_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(1.f / (1.f + expf(-GGML_FP16_TO_FP32(x[i]))));
    }
 }
 // TODO: optimize performance
 inline static void ggml_vec_hardswish_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i] * fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f)); }
 inline static void ggml_vec_hardswish_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        float v = GGML_FP16_TO_FP32(x[i]);
        y[i] = GGML_FP32_TO_FP16(v * fminf(1.0f, fmaxf(0.0f, (v + 3.0f) / 6.0f)));
    }
 }
 inline static void ggml_vec_hardsigmoid_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f)); }
 inline static void ggml_vec_hardsigmoid_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(fminf(1.0f, fmaxf(0.0f, (GGML_FP16_TO_FP32(x[i]) + 3.0f) / 6.0f)));
    }
 }
 inline static void ggml_vec_exp_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = expf(x[i]); }
 inline static void ggml_vec_exp_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(expf(GGML_FP16_TO_FP32(x[i])));
    }
 }
 static const float GELU_COEF_A     = 0.044715f;
 static const float GELU_QUICK_COEF = -1.702f;
 static const float SQRT_2_OVER_PI  = 0.79788456080286535587989211986876f;
 inline static float ggml_gelu_f32(float x) {
    return 0.5f*x*(1.0f + tanhf(SQRT_2_OVER_PI*x*(1.0f + GELU_COEF_A*x*x)));
 }
 inline static void ggml_vec_gelu_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    const uint16_t * i16 = (const uint16_t *) x;
    for (int i = 0; i < n; ++i) {
        y[i] = ggml_table_gelu_f16[i16[i]];
    }
 }
 #ifdef GGML_GELU_FP16
 inline static void ggml_vec_gelu_f32(const int n, float * y, const float * x) {
    uint16_t t;
    for (int i = 0; i < n; ++i) {
        if (x[i] <= -10.0f) {
            y[i] = 0.0f;
        } else if (x[i] >= 10.0f) {
            y[i] = x[i];
        } else {
            ggml_fp16_t fp16 = GGML_FP32_TO_FP16(x[i]);
            memcpy(&t, &fp16, sizeof(uint16_t));
            y[i] = GGML_FP16_TO_FP32(ggml_table_gelu_f16[t]);
        }
    }
 }
 #else
 inline static void ggml_vec_gelu_f32(const int n, float * y, const float * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = ggml_gelu_f32(x[i]);
    }
 }
 #endif
 inline static float ggml_gelu_quick_f32(float x) {
    return x*(1.0f/(1.0f+expf(GELU_QUICK_COEF*x)));
 }
 //inline static void ggml_vec_gelu_quick_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
 //    const uint16_t * i16 = (const uint16_t *) x;
 //    for (int i = 0; i < n; ++i) {
 //        y[i] = ggml_table_gelu_quick_f16[i16[i]];
 //    }
 //}
 #ifdef GGML_GELU_QUICK_FP16
 inline static void ggml_vec_gelu_quick_f32(const int n, float * y, const float * x) {
    uint16_t t;
    for (int i = 0; i < n; ++i) {
        ggml_fp16_t fp16 = GGML_FP32_TO_FP16(x[i]);
        memcpy(&t, &fp16, sizeof(uint16_t));
        y[i] = GGML_FP16_TO_FP32(ggml_table_gelu_quick_f16[t]);
    }
 }
 #else
 inline static void ggml_vec_gelu_quick_f32(const int n, float * y, const float * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = ggml_gelu_quick_f32(x[i]);
    }
 }
 #endif
 inline static void ggml_vec_gelu_quick_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        float v = GGML_FP16_TO_FP32(x[i]);
        y[i] = GGML_FP32_TO_FP16(v*(1.0f/(1.0f+expf(GELU_QUICK_COEF*v))));
    }
 }
 // Sigmoid Linear Unit (SiLU) function
 inline static float ggml_silu_f32(float x) {
    return x/(1.0f + expf(-x));
 }
 inline static ggml_fp16_t ggml_silu_f16(ggml_fp16_t x) {
    float v = GGML_FP16_TO_FP32(x);
    return GGML_FP32_TO_FP16(v/(1.0f + expf(-v)));
 }
 #if __FINITE_MATH_ONLY__
 #error "some routines in ggml.c require non-finite math arithmetics -- pass -fno-finite-math-only to the compiler to fix"
 #error "ref: https://github.com/ggml-org/llama.cpp/pull/7154#issuecomment-2143844461"
 #endif
 #if defined(__ARM_NEON) && defined(__aarch64__)
 // adapted from arm limited optimized routine
 // the maximum error is 1.45358 plus 0.5 ulps
 // numbers above 88.38 will flush to infinity
 // numbers beneath -103.97 will flush to zero
 inline static float32x4_t ggml_v_expf(float32x4_t x) {
    const float32x4_t r = vdupq_n_f32(0x1.8p23f);
    const float32x4_t z = vfmaq_f32(r, x, vdupq_n_f32(0x1.715476p+0f));
    const float32x4_t n = vsubq_f32(z, r);
    const float32x4_t b = vfmsq_f32(vfmsq_f32(x, n, vdupq_n_f32(0x1.62e4p-1f)), n,
                                    vdupq_n_f32(0x1.7f7d1cp-20f));
    const uint32x4_t e = vshlq_n_u32(vreinterpretq_u32_f32(z), 23);
    const float32x4_t k = vreinterpretq_f32_u32(vaddq_u32(e, vreinterpretq_u32_f32(vdupq_n_f32(1))));
    const uint32x4_t c = vcagtq_f32(n, vdupq_n_f32(126));
    const float32x4_t u = vmulq_f32(b, b);
    const float32x4_t j = vfmaq_f32(
        vmulq_f32(vdupq_n_f32(0x1.ffffecp-1f), b),
        vfmaq_f32(vfmaq_f32(vdupq_n_f32(0x1.fffdb6p-2f), vdupq_n_f32(0x1.555e66p-3f), b),
                  vfmaq_f32(vdupq_n_f32(0x1.573e2ep-5f), vdupq_n_f32(0x1.0e4020p-7f), b), u), u);
    if (!vpaddd_u64(vreinterpretq_u64_u32(c)))
        return vfmaq_f32(k, j, k);
    const uint32x4_t d = vandq_u32(vclezq_f32(n), vdupq_n_u32(0x82000000));
    const float32x4_t s1 = vreinterpretq_f32_u32(vaddq_u32(d, vdupq_n_u32(0x7f000000)));
    const float32x4_t s2 = vreinterpretq_f32_u32(vsubq_u32(e, d));
    return vbslq_f32(vcagtq_f32(n, vdupq_n_f32(192)), vmulq_f32(s1, s1),
                     vbslq_f32(c, vmulq_f32(vfmaq_f32(s2, s2, j), s1), vfmaq_f32(k, k, j)));
 }
 // computes silu x/(1+exp(-x)) in single precision vector
 inline static float32x4_t ggml_v_silu(float32x4_t x) {
    const float32x4_t one = vdupq_n_f32(1.0f);
    const float32x4_t zero = vdupq_n_f32(0.0f);
    const float32x4_t neg_x = vsubq_f32(zero, x);
    const float32x4_t exp_neg_x = ggml_v_expf(neg_x);
    const float32x4_t one_plus_exp_neg_x = vaddq_f32(one, exp_neg_x);
    return vdivq_f32(x, one_plus_exp_neg_x);
 }
 #elif defined(__AVX512F__) && defined(__AVX512DQ__)
 // adapted from arm limited optimized routine
 // the maximum error is 1.45358 plus 0.5 ulps
 // numbers above 88.38 will flush to infinity
 // numbers beneath -103.97 will flush to zero
 inline static __m512 ggml_v_expf(__m512 x) {
  const __m512 r = _mm512_set1_ps(0x1.8p23f);
  const __m512 z = _mm512_fmadd_ps(x, _mm512_set1_ps(0x1.715476p+0f), r);
  const __m512 n = _mm512_sub_ps(z, r);
  const __m512 b =
      _mm512_fnmadd_ps(n, _mm512_set1_ps(0x1.7f7d1cp-20f),
                       _mm512_fnmadd_ps(n, _mm512_set1_ps(0x1.62e4p-1f), x));
  const __mmask16 d =
      _mm512_cmp_ps_mask(_mm512_abs_ps(n), _mm512_set1_ps(192), _CMP_GT_OQ);
  const __m512 u = _mm512_mul_ps(b, b);
  const __m512 j = _mm512_fmadd_ps(
      _mm512_fmadd_ps(_mm512_fmadd_ps(_mm512_set1_ps(0x1.0e4020p-7f), b,
                                      _mm512_set1_ps(0x1.573e2ep-5f)),
                      u,
                      _mm512_fmadd_ps(_mm512_set1_ps(0x1.555e66p-3f), b,
                                      _mm512_set1_ps(0x1.fffdb6p-2f))),
      u,
      _mm512_fmadd_ps(_mm512_set1_ps(0x1.ffffecp-1f), b, _mm512_set1_ps(1.0F)));
  const __m512 res = _mm512_scalef_ps(j, n);
  if (_mm512_kortestz(d, d))
    return res;
  const __m512 zero = _mm512_setzero_ps();
  const __m512 alt = _mm512_mask_blend_ps(
      _mm512_cmp_ps_mask(n, zero, _CMP_LE_OQ), _mm512_set1_ps(INFINITY), zero);
  return _mm512_mask_blend_ps(d, res, alt);
 }
 // computes silu x/(1+exp(-x)) in single precision vector
 inline static __m512 ggml_v_silu(__m512 x) {
    const __m512 one = _mm512_set1_ps(1);
    const __m512 zero = _mm512_setzero_ps();
    const __m512 neg_x = _mm512_sub_ps(zero, x);
    const __m512 exp_neg_x = ggml_v_expf(neg_x);
    const __m512 one_plus_exp_neg_x = _mm512_add_ps(one, exp_neg_x);
    return _mm512_div_ps(x, one_plus_exp_neg_x);
 }
 #elif defined(__AVX2__) && defined(__FMA__)
 // adapted from arm limited optimized routine
 // the maximum error is 1.45358 plus 0.5 ulps
 // numbers above 88.38 will flush to infinity
 // numbers beneath -103.97 will flush to zero
 inline static __m256 ggml_v_expf(__m256 x) {
  const __m256 r = _mm256_set1_ps(0x1.8p23f);
  const __m256 z = _mm256_fmadd_ps(x, _mm256_set1_ps(0x1.715476p+0f), r);
  const __m256 n = _mm256_sub_ps(z, r);
  const __m256 b = _mm256_fnmadd_ps(n, _mm256_set1_ps(0x1.7f7d1cp-20f),
                                    _mm256_fnmadd_ps(n, _mm256_set1_ps(0x1.62e4p-1f), x));
  const __m256i e = _mm256_slli_epi32(_mm256_castps_si256(z), 23);
  const __m256 k = _mm256_castsi256_ps(
      _mm256_add_epi32(e, _mm256_castps_si256(_mm256_set1_ps(1))));
  const __m256i c = _mm256_castps_si256(
      _mm256_cmp_ps(_mm256_andnot_ps(_mm256_set1_ps(-0.f), n),
                    _mm256_set1_ps(126), _CMP_GT_OQ));
  const __m256 u = _mm256_mul_ps(b, b);
  const __m256 j = _mm256_fmadd_ps(_mm256_fmadd_ps(_mm256_fmadd_ps(_mm256_set1_ps(0x1.0e4020p-7f), b,
                                                                   _mm256_set1_ps(0x1.573e2ep-5f)), u,
                                                   _mm256_fmadd_ps(_mm256_set1_ps(0x1.555e66p-3f), b,
                                                                   _mm256_set1_ps(0x1.fffdb6p-2f))),
                                   u, _mm256_mul_ps(_mm256_set1_ps(0x1.ffffecp-1f), b));
  if (!_mm256_movemask_ps(_mm256_castsi256_ps(c)))
    return _mm256_fmadd_ps(j, k, k);
  const __m256i g = _mm256_and_si256(
      _mm256_castps_si256(_mm256_cmp_ps(n, _mm256_setzero_ps(), _CMP_LE_OQ)),
      _mm256_set1_epi32(0x82000000u));
  const __m256 s1 =
      _mm256_castsi256_ps(_mm256_add_epi32(g, _mm256_set1_epi32(0x7f000000u)));
  const __m256 s2 = _mm256_castsi256_ps(_mm256_sub_epi32(e, g));
  const __m256i d = _mm256_castps_si256(
      _mm256_cmp_ps(_mm256_andnot_ps(_mm256_set1_ps(-0.f), n),
                    _mm256_set1_ps(192), _CMP_GT_OQ));
  return _mm256_or_ps(
      _mm256_and_ps(_mm256_castsi256_ps(d), _mm256_mul_ps(s1, s1)),
      _mm256_andnot_ps(
          _mm256_castsi256_ps(d),
          _mm256_or_ps(
              _mm256_and_ps(_mm256_castsi256_ps(c),
                            _mm256_mul_ps(_mm256_fmadd_ps(s2, j, s2), s1)),
              _mm256_andnot_ps(_mm256_castsi256_ps(c), _mm256_fmadd_ps(k, j, k)))));
 }
 // computes silu x/(1+exp(-x)) in single precision vector
 inline static __m256 ggml_v_silu(__m256 x) {
    const __m256 one = _mm256_set1_ps(1);
    const __m256 zero = _mm256_setzero_ps();
    const __m256 neg_x = _mm256_sub_ps(zero, x);
    const __m256 exp_neg_x = ggml_v_expf(neg_x);
    const __m256 one_plus_exp_neg_x = _mm256_add_ps(one, exp_neg_x);
    return _mm256_div_ps(x, one_plus_exp_neg_x);
 }
 #elif defined(__SSE2__) // __AVX2__ / __ARM_NEON
 #if defined(__FMA__)
 #define MADD128(x, y, z) _mm_fmadd_ps(x, y, z)
 #define NMADD128(x, y, z) _mm_fnmadd_ps(x, y, z)
 #else
 #define MADD128(x, y, z) _mm_add_ps(_mm_mul_ps(x, y), z)
 #define NMADD128(x, y, z) _mm_sub_ps(z, _mm_mul_ps(x, y))
 #endif
 // adapted from arm limited optimized routine
 // the maximum error is 1.45358 plus 0.5 ulps
 // numbers above 88.38 will flush to infinity
 // numbers beneath -103.97 will flush to zero
 inline static __m128 ggml_v_expf(__m128 x) {
    const __m128 r = _mm_set1_ps(0x1.8p23f);
    const __m128 z = MADD128(x, _mm_set1_ps(0x1.715476p+0f), r);
    const __m128 n = _mm_sub_ps(z, r);
    const __m128 b =
        NMADD128(n, _mm_set1_ps(0x1.7f7d1cp-20f), NMADD128(n, _mm_set1_ps(0x1.62e4p-1f), x));
    const __m128i e = _mm_slli_epi32(_mm_castps_si128(z), 23);
    const __m128 k = _mm_castsi128_ps(_mm_add_epi32(e, _mm_castps_si128(_mm_set1_ps(1))));
    const __m128i c =
        _mm_castps_si128(_mm_cmpgt_ps(_mm_andnot_ps(_mm_set1_ps(-0.f), n), _mm_set1_ps(126)));
    const __m128 u = _mm_mul_ps(b, b);
    const __m128 j =
        MADD128(MADD128(MADD128(_mm_set1_ps(0x1.0e4020p-7f), b, _mm_set1_ps(0x1.573e2ep-5f)), u,
                        MADD128(_mm_set1_ps(0x1.555e66p-3f), b, _mm_set1_ps(0x1.fffdb6p-2f))),
                u, _mm_mul_ps(_mm_set1_ps(0x1.ffffecp-1f), b));
    if (!_mm_movemask_epi8(c))
        return MADD128(j, k, k);
    const __m128i g = _mm_and_si128(_mm_castps_si128(_mm_cmple_ps(n, _mm_setzero_ps())),
                                    _mm_set1_epi32(0x82000000u));
    const __m128 s1 = _mm_castsi128_ps(_mm_add_epi32(g, _mm_set1_epi32(0x7f000000u)));
    const __m128 s2 = _mm_castsi128_ps(_mm_sub_epi32(e, g));
    const __m128i d =
        _mm_castps_si128(_mm_cmpgt_ps(_mm_andnot_ps(_mm_set1_ps(-0.f), n), _mm_set1_ps(192)));
    return _mm_or_ps(
        _mm_and_ps(_mm_castsi128_ps(d), _mm_mul_ps(s1, s1)),
        _mm_andnot_ps(_mm_castsi128_ps(d),
                      _mm_or_ps(_mm_and_ps(_mm_castsi128_ps(c), _mm_mul_ps(MADD128(s2, j, s2), s1)),
                                _mm_andnot_ps(_mm_castsi128_ps(c), MADD128(k, j, k)))));
 }
 // computes silu x/(1+exp(-x)) in single precision vector
 inline static __m128 ggml_v_silu(__m128 x) {
    const __m128 one = _mm_set1_ps(1);
    const __m128 zero = _mm_setzero_ps();
    const __m128 neg_x = _mm_sub_ps(zero, x);
    const __m128 exp_neg_x = ggml_v_expf(neg_x);
    const __m128 one_plus_exp_neg_x = _mm_add_ps(one, exp_neg_x);
    return _mm_div_ps(x, one_plus_exp_neg_x);
 }
 #endif // __ARM_NEON / __AVX2__ / __SSE2__
 inline static void ggml_vec_silu_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = ggml_silu_f16(x[i]);
    }
 }
 inline static float ggml_silu_backward_f32(float x, float dy) {
    const float s = 1.0f/(1.0f + expf(-x));
    return dy*s*(1.0f + x*(1.0f - s));
 }
 inline static ggml_fp16_t ggml_silu_backward_f16(ggml_fp16_t x, ggml_fp16_t dy) {
    const float v = GGML_FP16_TO_FP32(x);
    const float s = 1.0f/(1.0f + expf(-v));
    return GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(dy)*s*(1.0f + v*(1.0f - s)));
 }
 inline static void ggml_vec_silu_backward_f32(const int n, float * dx, const float * x, const float * dy) {
    for (int i = 0; i < n; ++i) {
        dx[i] = ggml_silu_backward_f32(x[i], dy[i]);
    }
 }
 inline static void ggml_vec_silu_backward_f16(const int n, ggml_fp16_t * dx, const ggml_fp16_t * x, const ggml_fp16_t * dy) {
    for (int i = 0; i < n; ++i) {
        dx[i] = ggml_silu_backward_f16(x[i], dy[i]);
    }
 }
 inline static void ggml_vec_sum_f32(const int n, float * s, const float * x) {
 #ifndef GGML_USE_ACCELERATE
    ggml_float sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += (ggml_float)x[i];
    }
    *s = (float)sum;
 #else
    vDSP_sve(x, 1, s, n);
 #endif
 }
 inline static void ggml_vec_sum_f32_ggf(const int n, ggml_float * s, const float * x) {
    ggml_float sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += (ggml_float)x[i];
    }
    *s = sum;
 }
 inline static void ggml_vec_sum_f16_ggf(const int n, float * s, const ggml_fp16_t * x) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += GGML_FP16_TO_FP32(x[i]);
    }
    *s = sum;
 }
 inline static void ggml_vec_sum_bf16_ggf(const int n, float * s, const ggml_bf16_t * x) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += GGML_BF16_TO_FP32(x[i]);
    }
    *s = sum;
 }
 inline static void ggml_vec_max_f32(const int n, float * s, const float * x) {
 #ifndef GGML_USE_ACCELERATE
    float max = -INFINITY;
    for (int i = 0; i < n; ++i) {
        max = MAX(max, x[i]);
    }
    *s = max;
 #else
    vDSP_maxv(x, 1, s, n);
 #endif
 }
 inline static void ggml_vec_norm_inv_f32(const int n, float * s, const float * x) {
    ggml_vec_norm_f32(n, s, x);
    *s = 1.f/(*s);
 }
 inline static void ggml_vec_argmax_f32(const int n, int * s, const float * x) {
    float max = -INFINITY;
    int idx = 0;
    for (int i = 0; i < n; ++i) {
        max = MAX(max, x[i]);
        if (max == x[i]) { idx = i; }
    }
    *s = idx;
 }
 #ifdef __cplusplus
 }
 #endif
--- a/models/README.md
+++ b/models/README.md
@ -25,7 +25,6 @@ You can now use it like this:
 `ggml` models are available from the following locations:
 - https://huggingface.co/ggerganov/whisper.cpp/tree/main
 - https://ggml.ggerganov.com
 ### 3. Convert with [convert-pt-to-ggml.py](convert-pt-to-ggml.py)
@ -78,7 +77,7 @@ OpenAI format. To read the HF models you can use the [convert-h5-to-ggml.py](con
 ```bash
 git clone https://github.com/openai/whisper
-git clone https://github.com/ggerganov/whisper.cpp
+git clone https://github.com/ggml-org/whisper.cpp
 # clone HF fine-tuned model (this is just an example)
 git clone https://huggingface.co/openai/whisper-medium
@ -96,7 +95,7 @@ Currently, the chunk-based transcription strategy is not implemented, so there c
 ```bash
 # clone OpenAI whisper and whisper.cpp
 git clone https://github.com/openai/whisper
-git clone https://github.com/ggerganov/whisper.cpp
+git clone https://github.com/ggml-org/whisper.cpp
 # get the models
 cd whisper.cpp/models
--- a/models/convert-h5-to-ggml.py
+++ b/models/convert-h5-to-ggml.py
@ -3,7 +3,7 @@
 # Usage:
 #
 #   git clone https://github.com/openai/whisper
-#   git clone https://github.com/ggerganov/whisper.cpp
+#   git clone https://github.com/ggml-org/whisper.cpp
 #   git clone https://huggingface.co/openai/whisper-medium
 #
 #   python3 ./whisper.cpp/models/convert-h5-to-ggml.py ./whisper-medium/ ./whisper .
@ -12,7 +12,7 @@
 #
 # For more info:
 #
-#   https://github.com/ggerganov/whisper.cpp/issues/157
+#   https://github.com/ggml-org/whisper.cpp/issues/157
 #
 import io
--- a/models/convert-whisper-to-coreml.py
+++ b/models/convert-whisper-to-coreml.py
@ -254,10 +254,10 @@ def convert_encoder(hparams, model, quantize=False):
    model = ct.convert(
        traced_model,
-        convert_to="neuralnetwork",
+        convert_to="mlprogram",
        inputs=[ct.TensorType(name="logmel_data", shape=input_shape)],
        outputs=[ct.TensorType(name="output")],
-        compute_units=ct.ComputeUnit.ALL
+        compute_units=ct.ComputeUnit.ALL,
    )
    if quantize:
@ -278,11 +278,11 @@ def convert_decoder(hparams, model, quantize=False):
    model = ct.convert(
        traced_model,
-        convert_to="neuralnetwork",
+        convert_to="mlprogram",
        inputs=[
            ct.TensorType(name="token_data", shape=tokens_shape, dtype=int),
            ct.TensorType(name="audio_data", shape=audio_shape)
-        ]
+        ],
    )
    if quantize:
--- a/scripts/sync-ggml.last
+++ b/scripts/sync-ggml.last
@ -1 +1 @@
-7d7aa2dee2eb55dc683af80b769b81a0642226a1
+d920dfd7da37b22d1eb0813cdaf340c1870d76c3
--- a/src/whisper.cpp
+++ b/src/whisper.cpp
@ -4276,11 +4276,11 @@ void whisper_print_timings(struct whisper_context * ctx) {
        WHISPER_LOG_INFO("%s:     fallbacks = %3d p / %3d h\n", __func__, ctx->state->n_fail_p, ctx->state->n_fail_h);
        WHISPER_LOG_INFO("%s:      mel time = %8.2f ms\n", __func__, ctx->state->t_mel_us / 1000.0f);
-        WHISPER_LOG_INFO("%s:   sample time = %8.2f ms / %5d runs (%8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_sample_us, n_sample, 1e-3f * ctx->state->t_sample_us / n_sample);
+        WHISPER_LOG_INFO("%s:   sample time = %8.2f ms / %5d runs ( %8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_sample_us, n_sample, 1e-3f * ctx->state->t_sample_us / n_sample);
-        WHISPER_LOG_INFO("%s:   encode time = %8.2f ms / %5d runs (%8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_encode_us, n_encode, 1e-3f * ctx->state->t_encode_us / n_encode);
+        WHISPER_LOG_INFO("%s:   encode time = %8.2f ms / %5d runs ( %8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_encode_us, n_encode, 1e-3f * ctx->state->t_encode_us / n_encode);
-        WHISPER_LOG_INFO("%s:   decode time = %8.2f ms / %5d runs (%8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_decode_us, n_decode, 1e-3f * ctx->state->t_decode_us / n_decode);
+        WHISPER_LOG_INFO("%s:   decode time = %8.2f ms / %5d runs ( %8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_decode_us, n_decode, 1e-3f * ctx->state->t_decode_us / n_decode);
-        WHISPER_LOG_INFO("%s:   batchd time = %8.2f ms / %5d runs (%8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_batchd_us, n_batchd, 1e-3f * ctx->state->t_batchd_us / n_batchd);
+        WHISPER_LOG_INFO("%s:   batchd time = %8.2f ms / %5d runs ( %8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_batchd_us, n_batchd, 1e-3f * ctx->state->t_batchd_us / n_batchd);
-        WHISPER_LOG_INFO("%s:   prompt time = %8.2f ms / %5d runs (%8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_prompt_us, n_prompt, 1e-3f * ctx->state->t_prompt_us / n_prompt);
+        WHISPER_LOG_INFO("%s:   prompt time = %8.2f ms / %5d runs ( %8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_prompt_us, n_prompt, 1e-3f * ctx->state->t_prompt_us / n_prompt);
    }
    WHISPER_LOG_INFO("%s:    total time = %8.2f ms\n", __func__, (t_end_us - ctx->t_start_us)/1000.0f);
 }
@ -5527,11 +5527,13 @@ int whisper_full_with_state(
    const int seek_start = params.offset_ms/10;
    const int seek_end = params.duration_ms == 0 ? whisper_n_len_from_state(state) : seek_start + params.duration_ms/10;
-    // if length of spectrogram is less than 1.0s (100 frames), then return
+    // if length of spectrogram is less than 100ms (10 frames), then return
-    // basically don't process anything that is less than 1.0s
+    // basically don't process anything that is less than 100ms
-    // see issue #39: https://github.com/ggerganov/whisper.cpp/issues/39
+    // ref: https://github.com/ggml-org/whisper.cpp/issues/2065
-    if (seek_end < seek_start + 100) {
+    const int delta_min = 10;
-        WHISPER_LOG_WARN("%s: input is too short - %d ms < 1000 ms. consider padding the input audio with silence\n", __func__, (seek_end - seek_start)*10);
+
+    if (seek_end < seek_start + delta_min) {
+        WHISPER_LOG_WARN("%s: input is too short - %d ms < 100 ms. consider padding the input audio with silence\n", __func__, (seek_end - seek_start)*10);
        return 0;
    }
@ -5675,8 +5677,8 @@ int whisper_full_with_state(
                ctx, state, progress_cur, params.progress_callback_user_data);
        }
-        // if only 1 second left, then stop
+        // if only 100ms left, then stop
-        if (seek + 100 >= seek_end) {
+        if (seek + delta_min >= seek_end) {
            break;
        }
@ -6023,10 +6025,10 @@ int whisper_full_with_state(
                        // end of segment
                        if (token.id == whisper_token_eot(ctx) ||               // end of text token
                           (params.max_tokens > 0 && i >= params.max_tokens) || // max tokens per segment reached
-                           (has_ts && seek + seek_delta + 100 >= seek_end)      // end of audio reached
+                           (has_ts && seek + seek_delta + delta_min >= seek_end)       // end of audio reached (100ms)
                           ) {
                            if (result_len == 0 && !params.no_timestamps) {
-                                if (seek + seek_delta + 100 >= seek_end) {
+                                if (seek + seek_delta + delta_min >= seek_end) {
                                    result_len = i + 1;
                                } else {
                                    WHISPER_LOG_DEBUG("%s: decoder %d failed (result_len = 0)\n", __func__, j);
@ -6375,7 +6377,7 @@ int whisper_full_with_state(
                }
            }
-            // ref: https://github.com/ggerganov/whisper.cpp/pull/2629
+            // ref: https://github.com/ggml-org/whisper.cpp/pull/2629
            const bool single_timestamp_ending = tokens_cur.size() > 1 &&
                tokens_cur[tokens_cur.size() - 2].id < whisper_token_beg(ctx) &&
                tokens_cur[tokens_cur.size() - 1].id > whisper_token_beg(ctx);
--- a/tests/librispeech/.gitignore
+++ b/tests/librispeech/.gitignore
@ -0,0 +1,6 @@
 __pycache__
 *.tar.gz
 *.txt
 eval.conf
 venv
 LibriSpeech
--- a/tests/librispeech/Makefile
+++ b/tests/librispeech/Makefile
@ -0,0 +1,15 @@
 TAR_URL = https://www.openslr.org/resources/12/test-clean.tar.gz
 all: eval
 eval:
 	$(MAKE) -f eval.mk
 clean:
 	$(MAKE) -f eval.mk clean
 get-audio:
 	wget -c $(TAR_URL)
 	tar -xf test-clean.tar.gz
 .PHONY: all eval clean setup-venv clean-venv get-audio
--- a/tests/librispeech/README.md
+++ b/tests/librispeech/README.md
@ -0,0 +1,60 @@
 # whisper.cpp/tests/librispeech
 [LibriSpeech](https://www.openslr.org/12) is a standard dataset for
 training and evaluating automatic speech recognition systems.
 This directory contains a set of tools to evaluate the recognition
 performance of whisper.cpp on the LibriSpeech corpus.
 ## Quick Start
 1. (Prerequisite) Compile `whisper-cli` and prepare the Whisper
   model in `ggml` format.
   ```
   $ # Execute the commands below in the project root dir.
   $ cmake -B build
   $ cmake --build build --config Release
   $ ./models/download-ggml-model.sh tiny
   ```
   Consult [whisper.cpp/README.md](../../README.md) for more details.
 2. Download the audio files from the LibriSpeech project.
   ```
   $ make get-audio
   ```
 3. Set up the environment to compute the WER (word error rate) score.
   ```
   $ pip install -r requirements.txt
   ```
   For example, if you use a virtual environment, you can set it up as follows:
   ```
   $ python3 -m venv venv
   $ . venv/bin/activate
   $ pip install -r requirements.txt
   ```
 4. Run the benchmark test.
   ```
   $ make
   ```
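The WER (word error rate) score reported in step 4 is the number of word-level edits (substitutions, deletions, insertions) needed to turn each transcript into its reference, divided by the number of reference words. The evaluation script relies on the `jiwer` package for this; the following is only a minimal, self-contained sketch of the metric. The helper names `edit_distance` and `word_error_rate` are hypothetical and not part of this directory, and pooling errors over all utterances is an assumption about how the score is aggregated.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two lists of words."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Total word errors divided by total reference words (hypothetical helper)."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello word"]
wer = word_error_rate(refs, hyps)
print(f"WER: {wer * 100:.2f}%")  # prints "WER: 25.00%"
```

In this toy input there is one deletion and one substitution against eight reference words, hence 2/8. In practice both sides are normalized first (lowercasing, punctuation removal, number spelling), so that formatting differences are not counted as recognition errors.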
 ## How-to guides
 ### How to change the inference parameters
 Create `eval.conf` and override variables.
 ```
 WHISPER_MODEL = large-v3-turbo
 WHISPER_FLAGS = --no-prints --threads 8 --language en --output-txt
 ```
 Check out `eval.mk` for more details.
--- a/tests/librispeech/eval.mk
+++ b/tests/librispeech/eval.mk
@ -0,0 +1,39 @@
 PYTHON = python
 WHISPER_PREFIX = ../../
 WHISPER_MODEL = tiny
 WHISPER_CLI = $(WHISPER_PREFIX)build/bin/whisper-cli
 WHISPER_FLAGS = --no-prints --language en --output-txt
 # You can create eval.conf to override the WHISPER_* variables
 # defined above.
 -include eval.conf
 # This follows the file structure of the LibriSpeech project.
 AUDIO_SRCS = $(sort $(wildcard LibriSpeech/*/*/*/*.flac))
 TRANS_TXTS = $(addsuffix .txt, $(AUDIO_SRCS))
 # We output the evaluation result to this file.
 DONE = $(WHISPER_MODEL).txt
 all: $(DONE)
 $(DONE): $(TRANS_TXTS)
 	$(PYTHON) eval.py > $@.tmp
 	mv $@.tmp $@
 # Note: This task writes to a temporary file first to
 # create the target file atomically.
 %.flac.txt: %.flac
 	$(WHISPER_CLI) $(WHISPER_FLAGS) --model $(WHISPER_PREFIX)models/ggml-$(WHISPER_MODEL).bin --file $^ --output-file $^.tmp
 	mv $^.tmp.txt $^.txt
 archive:
 	tar -czf $(WHISPER_MODEL).tar.gz --exclude="*.flac" LibriSpeech $(DONE)
 clean:
 	@rm -f $(TRANS_TXTS)
 	@rm -f $(DONE)
 .PHONY: all clean
--- a/tests/librispeech/eval.py
+++ b/tests/librispeech/eval.py
@ -0,0 +1,47 @@
 import os
 import glob
 import jiwer
 from normalizers import EnglishTextNormalizer
 def get_reference():
    ref = {}
    for path in glob.glob('LibriSpeech/*/*/*/*.trans.txt'):
        with open(path) as fp:
            for line in fp:
                code, text = line.strip().split(" ", maxsplit=1)
                ref[code] = text
    return ref
 def get_hypothesis():
    hyp = {}
    for path in glob.glob('LibriSpeech/*/*/*/*.flac.txt'):
        with open(path) as fp:
            text = fp.read().strip()
        code = os.path.basename(path).replace('.flac.txt', '')
        hyp[code] = text
    return hyp
 def get_codes():
    codes = []
    for path in glob.glob('LibriSpeech/*/*/*/*.flac'):
        codes.append(os.path.basename(path).replace('.flac', ''))
    return sorted(codes)
 def main():
    normalizer = EnglishTextNormalizer()
    ref_orig = get_reference()
    hyp_orig = get_hypothesis()
    ref_clean = []
    hyp_clean = []
    for code in get_codes():
        ref_clean.append(normalizer(ref_orig[code]))
        hyp_clean.append(normalizer(hyp_orig[code]))
    wer = jiwer.wer(ref_clean, hyp_clean)
    print(f"WER: {wer * 100:.2f}%")
 if __name__ == '__main__':
    main()
--- a/tests/librispeech/normalizers/LICENSE
+++ b/tests/librispeech/normalizers/LICENSE
@ -0,0 +1,25 @@
 Code in this directory is adapted from OpenAI Whisper project
 (https://github.com/openai/whisper) and carries the following
 copyright and license.
    MIT License
    Copyright (c) 2022 OpenAI
    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:
    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.
    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.
--- a/tests/librispeech/normalizers/__init__.py
+++ b/tests/librispeech/normalizers/__init__.py
@ -0,0 +1,2 @@
 from .basic import BasicTextNormalizer as BasicTextNormalizer
 from .english import EnglishTextNormalizer as EnglishTextNormalizer
--- a/tests/librispeech/normalizers/basic.py
+++ b/tests/librispeech/normalizers/basic.py
@ -0,0 +1,80 @@
 import re
 import unicodedata
 import regex
 # non-ASCII letters that are not separated by "NFKD" normalization
 ADDITIONAL_DIACRITICS = {
    "œ": "oe",
    "Œ": "OE",
    "ø": "o",
    "Ø": "O",
    "æ": "ae",
    "Æ": "AE",
    "ß": "ss",
    "ẞ": "SS",
    "đ": "d",
    "Đ": "D",
    "ð": "d",
    "Ð": "D",
    "þ": "th",
    "Þ": "th",
    "ł": "l",
    "Ł": "L",
 }
 def remove_symbols_and_diacritics(s: str, keep=""):
    """
    Replace any other markers, symbols, and punctuations with a space,
    and drop any diacritics (category 'Mn' and some manual mappings)
    """
    return "".join(
        (
            c
            if c in keep
            else (
                ADDITIONAL_DIACRITICS[c]
                if c in ADDITIONAL_DIACRITICS
                else (
                    ""
                    if unicodedata.category(c) == "Mn"
                    else " " if unicodedata.category(c)[0] in "MSP" else c
                )
            )
        )
        for c in unicodedata.normalize("NFKD", s)
    )
 def remove_symbols(s: str):
    """
    Replace any other markers, symbols, punctuations with a space, keeping diacritics
    """
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )
 class BasicTextNormalizer:
    def __init__(self, remove_diacritics: bool = False, split_letters: bool = False):
        self.clean = (
            remove_symbols_and_diacritics if remove_diacritics else remove_symbols
        )
        self.split_letters = split_letters
    def __call__(self, s: str):
        s = s.lower()
        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # remove words between brackets
        s = re.sub(r"\(([^)]+?)\)", "", s)  # remove words between parentheses
        s = self.clean(s).lower()
        if self.split_letters:
            s = " ".join(regex.findall(r"\X", s, regex.U))
        s = re.sub(
            r"\s+", " ", s
        )  # replace any successive whitespace characters with a space
        return s
--- a/tests/librispeech/normalizers/english.json
+++ b/tests/librispeech/normalizers/english.json
--- a/tests/librispeech/normalizers/english.py
+++ b/tests/librispeech/normalizers/english.py
@ -0,0 +1,550 @@
 import json
 import os
 import re
 from fractions import Fraction
 from typing import Iterator, List, Match, Optional, Union
 from more_itertools import windowed
 from .basic import remove_symbols_and_diacritics
 class EnglishNumberNormalizer:
    """
    Convert any spelled-out numbers into arabic numbers, while handling:
    - remove any commas
    - keep the suffixes such as: `1960s`, `274th`, `32nd`, etc.
    - spell out currency symbols after the number. e.g. `$20 million` -> `20000000 dollars`
    - spell out `one` and `ones`
    - interpret successive single-digit numbers as nominal: `one oh one` -> `101`
    """
    def __init__(self):
        super().__init__()
        self.zeros = {"o", "oh", "zero"}
        self.ones = {
            name: i
            for i, name in enumerate(
                [
                    "one",
                    "two",
                    "three",
                    "four",
                    "five",
                    "six",
                    "seven",
                    "eight",
                    "nine",
                    "ten",
                    "eleven",
                    "twelve",
                    "thirteen",
                    "fourteen",
                    "fifteen",
                    "sixteen",
                    "seventeen",
                    "eighteen",
                    "nineteen",
                ],
                start=1,
            )
        }
        self.ones_plural = {
            "sixes" if name == "six" else name + "s": (value, "s")
            for name, value in self.ones.items()
        }
        self.ones_ordinal = {
            "zeroth": (0, "th"),
            "first": (1, "st"),
            "second": (2, "nd"),
            "third": (3, "rd"),
            "fifth": (5, "th"),
            "twelfth": (12, "th"),
            **{
                name + ("h" if name.endswith("t") else "th"): (value, "th")
                for name, value in self.ones.items()
                if value > 3 and value != 5 and value != 12
            },
        }
        self.ones_suffixed = {**self.ones_plural, **self.ones_ordinal}
        self.tens = {
            "twenty": 20,
            "thirty": 30,
            "forty": 40,
            "fifty": 50,
            "sixty": 60,
            "seventy": 70,
            "eighty": 80,
            "ninety": 90,
        }
        self.tens_plural = {
            name.replace("y", "ies"): (value, "s") for name, value in self.tens.items()
        }
        self.tens_ordinal = {
            name.replace("y", "ieth"): (value, "th")
            for name, value in self.tens.items()
        }
        self.tens_suffixed = {**self.tens_plural, **self.tens_ordinal}
        self.multipliers = {
            "hundred": 100,
            "thousand": 1_000,
            "million": 1_000_000,
            "billion": 1_000_000_000,
            "trillion": 1_000_000_000_000,
            "quadrillion": 1_000_000_000_000_000,
            "quintillion": 1_000_000_000_000_000_000,
            "sextillion": 1_000_000_000_000_000_000_000,
            "septillion": 1_000_000_000_000_000_000_000_000,
            "octillion": 1_000_000_000_000_000_000_000_000_000,
            "nonillion": 1_000_000_000_000_000_000_000_000_000_000,
            "decillion": 1_000_000_000_000_000_000_000_000_000_000_000,
        }
        self.multipliers_plural = {
            name + "s": (value, "s") for name, value in self.multipliers.items()
        }
        self.multipliers_ordinal = {
            name + "th": (value, "th") for name, value in self.multipliers.items()
        }
        self.multipliers_suffixed = {
            **self.multipliers_plural,
            **self.multipliers_ordinal,
        }
        self.decimals = {*self.ones, *self.tens, *self.zeros}
        self.preceding_prefixers = {
            "minus": "-",
            "negative": "-",
            "plus": "+",
            "positive": "+",
        }
        self.following_prefixers = {
            "pound": "£",
            "pounds": "£",
            "euro": "€",
            "euros": "€",
            "dollar": "$",
            "dollars": "$",
            "cent": "¢",
            "cents": "¢",
        }
        self.prefixes = set(
            list(self.preceding_prefixers.values())
            + list(self.following_prefixers.values())
        )
        self.suffixers = {
            "per": {"cent": "%"},
            "percent": "%",
        }
        self.specials = {"and", "double", "triple", "point"}
        self.words = set(
            [
                key
                for mapping in [
                    self.zeros,
                    self.ones,
                    self.ones_suffixed,
                    self.tens,
                    self.tens_suffixed,
                    self.multipliers,
                    self.multipliers_suffixed,
                    self.preceding_prefixers,
                    self.following_prefixers,
                    self.suffixers,
                    self.specials,
                ]
                for key in mapping
            ]
        )
        self.literal_words = {"one", "ones"}
    def process_words(self, words: List[str]) -> Iterator[str]:
        prefix: Optional[str] = None
        value: Optional[Union[str, int]] = None
        skip = False
        def to_fraction(s: str):
            try:
                return Fraction(s)
            except ValueError:
                return None
        def output(result: Union[str, int]):
            nonlocal prefix, value
            result = str(result)
            if prefix is not None:
                result = prefix + result
            value = None
            prefix = None
            return result
        if len(words) == 0:
            return
        for prev, current, next in windowed([None] + words + [None], 3):
            if skip:
                skip = False
                continue
            next_is_numeric = next is not None and re.match(r"^\d+(\.\d+)?$", next)
            has_prefix = current[0] in self.prefixes
            current_without_prefix = current[1:] if has_prefix else current
            if re.match(r"^\d+(\.\d+)?$", current_without_prefix):
                # arabic numbers (potentially with signs and fractions)
                f = to_fraction(current_without_prefix)
                assert f is not None
                if value is not None:
                    if isinstance(value, str) and value.endswith("."):
                        # concatenate decimals / ip address components
                        value = str(value) + str(current)
                        continue
                    else:
                        yield output(value)
                prefix = current[0] if has_prefix else prefix
                if f.denominator == 1:
                    value = f.numerator  # store integers as int
                else:
                    value = current_without_prefix
            elif current not in self.words:
                # non-numeric words
                if value is not None:
                    yield output(value)
                yield output(current)
            elif current in self.zeros:
                value = str(value or "") + "0"
            elif current in self.ones:
                ones = self.ones[current]
                if value is None:
                    value = ones
                elif isinstance(value, str) or prev in self.ones:
                    if (
                        prev in self.tens and ones < 10
                    ):  # replace the last zero with the digit
                        assert value[-1] == "0"
                        value = value[:-1] + str(ones)
                    else:
                        value = str(value) + str(ones)
                elif ones < 10:
                    if value % 10 == 0:
                        value += ones
                    else:
                        value = str(value) + str(ones)
                else:  # eleven to nineteen
                    if value % 100 == 0:
                        value += ones
                    else:
                        value = str(value) + str(ones)
            elif current in self.ones_suffixed:
                # ordinal or cardinal; yield the number right away
                ones, suffix = self.ones_suffixed[current]
                if value is None:
                    yield output(str(ones) + suffix)
                elif isinstance(value, str) or prev in self.ones:
                    if prev in self.tens and ones < 10:
                        assert value[-1] == "0"
                        yield output(value[:-1] + str(ones) + suffix)
                    else:
                        yield output(str(value) + str(ones) + suffix)
                elif ones < 10:
                    if value % 10 == 0:
                        yield output(str(value + ones) + suffix)
                    else:
                        yield output(str(value) + str(ones) + suffix)
                else:  # eleven to nineteen
                    if value % 100 == 0:
                        yield output(str(value + ones) + suffix)
                    else:
                        yield output(str(value) + str(ones) + suffix)
                value = None
            elif current in self.tens:
                tens = self.tens[current]
                if value is None:
                    value = tens
                elif isinstance(value, str):
                    value = str(value) + str(tens)
                else:
                    if value % 100 == 0:
                        value += tens
                    else:
                        value = str(value) + str(tens)
            elif current in self.tens_suffixed:
                # ordinal or cardinal; yield the number right away
                tens, suffix = self.tens_suffixed[current]
                if value is None:
                    yield output(str(tens) + suffix)
                elif isinstance(value, str):
                    yield output(str(value) + str(tens) + suffix)
                else:
                    if value % 100 == 0:
                        yield output(str(value + tens) + suffix)
                    else:
                        yield output(str(value) + str(tens) + suffix)
            elif current in self.multipliers:
                multiplier = self.multipliers[current]
                if value is None:
                    value = multiplier
                elif isinstance(value, str) or value == 0:
                    f = to_fraction(value)
                    p = f * multiplier if f is not None else None
                    if f is not None and p.denominator == 1:
                        value = p.numerator
                    else:
                        yield output(value)
                        value = multiplier
                else:
                    before = value // 1000 * 1000
                    residual = value % 1000
                    value = before + residual * multiplier
            elif current in self.multipliers_suffixed:
                multiplier, suffix = self.multipliers_suffixed[current]
                if value is None:
                    yield output(str(multiplier) + suffix)
                elif isinstance(value, str):
                    f = to_fraction(value)
                    p = f * multiplier if f is not None else None
                    if f is not None and p.denominator == 1:
                        yield output(str(p.numerator) + suffix)
                    else:
                        yield output(value)
                        yield output(str(multiplier) + suffix)
                else:  # int
                    before = value // 1000 * 1000
                    residual = value % 1000
                    value = before + residual * multiplier
                    yield output(str(value) + suffix)
                value = None
            elif current in self.preceding_prefixers:
                # apply prefix (positive, minus, etc.) if it precedes a number
                if value is not None:
                    yield output(value)
                if next in self.words or next_is_numeric:
                    prefix = self.preceding_prefixers[current]
                else:
                    yield output(current)
            elif current in self.following_prefixers:
                # apply prefix (dollars, cents, etc.) only after a number
                if value is not None:
                    prefix = self.following_prefixers[current]
                    yield output(value)
                else:
                    yield output(current)
            elif current in self.suffixers:
                # apply suffix symbols (percent -> '%')
                if value is not None:
                    suffix = self.suffixers[current]
                    if isinstance(suffix, dict):
                        if next in suffix:
                            yield output(str(value) + suffix[next])
                            skip = True
                        else:
                            yield output(value)
                            yield output(current)
                    else:
                        yield output(str(value) + suffix)
                else:
                    yield output(current)
            elif current in self.specials:
                if next not in self.words and not next_is_numeric:
                    # apply special handling only if the next word can be numeric
                    if value is not None:
                        yield output(value)
                    yield output(current)
                elif current == "and":
                    # ignore "and" after hundreds, thousands, etc.
                    if prev not in self.multipliers:
                        if value is not None:
                            yield output(value)
                        yield output(current)
                elif current == "double" or current == "triple":
                    if next in self.ones or next in self.zeros:
                        repeats = 2 if current == "double" else 3
                        ones = self.ones.get(next, 0)
                        value = str(value or "") + str(ones) * repeats
                        skip = True
                    else:
                        if value is not None:
                            yield output(value)
                        yield output(current)
                elif current == "point":
                    if next in self.decimals or next_is_numeric:
                        value = str(value or "") + "."
                else:
                    # should all have been covered at this point
                    raise ValueError(f"Unexpected token: {current}")
            else:
                # all should have been covered at this point
                raise ValueError(f"Unexpected token: {current}")
        if value is not None:
            yield output(value)
    def preprocess(self, s: str):
        # replace "<number> and a half" with "<number> point five"
        results = []
        segments = re.split(r"\band\s+a\s+half\b", s)
        for i, segment in enumerate(segments):
            if len(segment.strip()) == 0:
                continue
            if i == len(segments) - 1:
                results.append(segment)
            else:
                results.append(segment)
                last_word = segment.rsplit(maxsplit=2)[-1]
                if last_word in self.decimals or last_word in self.multipliers:
                    results.append("point five")
                else:
                    results.append("and a half")
        s = " ".join(results)
        # put a space at number/letter boundary
        s = re.sub(r"([a-z])([0-9])", r"\1 \2", s)
        s = re.sub(r"([0-9])([a-z])", r"\1 \2", s)
        # but remove spaces which could be a suffix
        s = re.sub(r"([0-9])\s+(st|nd|rd|th|s)\b", r"\1\2", s)
        return s
    def postprocess(self, s: str):
        def combine_cents(m: Match):
            try:
                currency = m.group(1)
                integer = m.group(2)
                cents = int(m.group(3))
                return f"{currency}{integer}.{cents:02d}"
            except ValueError:
                return m.group(0)
        def extract_cents(m: Match):
            try:
                return f"¢{int(m.group(1))}"
            except ValueError:
                return m.group(0)
        # apply currency postprocessing; "$2 and ¢7" -> "$2.07"
        s = re.sub(r"([€£$])([0-9]+) (?:and )?¢([0-9]{1,2})\b", combine_cents, s)
        s = re.sub(r"[€£$]0\.([0-9]{1,2})\b", extract_cents, s)
        # write "one(s)" instead of "1(s)", just for readability
        s = re.sub(r"\b1(s?)\b", r"one\1", s)
        return s
    def __call__(self, s: str):
        s = self.preprocess(s)
        s = " ".join(word for word in self.process_words(s.split()) if word is not None)
        s = self.postprocess(s)
        return s
 class EnglishSpellingNormalizer:
    """
    Applies British-American spelling mappings as listed in [1].
    [1] https://www.tysto.com/uk-us-spelling-list.html
    """
    def __init__(self):
        mapping_path = os.path.join(os.path.dirname(__file__), "english.json")
        # close the file handle deterministically instead of leaking it
        with open(mapping_path) as f:
            self.mapping = json.load(f)
    def __call__(self, s: str):
        return " ".join(self.mapping.get(word, word) for word in s.split())
 class EnglishTextNormalizer:
    def __init__(self):
        self.ignore_patterns = r"\b(hmm|mm|mhm|mmm|uh|um)\b"
        self.replacers = {
            # common contractions
            r"\bwon't\b": "will not",
            r"\bcan't\b": "can not",
            r"\blet's\b": "let us",
            r"\bain't\b": "aint",
            r"\by'all\b": "you all",
            r"\bwanna\b": "want to",
            r"\bgotta\b": "got to",
            r"\bgonna\b": "going to",
            r"\bi'ma\b": "i am going to",
            r"\bimma\b": "i am going to",
            r"\bwoulda\b": "would have",
            r"\bcoulda\b": "could have",
            r"\bshoulda\b": "should have",
            r"\bma'am\b": "madam",
            # contractions in titles/prefixes
            r"\bmr\b": "mister ",
            r"\bmrs\b": "missus ",
            r"\bst\b": "saint ",
            r"\bdr\b": "doctor ",
            r"\bprof\b": "professor ",
            r"\bcapt\b": "captain ",
            r"\bgov\b": "governor ",
            r"\bald\b": "alderman ",
            r"\bgen\b": "general ",
            r"\bsen\b": "senator ",
            r"\brep\b": "representative ",
            r"\bpres\b": "president ",
            r"\brev\b": "reverend ",
            r"\bhon\b": "honorable ",
            r"\basst\b": "assistant ",
            r"\bassoc\b": "associate ",
            r"\blt\b": "lieutenant ",
            r"\bcol\b": "colonel ",
            r"\bjr\b": "junior ",
            r"\bsr\b": "senior ",
            r"\besq\b": "esquire ",
            # perfect tenses; ideally this would cover any past participle, but that is harder
            r"'d been\b": " had been",
            r"'s been\b": " has been",
            r"'d gone\b": " had gone",
            r"'s gone\b": " has gone",
            r"'d done\b": " had done",  # "'s done" is ambiguous
            r"'s got\b": " has got",
            # general contractions
            r"n't\b": " not",
            r"'re\b": " are",
            r"'s\b": " is",
            r"'d\b": " would",
            r"'ll\b": " will",
            r"'t\b": " not",
            r"'ve\b": " have",
            r"'m\b": " am",
        }
        self.standardize_numbers = EnglishNumberNormalizer()
        self.standardize_spellings = EnglishSpellingNormalizer()
    def __call__(self, s: str):
        s = s.lower()
        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # remove words between brackets
        s = re.sub(r"\(([^)]+?)\)", "", s)  # remove words between parentheses
        s = re.sub(self.ignore_patterns, "", s)
        s = re.sub(r"\s+'", "'", s)  # when there's a space before an apostrophe
        for pattern, replacement in self.replacers.items():
            s = re.sub(pattern, replacement, s)
        s = re.sub(r"(\d),(\d)", r"\1\2", s)  # remove commas between digits
        s = re.sub(r"\.([^0-9]|$)", r" \1", s)  # remove periods not followed by numbers
        s = remove_symbols_and_diacritics(s, keep=".%$¢€£")  # keep numeric symbols
        s = self.standardize_numbers(s)
        s = self.standardize_spellings(s)
        # now remove prefix/suffix symbols that are not preceded/followed by numbers
        s = re.sub(r"[.$¢€£]([^0-9])", r" \1", s)
        s = re.sub(r"([^0-9])%", r"\1 ", s)
        s = re.sub(r"\s+", " ", s)  # replace any successive whitespaces with a space
        return s
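The `replacers` dictionary above is applied in insertion order, so the specific patterns (for example `won't`) have to run before the general ones (for example `n't`). The following is a minimal standalone sketch of that ordering, not the normalizer itself: `REPLACERS` is a small hand-picked subset of the patterns, and `expand_contractions` is a hypothetical helper name used only for illustration.

```python
import re

# A small subset of the patterns from EnglishTextNormalizer.replacers,
# kept in the same relative order: specific forms first, general forms last.
REPLACERS = {
    r"\bwon't\b": "will not",
    r"\bcan't\b": "can not",
    r"'s been\b": " has been",
    r"n't\b": " not",
    r"'s\b": " is",
    r"'ll\b": " will",
}


def expand_contractions(s: str) -> str:
    """Apply the replacements in dictionary (insertion) order."""
    s = s.lower()
    for pattern, replacement in REPLACERS.items():
        s = re.sub(pattern, replacement, s)
    # collapse any whitespace runs left behind by the substitutions
    return re.sub(r"\s+", " ", s).strip()


print(expand_contractions("She won't go"))   # she will not go
print(expand_contractions("It's been fun"))  # it has been fun
```

If the general `n't` rule ran first, `won't` would become `wo not`, which is why the ordering of the dictionary matters.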
--- a/tests/librispeech/requirements.txt
+++ b/tests/librispeech/requirements.txt
@@ -0,0 +1,6 @@
 # This is the minimal set of dependencies we need to compute
 # WER score. Read Section 3.2. of the original paper
 # (https://arxiv.org/abs/2212.04356) for more contexts.
 jiwer
 regex
 more-itertools
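For context on what these dependencies are used for: the word error rate (WER) is the word-level edit distance between the normalized reference and hypothesis, divided by the number of reference words. The sketch below is a pure-Python illustration of that definition under the assumption of whitespace tokenization. It is not how `jiwer` is implemented, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# one substitution ("sat" -> "sit") and one deletion ("the") over 6 words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Both strings would normally be passed through the text normalizer first, so that differences in casing, punctuation, or number formatting are not counted as errors.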
@@ -1 +1 @@
-7d7aa2dee2eb55dc683af80b769b81a0642226a1
+d920dfd7da37b22d1eb0813cdaf340c1870d76c3
@@ -0,0 +1,2 @@
 from .basic import BasicTextNormalizer as BasicTextNormalizer
 from .english import EnglishTextNormalizer as EnglishTextNormalizer