Compare commits


16 Commits

Author SHA1 Message Date
8738427dd6 cmake : bump version to 1.1.0 2023-01-15 14:33:13 +02:00
c3991bbb24 Update README.md 2023-01-15 14:08:12 +02:00
00ea21668b whisper : account speed_up flag for short audio (close #405) 2023-01-15 12:42:15 +02:00
0b85e8c401 Update README.md 2023-01-15 11:36:20 +02:00
fafd78945d bench.wasm : print system info 2023-01-15 11:34:03 +02:00
8de452c18b Improve decoding (#291)
* whisper : prepare infra for new decoding strategies

* whisper : apply logit filters and compute logprobs

* whisper : add whisper_get_logits()

* whisper : separate self and cross attention memory

Initial step needed for supporting parallel decoders

* whisper : move probs_id buffer to whisper_context

* whisper : refactor kv cache into separate struct

* whisper : move self-attention kv cache to whisper_decoder

* whisper : wip decoding parameters + strategies

* whisper : wip decoding parameters + strategies (part 2)

* whisper : wip decoding parameters + strategies (part 3)

* whisper : wip decoding parameters + strategies (part 4)

* whisper : fix prompt_past update to not include prompt_init

* whisper : temperature + best_of support

* whisper : support for compression_ratio_threshold

We actually use entropy, but it is similar

* command : fix example to use logits instead of obsolete probs

* whisper : handle empty sequence ranking

* whisper : add WHISPER_DEBUG + diagnostic prints + new main args

* whisper : minor fixes

* whisper : add beam-search support

* whisper : bug fix when there is no previous context

* whisper : add comments

* stream : disable temperature fallback

For real-time processing, we always want a single decoder running at T=0

* whisper.swiftui : update example - fix paths + add empty folders
2023-01-15 11:29:57 +02:00
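The commit above wires the new decoding strategies into `whisper_full_params`. A minimal sketch of how the knobs fit together, assuming the v1.1.0 field names (`greedy.best_of`, `beam_search.beam_size`, `entropy_thold`, `logprob_thold`, `temperature_inc`); the flag-to-parameter mapping is visible in the `examples/main` diff further down:

```cpp
#include "whisper.h"

// Sketch only, not part of this diff: configure greedy sampling with
// best-of candidates, or beam search, plus the decoder-failure fallback.
int transcribe(struct whisper_context * ctx, const float * pcm, int n_samples, bool beam) {
    whisper_full_params wparams = whisper_full_default_params(
        beam ? WHISPER_SAMPLING_BEAM_SEARCH : WHISPER_SAMPLING_GREEDY);

    wparams.greedy.best_of        = 5;     // candidates kept when sampling at T > 0
    wparams.beam_search.beam_size = 5;     // beam width (only used with BEAM_SEARCH)
    wparams.entropy_thold         = 2.4f;  // entropy stand-in for compression_ratio_threshold
    wparams.logprob_thold         = -1.0f; // fail a decoder whose mean logprob drops below this
    wparams.temperature_inc       = 0.2f;  // on failure, raise temperature and retry

    // encoder + one or more decoding passes (the fallback may re-decode)
    return whisper_full(ctx, wparams, pcm, n_samples);
}
```

This fallback loop is also why the `stream` change above pins a single decoder at T=0: retries are too expensive for real-time use.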
a6dbd9188b stream : fix a bug that inserted a lot of empty audio at the start
The quality was terrible due to this
2023-01-14 19:20:47 +02:00
4ef3398e8f ggml : remove obsolete zeroing + comment fixes (#390) 2023-01-08 20:21:03 +02:00
5e9f33596f readme : clarify main and stream usage (#391)
Give an example of ./main that uses a sample file that is already in the repo, and clarify in the stream example that you need to run `make stream`
2023-01-08 20:18:41 +02:00
8d7b29cedd ggml : correct behaviour of ggml_vec_sum_f32 (#390) 2023-01-08 20:06:09 +02:00
08dc705a69 whisper : fix sample_to_timestamp calculation with 64 bit precision to avoid overflow (#388)
* Do calculation with 64 bit precision to avoid overflow

* Update whisper.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-01-08 15:08:45 +02:00
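For context on the overflow: whisper.cpp keeps segment timestamps in 10 ms units (the CSV note further down multiplies `t0`/`t1` by 10 to get milliseconds), so the conversion is on the order of `100 * i_sample / WHISPER_SAMPLE_RATE`; the exact in-tree expression is an assumption here. With a 32-bit intermediate, `100 * i_sample` exceeds `INT32_MAX` after about 21.5 M samples, i.e. roughly 22 minutes of 16 kHz audio:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t sample_rate = 16000;
    const int64_t i = 30LL * 60 * sample_rate; // sample index 30 minutes into the audio

    // 100 * i = 2'880'000'000 no longer fits in 32 bits, so the timestamp goes negative
    const int32_t bad  = (int32_t) (100 * i) / (int32_t) sample_rate;
    const int64_t good = (100LL * i) / sample_rate; // widen before multiplying, as in #388

    printf("32-bit: %d, 64-bit: %lld (10 ms units)\n", (int) bad, (long long) good);
    return 0;
}
```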
1512545149 whisper : add loader class to allow loading from buffer and others (#353)
* whisper : add loader to allow loading from other than file

* whisper : rename whisper_init to whisper_init_from_file

* whisper : add whisper_init_from_buffer

* android : Delete local.properties

* android : load models directly from assets

* whisper : adding <stddef.h> needed for size_t + code style

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-01-08 13:03:33 +02:00
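After this commit there are three init entry points: `whisper_init_from_file`, `whisper_init_from_buffer`, and a generic `whisper_init` taking a `whisper_model_loader` with read/eof/close callbacks (the latter is what lets the Android example load models directly from assets). A minimal sketch of the two simple paths; the file-reading boilerplate is illustrative, not from this diff:

```cpp
#include "whisper.h"

#include <fstream>
#include <iterator>
#include <vector>

int main() {
    // the classic path, renamed by this commit (was whisper_init)
    struct whisper_context * ctx_file = whisper_init_from_file("models/ggml-base.en.bin");

    // the new path: hand whisper a model image that is already in memory
    std::ifstream f("models/ggml-base.en.bin", std::ios::binary);
    std::vector<char> buf((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());
    struct whisper_context * ctx_buf = whisper_init_from_buffer(buf.data(), buf.size());

    whisper_free(ctx_file);
    whisper_free(ctx_buf);
    return 0;
}
```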
52a3e0c92a ggml : improve vec_dot_f16 unrolling in flash_attn_f16 2023-01-08 11:41:18 +02:00
d1ea1220ff command : clean-up / refactoring / formatting (#383) 2023-01-07 21:43:24 +02:00
9c4a1522f6 command : always-prompt mode (#383) 2023-01-07 21:41:11 +02:00
f078a6f20e go : adding features to the go-whisper example, go ci, etc (#384)
* Updated bindings so they can be used in third party packages.

* Updated makefiles to optionally set the FMA flag, for Xeon E5 on Darwin

* Added test script

* Changes for examples

* Reverted

* Made the NewContext method private
2023-01-07 21:21:43 +02:00
41 changed files with 2570 additions and 1272 deletions

.github/workflows/bindings.yml (new file)

@@ -0,0 +1,17 @@
name: Bindings Tests
on:
push:
paths:
- bindings/go/**
jobs:
ubuntu-latest:
runs-on: ubuntu-latest
steps:
- uses: actions/setup-go@v3
with:
go-version: '^1.19'
- uses: actions/checkout@v1
- run: |
cd bindings/go
make test

.gitignore

@@ -8,6 +8,7 @@ build/
build-em/
build-debug/
build-release/
build-static/
build-sanitize-addr/
build-sanitize-thread/
@@ -18,6 +19,7 @@ build-sanitize-thread/
/bench
sync.sh
libwhisper.a
libwhisper.so
compile_commands.json

CMakeLists.txt

@@ -1,6 +1,6 @@
cmake_minimum_required (VERSION 3.0)
project(whisper.cpp VERSION 1.0.4)
project(whisper.cpp VERSION 1.1.0)
# Add path to modules
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/")

README.md

@@ -4,7 +4,7 @@
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![npm](https://img.shields.io/npm/v/whisper.cpp.svg)](https://www.npmjs.com/package/whisper.cpp/)
[Roadmap | F.A.Q.](https://github.com/ggerganov/whisper.cpp/discussions/126)
Stable: [1.0.4](https://github.com/ggerganov/whisper.cpp/releases/tag/1.0.4) / Beta: [1.1.0](https://github.com/ggerganov/whisper.cpp/releases/tag/1.1.0) / [Roadmap | F.A.Q.](https://github.com/ggerganov/whisper.cpp/discussions/126)
High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisper) automatic speech recognition (ASR) model:
@@ -71,7 +71,7 @@ Now build the [main](examples/main) example and transcribe an audio file like this:
make
# transcribe an audio file
./main -f input.wav
./main -f samples/jfk.wav
```
---
@@ -89,27 +89,36 @@ c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main
usage: ./main [options] file0.wav file1.wav ...
options:
-h, --help [default] show this help message and exit
-t N, --threads N [4 ] number of threads to use during computation
-p N, --processors N [1 ] number of processors to use during computation
-ot N, --offset-t N [0 ] time offset in milliseconds
-on N, --offset-n N [0 ] segment index offset
-d N, --duration N [0 ] duration of audio to process in milliseconds
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-su, --speed-up [false ] speed up audio by x2 (reduced accuracy)
-tr, --translate [false ] translate from source language to english
-otxt, --output-txt [false ] output result in a text file
-ovtt, --output-vtt [false ] output result in a vtt file
-osrt, --output-srt [false ] output result in a srt file
-owts, --output-words [false ] output script for generating karaoke video
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
-nt, --no-timestamps [true ] do not print timestamps
-l LANG, --language LANG [en ] spoken language
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input WAV file path
-h, --help [default] show this help message and exit
-t N, --threads N [4 ] number of threads to use during computation
-p N, --processors N [1 ] number of processors to use during computation
-ot N, --offset-t N [0 ] time offset in milliseconds
-on N, --offset-n N [0 ] segment index offset
-d N, --duration N [0 ] duration of audio to process in milliseconds
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-bo N, --best-of N [5 ] number of best candidates to keep
-bs N, --beam-size N [-1 ] beam size for beam search
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
-su, --speed-up [false ] speed up audio by x2 (reduced accuracy)
-tr, --translate [false ] translate from source language to english
-di, --diarize [false ] stereo audio diarization
-otxt, --output-txt [false ] output result in a text file
-ovtt, --output-vtt [false ] output result in a vtt file
-osrt, --output-srt [false ] output result in a srt file
-owts, --output-words [false ] output script for generating karaoke video
-ocsv, --output-csv [false ] output result in a CSV file
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
-pp, --print-progress [false ] print progress
-nt, --no-timestamps [true ] do not print timestamps
-l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
--prompt PROMPT [ ] initial prompt
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input WAV file path
bash ./models/download-ggml-model.sh base.en
Downloading ggml model base.en ...
@@ -212,17 +221,7 @@ make large
## Limitations
- Inference only
- No GPU support
- Very basic greedy sampling scheme - always pick up the token with highest probability.
This should be similar to the [GreedyDecoder](https://github.com/openai/whisper/blob/main/whisper/decoding.py#L249-L274)
from the original python implementation, so in order to make a fair comparison between the 2 implementations, make sure
to run the python code with the following parameters:
```
whisper --best_of None --beam_size None ...
```
In the future, `whisper.cpp` will support more sampling strategies.
- No GPU support (yet)
## Another example
@@ -307,6 +306,7 @@ The [stream](examples/stream) tool samples the audio every half a second and runs the transcription continuously.
More info is available in [issue #10](https://github.com/ggerganov/whisper.cpp/issues/10).
```java
make stream
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
```

bindings/go/examples/go-whisper/color.go (new file)

@@ -0,0 +1,22 @@
package main
import "fmt"
///////////////////////////////////////////////////////////////////////////////
// CONSTANTS
const (
Reset = "\033[0m"
RGBPrefix = "\033[38;5;" // followed by RGB values in decimal format separated by colons
RGBSuffix = "m"
)
///////////////////////////////////////////////////////////////////////////////
// PUBLIC METHODS
// Colorize text with RGB values, from 0 to 23
func Colorize(text string, v int) string {
// https://en.wikipedia.org/wiki/ANSI_escape_code#8-bit
// Grayscale colors are in the range 232-255
return RGBPrefix + fmt.Sprint(v%24+232) + RGBSuffix + text + Reset
}

bindings/go/examples/go-whisper/flags.go

@@ -2,6 +2,12 @@ package main
import (
"flag"
"fmt"
"strings"
"time"
// Packages
whisper "github.com/ggerganov/whisper.cpp/bindings/go/pkg/whisper"
)
///////////////////////////////////////////////////////////////////////////////
@@ -42,6 +48,26 @@ func (flags *Flags) GetLanguage() string {
return flags.Lookup("language").Value.String()
}
func (flags *Flags) IsTranslate() bool {
return flags.Lookup("translate").Value.(flag.Getter).Get().(bool)
}
func (flags *Flags) GetOffset() time.Duration {
return flags.Lookup("offset").Value.(flag.Getter).Get().(time.Duration)
}
func (flags *Flags) GetDuration() time.Duration {
return flags.Lookup("duration").Value.(flag.Getter).Get().(time.Duration)
}
func (flags *Flags) GetThreads() uint {
return flags.Lookup("threads").Value.(flag.Getter).Get().(uint)
}
func (flags *Flags) GetOut() string {
return strings.ToLower(flags.Lookup("out").Value.String())
}
func (flags *Flags) IsSpeedup() bool {
return flags.Lookup("speedup").Value.String() == "true"
}
@@ -50,12 +76,81 @@ func (flags *Flags) IsTokens() bool {
return flags.Lookup("tokens").Value.String() == "true"
}
func (flags *Flags) IsColorize() bool {
return flags.Lookup("colorize").Value.String() == "true"
}
func (flags *Flags) GetMaxLen() uint {
return flags.Lookup("max-len").Value.(flag.Getter).Get().(uint)
}
func (flags *Flags) GetMaxTokens() uint {
return flags.Lookup("max-tokens").Value.(flag.Getter).Get().(uint)
}
func (flags *Flags) GetWordThreshold() float32 {
return float32(flags.Lookup("word-thold").Value.(flag.Getter).Get().(float64))
}
func (flags *Flags) SetParams(context whisper.Context) error {
if lang := flags.GetLanguage(); lang != "" && lang != "auto" {
fmt.Fprintf(flags.Output(), "Setting language to %q\n", lang)
if err := context.SetLanguage(lang); err != nil {
return err
}
}
if flags.IsTranslate() && context.IsMultilingual() {
fmt.Fprintf(flags.Output(), "Setting translate to true\n")
context.SetTranslate(true)
}
if offset := flags.GetOffset(); offset != 0 {
fmt.Fprintf(flags.Output(), "Setting offset to %v\n", offset)
context.SetOffset(offset)
}
if duration := flags.GetDuration(); duration != 0 {
fmt.Fprintf(flags.Output(), "Setting duration to %v\n", duration)
context.SetDuration(duration)
}
if flags.IsSpeedup() {
fmt.Fprintf(flags.Output(), "Setting speedup to true\n")
context.SetSpeedup(true)
}
if threads := flags.GetThreads(); threads != 0 {
fmt.Fprintf(flags.Output(), "Setting threads to %d\n", threads)
context.SetThreads(threads)
}
if max_len := flags.GetMaxLen(); max_len != 0 {
fmt.Fprintf(flags.Output(), "Setting max_segment_length to %d\n", max_len)
context.SetMaxSegmentLength(max_len)
}
if max_tokens := flags.GetMaxTokens(); max_tokens != 0 {
fmt.Fprintf(flags.Output(), "Setting max_tokens to %d\n", max_tokens)
context.SetMaxTokensPerSegment(max_tokens)
}
if word_threshold := flags.GetWordThreshold(); word_threshold != 0 {
fmt.Fprintf(flags.Output(), "Setting word_threshold to %f\n", word_threshold)
context.SetTokenThreshold(word_threshold)
}
// Return success
return nil
}
///////////////////////////////////////////////////////////////////////////////
// PRIVATE METHODS
func registerFlags(flag *Flags) {
flag.String("model", "", "Path to the model file")
flag.String("language", "", "Language")
flag.String("language", "", "Spoken language")
flag.Bool("translate", false, "Translate from source language to english")
flag.Duration("offset", 0, "Time offset")
flag.Duration("duration", 0, "Duration of audio to process")
flag.Uint("threads", 0, "Number of threads to use")
flag.Bool("speedup", false, "Enable speedup")
flag.Uint("max-len", 0, "Maximum segment length in characters")
flag.Uint("max-tokens", 0, "Maximum tokens per segment")
flag.Float64("word-thold", 0, "Maximum segment score")
flag.Bool("tokens", false, "Display tokens")
flag.Bool("colorize", false, "Colorize tokens")
flag.String("out", "", "Output format (srt, none or leave as empty string)")
}

bindings/go/examples/go-whisper/main.go

@@ -35,8 +35,7 @@ func main() {
// Process files
for _, filename := range flags.Args() {
fmt.Println("Processing", filename)
if err := Process(model, filename, flags.GetLanguage(), flags.IsSpeedup(), flags.IsTokens()); err != nil {
if err := Process(model, filename, flags); err != nil {
fmt.Fprintln(os.Stderr, err)
continue
}

bindings/go/examples/go-whisper/process.go

@@ -11,7 +11,7 @@ import (
wav "github.com/go-audio/wav"
)
func Process(model whisper.Model, path string, lang string, speedup, tokens bool) error {
func Process(model whisper.Model, path string, flags *Flags) error {
var data []float32
// Create processing context
@@ -20,14 +20,20 @@ func Process(model whisper.Model, path string, lang string, speedup, tokens bool) error {
return err
}
// Set the parameters
if err := flags.SetParams(context); err != nil {
return err
}
// Open the file
fmt.Fprintf(flags.Output(), "Loading %q\n", path)
fh, err := os.Open(path)
if err != nil {
return err
}
defer fh.Close()
// Decode the WAV file
// Decode the WAV file - load the full buffer
dec := wav.NewDecoder(fh)
if buf, err := dec.FullPCMBuffer(); err != nil {
return err
@@ -39,42 +45,83 @@ func Process(model whisper.Model, path string, lang string, speedup, tokens bool) error {
data = buf.AsFloat32Buffer().Data
}
// Set the parameters
// Segment callback when -tokens is specified
var cb whisper.SegmentCallback
if lang != "" {
if err := context.SetLanguage(lang); err != nil {
return err
}
}
if speedup {
context.SetSpeedup(true)
}
if tokens {
if flags.IsTokens() {
cb = func(segment whisper.Segment) {
fmt.Printf("%02d [%6s->%6s] ", segment.Num, segment.Start.Truncate(time.Millisecond), segment.End.Truncate(time.Millisecond))
fmt.Fprintf(flags.Output(), "%02d [%6s->%6s] ", segment.Num, segment.Start.Truncate(time.Millisecond), segment.End.Truncate(time.Millisecond))
for _, token := range segment.Tokens {
fmt.Printf("%q ", token.Text)
if flags.IsColorize() && context.IsText(token) {
fmt.Fprint(flags.Output(), Colorize(token.Text, int(token.P*24.0)), " ")
} else {
fmt.Fprint(flags.Output(), token.Text, " ")
}
}
fmt.Println("")
fmt.Fprintln(flags.Output(), "")
fmt.Fprintln(flags.Output(), "")
}
}
// Process the data
fmt.Fprintf(flags.Output(), " ...processing %q\n", path)
if err := context.Process(data, cb); err != nil {
return err
}
// Print out the results
switch {
case flags.GetOut() == "srt":
return OutputSRT(os.Stdout, context)
case flags.GetOut() == "none":
return nil
default:
return Output(os.Stdout, context, flags.IsColorize())
}
}
// Output text as SRT file
func OutputSRT(w io.Writer, context whisper.Context) error {
n := 1
for {
segment, err := context.NextSegment()
if err == io.EOF {
break
return nil
} else if err != nil {
return err
}
fmt.Printf("[%6s->%6s] %s\n", segment.Start.Truncate(time.Millisecond), segment.End.Truncate(time.Millisecond), segment.Text)
fmt.Fprintln(w, n)
fmt.Fprintln(w, srtTimestamp(segment.Start), " --> ", srtTimestamp(segment.End))
fmt.Fprintln(w, segment.Text)
fmt.Fprintln(w, "")
n++
}
// Return success
return nil
}
// Output text to terminal
func Output(w io.Writer, context whisper.Context, colorize bool) error {
for {
segment, err := context.NextSegment()
if err == io.EOF {
return nil
} else if err != nil {
return err
}
fmt.Fprintf(w, "[%6s->%6s]", segment.Start.Truncate(time.Millisecond), segment.End.Truncate(time.Millisecond))
if colorize {
for _, token := range segment.Tokens {
if !context.IsText(token) {
continue
}
fmt.Fprint(w, " ", Colorize(token.Text, int(token.P*24.0)))
}
fmt.Fprint(w, "\n")
} else {
fmt.Fprintln(w, " ", segment.Text)
}
}
}
// Return srtTimestamp
func srtTimestamp(t time.Duration) string {
return fmt.Sprintf("%02d:%02d:%02d,%03d", t/time.Hour, (t%time.Hour)/time.Minute, (t%time.Minute)/time.Second, (t%time.Second)/time.Millisecond)
}

bindings/go/params.go

@@ -47,6 +47,7 @@ func (p *Params) SetSpeedup(v bool) {
p.speed_up = toBool(v)
}
// Set language id
func (p *Params) SetLanguage(lang int) error {
str := C.whisper_lang_str(C.int(lang))
if str == nil {
@@ -57,6 +58,7 @@ func (p *Params) SetLanguage(lang int) error {
return nil
}
// Get language id
func (p *Params) Language() int {
if p.language == nil {
return -1
@@ -64,18 +66,41 @@ func (p *Params) Language() int {
return int(C.whisper_lang_id(p.language))
}
// Set number of threads to use
func (p *Params) SetThreads(threads int) {
p.n_threads = C.int(threads)
}
// Set start offset in ms
func (p *Params) SetOffset(offset_ms int) {
p.offset_ms = C.int(offset_ms)
}
// Set audio duration to process in ms
func (p *Params) SetDuration(duration_ms int) {
p.duration_ms = C.int(duration_ms)
}
// Set timestamp token probability threshold (~0.01)
func (p *Params) SetTokenThreshold(t float32) {
p.thold_pt = C.float(t)
}
// Set timestamp token sum probability threshold (~0.01)
func (p *Params) SetTokenSumThreshold(t float32) {
p.thold_ptsum = C.float(t)
}
// Set max segment length in characters
func (p *Params) SetMaxSegmentLength(n int) {
p.max_len = C.int(n)
}
// Set max tokens per segment (0 = no limit)
func (p *Params) SetMaxTokensPerSegment(n int) {
p.max_tokens = C.int(n)
}
///////////////////////////////////////////////////////////////////////////////
// PRIVATE METHODS

bindings/go/pkg/whisper/consts.go

@@ -11,10 +11,11 @@ import (
// ERRORS
var (
ErrUnableToLoadModel = errors.New("unable to load model")
ErrInternalAppError = errors.New("internal application error")
ErrProcessingFailed = errors.New("processing failed")
ErrUnsupportedLanguage = errors.New("unsupported language")
ErrUnableToLoadModel = errors.New("unable to load model")
ErrInternalAppError = errors.New("internal application error")
ErrProcessingFailed = errors.New("processing failed")
ErrUnsupportedLanguage = errors.New("unsupported language")
ErrModelNotMultilingual = errors.New("model is not multilingual")
)
///////////////////////////////////////////////////////////////////////////////

bindings/go/pkg/whisper/context.go

@@ -24,7 +24,7 @@ var _ Context = (*context)(nil)
///////////////////////////////////////////////////////////////////////////////
// LIFECYCLE
func NewContext(model *model, params whisper.Params) (Context, error) {
func newContext(model *model, params whisper.Params) (Context, error) {
context := new(context)
context.model = model
context.params = params
@@ -41,6 +44,9 @@ func (context *context) SetLanguage(lang string) error {
if context.model.ctx == nil {
return ErrInternalAppError
}
if !context.model.IsMultilingual() {
return ErrModelNotMultilingual
}
if id := context.model.ctx.Whisper_lang_id(lang); id < 0 {
return ErrUnsupportedLanguage
} else if err := context.params.SetLanguage(id); err != nil {
@@ -50,16 +53,60 @@ func (context *context) SetLanguage(lang string) error {
return nil
}
func (context *context) IsMultilingual() bool {
return context.model.IsMultilingual()
}
// Get language
func (context *context) Language() string {
return whisper.Whisper_lang_str(context.params.Language())
}
// Set translate flag
func (context *context) SetTranslate(v bool) {
context.params.SetTranslate(v)
}
// Set speedup flag
func (context *context) SetSpeedup(v bool) {
context.params.SetSpeedup(v)
}
// Set number of threads to use
func (context *context) SetThreads(v uint) {
context.params.SetThreads(int(v))
}
// Set time offset
func (context *context) SetOffset(v time.Duration) {
context.params.SetOffset(int(v.Milliseconds()))
}
// Set duration of audio to process
func (context *context) SetDuration(v time.Duration) {
context.params.SetDuration(int(v.Milliseconds()))
}
// Set timestamp token probability threshold (~0.01)
func (context *context) SetTokenThreshold(t float32) {
context.params.SetTokenThreshold(t)
}
// Set timestamp token sum probability threshold (~0.01)
func (context *context) SetTokenSumThreshold(t float32) {
context.params.SetTokenSumThreshold(t)
}
// Set max segment length in characters
func (context *context) SetMaxSegmentLength(n uint) {
context.params.SetMaxSegmentLength(int(n))
}
// Set max tokens per segment (0 = no limit)
func (context *context) SetMaxTokensPerSegment(n uint) {
context.params.SetMaxTokensPerSegment(int(n))
}
// Process new sample data and return any errors
func (context *context) Process(data []float32, cb SegmentCallback) error {
if context.model.ctx == nil {
@@ -119,6 +166,65 @@ func (context *context) NextSegment() (Segment, error) {
return result, nil
}
// Test for text tokens
func (context *context) IsText(t Token) bool {
switch {
case context.IsBEG(t):
return false
case context.IsSOT(t):
return false
case whisper.Token(t.Id) >= context.model.ctx.Whisper_token_eot():
return false
case context.IsPREV(t):
return false
case context.IsSOLM(t):
return false
case context.IsNOT(t):
return false
default:
return true
}
}
// Test for "begin" token
func (context *context) IsBEG(t Token) bool {
return whisper.Token(t.Id) == context.model.ctx.Whisper_token_beg()
}
// Test for "start of transcription" token
func (context *context) IsSOT(t Token) bool {
return whisper.Token(t.Id) == context.model.ctx.Whisper_token_sot()
}
// Test for "end of transcription" token
func (context *context) IsEOT(t Token) bool {
return whisper.Token(t.Id) == context.model.ctx.Whisper_token_eot()
}
// Test for "start of prev" token
func (context *context) IsPREV(t Token) bool {
return whisper.Token(t.Id) == context.model.ctx.Whisper_token_prev()
}
// Test for "start of lm" token
func (context *context) IsSOLM(t Token) bool {
return whisper.Token(t.Id) == context.model.ctx.Whisper_token_solm()
}
// Test for "No timestamps" token
func (context *context) IsNOT(t Token) bool {
return whisper.Token(t.Id) == context.model.ctx.Whisper_token_not()
}
// Test for token associated with a specific language
func (context *context) IsLANG(t Token, lang string) bool {
if id := context.model.ctx.Whisper_lang_id(lang); id >= 0 {
return whisper.Token(t.Id) == context.model.ctx.Whisper_token_lang(id)
} else {
return false
}
}
///////////////////////////////////////////////////////////////////////////////
// PRIVATE METHODS

bindings/go/pkg/whisper/interface.go

@@ -20,6 +20,9 @@ type Model interface {
// Return a new speech-to-text context.
NewContext() (Context, error)
// Return true if the model is multilingual.
IsMultilingual() bool
// Return all languages supported.
Languages() []string
}
@@ -27,8 +30,18 @@ type Model interface {
// Context is the speech recognition context.
type Context interface {
SetLanguage(string) error // Set the language to use for speech recognition.
SetTranslate(bool) // Set translate flag
IsMultilingual() bool // Return true if the model is multilingual.
Language() string // Get language
SetSpeedup(bool) // Set speedup flag
SetOffset(time.Duration) // Set offset
SetDuration(time.Duration) // Set duration
SetThreads(uint) // Set number of threads to use
SetSpeedup(bool) // Set speedup flag
SetTokenThreshold(float32) // Set timestamp token probability threshold
SetTokenSumThreshold(float32) // Set timestamp token sum probability threshold
SetMaxSegmentLength(uint) // Set max segment length in characters
SetMaxTokensPerSegment(uint) // Set max tokens per segment (0 = no limit)
// Process mono audio data and return any errors.
// If defined, newly generated segments are passed to the
@@ -38,6 +51,15 @@ type Context interface {
// After process is called, return segments until the end of the stream
// is reached, when io.EOF is returned.
NextSegment() (Segment, error)
IsBEG(Token) bool // Test for "begin" token
IsSOT(Token) bool // Test for "start of transcription" token
IsEOT(Token) bool // Test for "end of transcription" token
IsPREV(Token) bool // Test for "start of prev" token
IsSOLM(Token) bool // Test for "start of lm" token
IsNOT(Token) bool // Test for "No timestamps" token
IsLANG(Token, string) bool // Test for token associated with a specific language
IsText(Token) bool // Test for text token
}
// Segment is the text result of a speech recognition.

bindings/go/pkg/whisper/model.go

@@ -23,7 +23,7 @@ var _ Model = (*model)(nil)
///////////////////////////////////////////////////////////////////////////////
// LIFECYCLE
func New(path string) (*model, error) {
func New(path string) (Model, error) {
model := new(model)
if _, err := os.Stat(path); err != nil {
return nil, err
@@ -64,6 +64,11 @@ func (model *model) String() string {
///////////////////////////////////////////////////////////////////////////////
// PUBLIC METHODS
// Return true if model is multilingual (language and translation options are supported)
func (model *model) IsMultilingual() bool {
return model.ctx.Whisper_is_multilingual() != 0
}
// Return all recognized languages. Initially it is set to auto-detect
func (model *model) Languages() []string {
result := make([]string, 0, whisper.Whisper_lang_max_id())
@@ -91,5 +96,5 @@ func (model *model) NewContext() (Context, error) {
params.SetThreads(runtime.NumCPU())
// Return new context
return NewContext(model, params)
return newContext(model, params)
}

bindings/go/whisper.go

@@ -91,7 +91,7 @@ var (
func Whisper_init(path string) *Context {
cPath := C.CString(path)
defer C.free(unsafe.Pointer(cPath))
if ctx := C.whisper_init(cPath); ctx != nil {
if ctx := C.whisper_init_from_file(cPath); ctx != nil {
return (*Context)(ctx)
} else {
return nil

bindings/javascript/emscripten.cpp

@@ -20,7 +20,7 @@ struct whisper_context * g_context;
EMSCRIPTEN_BINDINGS(whisper) {
emscripten::function("init", emscripten::optional_override([](const std::string & path_model) {
if (g_context == nullptr) {
g_context = whisper_init(path_model.c_str());
g_context = whisper_init_from_file(path_model.c_str());
if (g_context != nullptr) {
return true;
} else {

bindings/javascript/package.json

@@ -1,6 +1,6 @@
{
"name": "whisper.cpp",
"version": "1.0.4",
"version": "1.1.0",
"description": "Whisper speech recognition",
"main": "whisper.js",
"scripts": {

File diff suppressed because one or more lines are too long

examples/bench.wasm/emscripten.cpp

@@ -28,6 +28,11 @@ void bench_main(size_t index) {
return;
}
{
fprintf(stderr, "\n");
fprintf(stderr, "system_info: n_threads = %d / %d | %s\n", n_threads, std::thread::hardware_concurrency(), whisper_print_system_info());
}
const int ret = whisper_encode(ctx, 0, n_threads);
if (ret != 0) {
fprintf(stderr, "error: failed to encode model: %d\n", ret);
return;
@@ -52,7 +57,7 @@ EMSCRIPTEN_BINDINGS(bench) {
emscripten::function("init", emscripten::optional_override([](const std::string & path_model) {
for (size_t i = 0; i < g_contexts.size(); ++i) {
if (g_contexts[i] == nullptr) {
g_contexts[i] = whisper_init(path_model.c_str());
g_contexts[i] = whisper_init_from_file(path_model.c_str());
if (g_contexts[i] != nullptr) {
if (g_worker.joinable()) {
g_worker.join();

examples/bench/bench.cpp

@@ -53,7 +53,7 @@ int main(int argc, char ** argv) {
// whisper init
struct whisper_context * ctx = whisper_init(params.model.c_str());
struct whisper_context * ctx = whisper_init_from_file(params.model.c_str());
{
fprintf(stderr, "\n");

examples/command.wasm/emscripten.cpp

@@ -324,7 +324,7 @@ EMSCRIPTEN_BINDINGS(command) {
emscripten::function("init", emscripten::optional_override([](const std::string & path_model) {
for (size_t i = 0; i < g_contexts.size(); ++i) {
if (g_contexts[i] == nullptr) {
g_contexts[i] = whisper_init(path_model.c_str());
g_contexts[i] = whisper_init_from_file(path_model.c_str());
if (g_contexts[i] != nullptr) {
g_running = true;
if (g_worker.joinable()) {

examples/command/command.cpp

@@ -11,6 +11,7 @@
#include <SDL.h>
#include <SDL_audio.h>
#include <sstream>
#include <cassert>
#include <cstdio>
#include <fstream>
@@ -25,7 +26,7 @@
struct whisper_params {
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
int32_t prompt_ms = 5000;
int32_t command_ms = 4000;
int32_t command_ms = 8000;
int32_t capture_id = -1;
int32_t max_tokens = 32;
int32_t audio_ctx = 0;
@@ -43,6 +44,7 @@ struct whisper_params {
std::string model = "models/ggml-base.en.bin";
std::string fname_out;
std::string commands;
std::string prompt;
};
void whisper_print_usage(int argc, char ** argv, const whisper_params & params);
@@ -71,6 +73,7 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
else if (arg == "-m" || arg == "--model") { params.model = argv[++i]; }
else if (arg == "-f" || arg == "--file") { params.fname_out = argv[++i]; }
else if (arg == "-cmd" || arg == "--commands") { params.commands = argv[++i]; }
else if (arg == "-p" || arg == "--prompt") { params.prompt = argv[++i]; }
else {
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
whisper_print_usage(argc, argv, params);
@@ -103,6 +106,7 @@ void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & params) {
fprintf(stderr, " -m FNAME, --model FNAME [%-7s] model path\n", params.model.c_str());
fprintf(stderr, " -f FNAME, --file FNAME [%-7s] text output file name\n", params.fname_out.c_str());
fprintf(stderr, " -cmd FNAME, --commands FNAME [%-7s] text file with allowed commands\n", params.commands.c_str());
fprintf(stderr, " -p, --prompt [%-7s] the required activation prompt\n", params.prompt.c_str());
fprintf(stderr, "\n");
}
@@ -510,331 +514,431 @@ std::vector<std::string> read_allowed_commands(const std::string & fname) {
return allowed_commands;
}
std::vector<std::string> get_words(const std::string &txt) {
std::vector<std::string> words;
std::istringstream iss(txt);
std::string word;
while (iss >> word) {
words.push_back(word);
}
return words;
}
// returns true if no exit event was received
bool process_sdl_events() {
SDL_Event event;
while (SDL_PollEvent(&event)) {
switch (event.type) {
case SDL_QUIT:
{
return false;
} break;
default:
break;
}
}
return true;
}
// command-list mode
// guide the transcription to match the most likely command from a provided list
int process_command_list(struct whisper_context * ctx, audio_async &audio, const whisper_params &params) {
    fprintf(stderr, "\n");
    fprintf(stderr, "%s: guided mode\n", __func__);

    std::vector<std::string> allowed_commands = read_allowed_commands(params.commands);

    if (allowed_commands.empty()) {
        fprintf(stderr, "%s: error: failed to read allowed commands from '%s'\n", __func__, params.commands.c_str());
        return 2;
    }

    int max_len = 0;

    std::vector<std::vector<whisper_token>> allowed_tokens;

    for (const auto & cmd : allowed_commands) {
        whisper_token tokens[1024];
        allowed_tokens.emplace_back();

        for (int l = 0; l < (int) cmd.size(); ++l) {
            // NOTE: very important to add the whitespace !
            // the reason is that the first decoded token starts with a whitespace too!
            std::string ss = std::string(" ") + cmd.substr(0, l + 1);

            const int n = whisper_tokenize(ctx, ss.c_str(), tokens, 1024);
            if (n < 0) {
                fprintf(stderr, "%s: error: failed to tokenize command '%s'\n", __func__, cmd.c_str());
                return 3;
            }

            if (n == 1) {
                allowed_tokens.back().push_back(tokens[0]);
            }
        }

        max_len = std::max(max_len, (int) cmd.size());
    }

    fprintf(stderr, "%s: allowed commands [ tokens ]:\n", __func__);
    fprintf(stderr, "\n");
    for (int i = 0; i < (int) allowed_commands.size(); ++i) {
        fprintf(stderr, "  - \033[1m%-*s\033[0m = [", max_len, allowed_commands[i].c_str());
        for (const auto & token : allowed_tokens[i]) {
            fprintf(stderr, " %5d", token);
        }
        fprintf(stderr, " ]\n");
    }

    std::string k_prompt = "select one from the available words: ";
    for (int i = 0; i < (int) allowed_commands.size(); ++i) {
        if (i > 0) {
            k_prompt += ", ";
        }
        k_prompt += allowed_commands[i];
    }
    k_prompt += ". selected word: ";

    // tokenize prompt
    std::vector<whisper_token> k_tokens;
    {
        k_tokens.resize(1024);
        const int n = whisper_tokenize(ctx, k_prompt.c_str(), k_tokens.data(), 1024);
        if (n < 0) {
            fprintf(stderr, "%s: error: failed to tokenize prompt '%s'\n", __func__, k_prompt.c_str());
            return 4;
        }
        k_tokens.resize(n);
    }

    fprintf(stderr, "\n");
    fprintf(stderr, "%s: prompt: '%s'\n", __func__, k_prompt.c_str());
    fprintf(stderr, "%s: tokens: [", __func__);
    for (const auto & token : k_tokens) {
        fprintf(stderr, " %d", token);
    }
    fprintf(stderr, " ]\n");

    fprintf(stderr, "\n");
    fprintf(stderr, "%s: listening for a command ...\n", __func__);
    fprintf(stderr, "\n");

    bool is_running = true;

    std::vector<float> pcmf32_cur;
    std::vector<float> pcmf32_prompt;

    // main loop
    while (is_running) {
        // handle Ctrl + C
        is_running = process_sdl_events();

        // delay
        std::this_thread::sleep_for(std::chrono::milliseconds(100));

        audio.get(2000, pcmf32_cur);

        if (vad_simple(pcmf32_cur, WHISPER_SAMPLE_RATE, 1000, params.vad_thold, params.freq_thold, params.print_energy)) {
            fprintf(stdout, "%s: Speech detected! Processing ...\n", __func__);

            const auto t_start = std::chrono::high_resolution_clock::now();

            whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

            wparams.print_progress   = false;
            wparams.print_special    = params.print_special;
            wparams.print_realtime   = false;
            wparams.print_timestamps = !params.no_timestamps;
            wparams.translate        = params.translate;
            wparams.no_context       = true;
            wparams.single_segment   = true;
            wparams.max_tokens       = 1;
            wparams.language         = params.language.c_str();
            wparams.n_threads        = params.n_threads;

            wparams.audio_ctx        = params.audio_ctx;
            wparams.speed_up         = params.speed_up;

            wparams.prompt_tokens    = k_tokens.data();
            wparams.prompt_n_tokens  = k_tokens.size();

            // run the transformer and a single decoding pass
            if (whisper_full(ctx, wparams, pcmf32_cur.data(), pcmf32_cur.size()) != 0) {
                fprintf(stderr, "%s: ERROR: whisper_full() failed\n", __func__);
                break;
            }

            // estimate command probability
            // NOTE: not optimal
            {
                const auto * logits = whisper_get_logits(ctx);

                std::vector<float> probs(whisper_n_vocab(ctx), 0.0f);

                // compute probs from logits via softmax
                {
                    float max = -1e9;
                    for (int i = 0; i < (int) probs.size(); ++i) {
                        max = std::max(max, logits[i]);
                    }

                    float sum = 0.0f;
                    for (int i = 0; i < (int) probs.size(); ++i) {
                        probs[i] = expf(logits[i] - max);
                        sum += probs[i];
                    }

                    for (int i = 0; i < (int) probs.size(); ++i) {
                        probs[i] /= sum;
                    }
                }

                std::vector<std::pair<float, int>> probs_id;

                double psum = 0.0;
                for (int i = 0; i < (int) allowed_commands.size(); ++i) {
                    probs_id.emplace_back(probs[allowed_tokens[i][0]], i);
                    for (int j = 1; j < (int) allowed_tokens[i].size(); ++j) {
                        probs_id.back().first += probs[allowed_tokens[i][j]];
                    }
                    probs_id.back().first /= allowed_tokens[i].size();
                    psum += probs_id.back().first;
                }

                // normalize
                for (auto & p : probs_id) {
                    p.first /= psum;
                }

                // sort descending
                {
                    using pair_type = decltype(probs_id)::value_type;
                    std::sort(probs_id.begin(), probs_id.end(), [](const pair_type & a, const pair_type & b) {
                        return a.first > b.first;
                    });
                }

                // print the commands and the respective probabilities
                {
                    fprintf(stdout, "\n");
                    for (const auto & cmd : probs_id) {
                        fprintf(stdout, "%s: %s%-*s%s = %f | ", __func__, "\033[1m", max_len, allowed_commands[cmd.second].c_str(), "\033[0m", cmd.first);
                        for (int token : allowed_tokens[cmd.second]) {
                            fprintf(stdout, "'%4s' %f ", whisper_token_to_str(ctx, token), probs[token]);
                        }
                        fprintf(stdout, "\n");
                    }
                }

                // best command
                {
                    const auto t_end = std::chrono::high_resolution_clock::now();

                    const float prob  = probs_id[0].first;
                    const int   index = probs_id[0].second;

                    fprintf(stdout, "\n");
                    fprintf(stdout, "%s: detected command: %s%s%s | p = %f | t = %d ms\n", __func__,
                            "\033[1m", allowed_commands[index].c_str(), "\033[0m", prob,
                            (int) std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_start).count());
                    fprintf(stdout, "\n");
                }
            }

            audio.clear();
        }
    }

    return 0;
}

// always-prompt mode
// transcribe the voice into text after valid prompt
int always_prompt_transcription(struct whisper_context * ctx, audio_async & audio, const whisper_params & params) {
    bool is_running = true;
    bool ask_prompt = true;

    float prob = 0.0f;

    std::vector<float> pcmf32_cur;

    const std::string k_prompt = params.prompt;
    const int k_prompt_length = get_words(k_prompt).size();

    fprintf(stderr, "\n");
    fprintf(stderr, "%s: always-prompt mode\n", __func__);

    // main loop
    while (is_running) {
        // handle Ctrl + C
        is_running = process_sdl_events();

        // delay
        std::this_thread::sleep_for(std::chrono::milliseconds(100));

        if (ask_prompt) {
            fprintf(stdout, "\n");
            fprintf(stdout, "%s: The prompt is: '%s%s%s'\n", __func__, "\033[1m", k_prompt.c_str(), "\033[0m");
            fprintf(stdout, "\n");

            ask_prompt = false;
        }

        {
            audio.get(2000, pcmf32_cur);

            if (vad_simple(pcmf32_cur, WHISPER_SAMPLE_RATE, 1000, params.vad_thold, params.freq_thold, params.print_energy)) {
                fprintf(stdout, "%s: Speech detected! Processing ...\n", __func__);

                int64_t t_ms = 0;

                // detect the commands
                audio.get(params.command_ms, pcmf32_cur);

                const auto txt = ::trim(::transcribe(ctx, params, pcmf32_cur, prob, t_ms));

                const auto words = get_words(txt);

                std::string prompt;
                std::string command;

                for (int i = 0; i < (int) words.size(); ++i) {
                    if (i < k_prompt_length) {
                        prompt += words[i] + " ";
                    } else {
                        command += words[i] + " ";
                    }
                }

                const float sim = similarity(prompt, k_prompt);

                //debug
                //fprintf(stdout, "command size: %i\n", command_length);

                if ((sim > 0.7f) && (command.size() > 0)) {
                    fprintf(stdout, "%s: Command '%s%s%s', (t = %d ms)\n", __func__, "\033[1m", command.c_str(), "\033[0m", (int) t_ms);
                }

                fprintf(stdout, "\n");

                audio.clear();
            }
        }
    }

    return 0;
}

// general-purpose mode
// freely transcribe the voice into text
int process_general_transcription(struct whisper_context * ctx, audio_async &audio, const whisper_params &params) {
    bool is_running  = true;
    bool have_prompt = false;
    bool ask_prompt  = true;

    float prob0 = 0.0f;
    float prob  = 0.0f;

    std::vector<float> pcmf32_cur;
    std::vector<float> pcmf32_prompt;

    const std::string k_prompt = "Ok Whisper, start listening for commands.";

    fprintf(stderr, "\n");
    fprintf(stderr, "%s: general-purpose mode\n", __func__);

    // main loop
    while (is_running) {
        // handle Ctrl + C
        is_running = process_sdl_events();

        // delay
        std::this_thread::sleep_for(std::chrono::milliseconds(100));

        if (ask_prompt) {
            fprintf(stdout, "\n");
            fprintf(stdout, "%s: Say the following phrase: '%s%s%s'\n", __func__, "\033[1m", k_prompt.c_str(), "\033[0m");
            fprintf(stdout, "\n");

            ask_prompt = false;
        }

        {
            audio.get(2000, pcmf32_cur);

            if (vad_simple(pcmf32_cur, WHISPER_SAMPLE_RATE, 1000, params.vad_thold, params.freq_thold, params.print_energy)) {
                fprintf(stdout, "%s: Speech detected! Processing ...\n", __func__);

                int64_t t_ms = 0;

                if (!have_prompt) {
                    // wait for activation phrase
                    audio.get(params.prompt_ms, pcmf32_cur);

                    const auto txt = ::trim(::transcribe(ctx, params, pcmf32_cur, prob0, t_ms));

                    fprintf(stdout, "%s: Heard '%s%s%s', (t = %d ms)\n", __func__, "\033[1m", txt.c_str(), "\033[0m", (int) t_ms);

                    const float sim = similarity(txt, k_prompt);

                    if (txt.length() < 0.8*k_prompt.length() || txt.length() > 1.2*k_prompt.length() || sim < 0.8f) {
                        fprintf(stdout, "%s: WARNING: prompt not recognized, try again\n", __func__);
                        ask_prompt = true;
                    } else {
                        fprintf(stdout, "\n");
                        fprintf(stdout, "%s: The prompt has been recognized!\n", __func__);
                        fprintf(stdout, "%s: Waiting for voice commands ...\n", __func__);
                        fprintf(stdout, "\n");

                        // save the audio for the prompt
                        pcmf32_prompt = pcmf32_cur;
                        have_prompt = true;
                    }
                } else {
                    // we have heard the activation phrase, now detect the commands
                    audio.get(params.command_ms, pcmf32_cur);

                    // prepend the prompt audio
                    pcmf32_cur.insert(pcmf32_cur.begin(), pcmf32_prompt.begin(), pcmf32_prompt.end());

                    const auto txt = ::trim(::transcribe(ctx, params, pcmf32_cur, prob, t_ms));

                    prob = 100.0f*(prob - prob0);

                    //fprintf(stdout, "%s: heard '%s'\n", __func__, txt.c_str());

                    // find the prompt in the text
                    float best_sim = 0.0f;
                    size_t best_len = 0;
                    for (int n = 0.8*k_prompt.size(); n <= 1.2*k_prompt.size(); ++n) {
                        const auto prompt = txt.substr(0, n);

                        const float sim = similarity(prompt, k_prompt);

                        //fprintf(stderr, "%s: prompt = '%s', sim = %f\n", __func__, prompt.c_str(), sim);

                        if (sim > best_sim) {
                            best_sim = sim;
                            best_len = n;
                        }
                    }

                    const std::string command = ::trim(txt.substr(best_len));

                    fprintf(stdout, "%s: Command '%s%s%s', (t = %d ms)\n", __func__, "\033[1m", command.c_str(), "\033[0m", (int) t_ms);
                    fprintf(stdout, "\n");
                }

                audio.clear();
            }
        }
    }

    return 0;
}
int main(int argc, char ** argv) {
@@ -852,7 +956,7 @@ int main(int argc, char ** argv) {
// whisper init
struct whisper_context * ctx = whisper_init(params.model.c_str());
struct whisper_context * ctx = whisper_init_from_file(params.model.c_str());
// print some info about the processing
{
@@ -891,9 +995,11 @@ int main(int argc, char ** argv) {
int ret_val = 0;
if (!params.commands.empty()) {
ret_val = process_command_list(ctx, audio, params);
} else if (!params.prompt.empty()) {
ret_val = always_prompt_transcription(ctx, audio, params);
} else {
ret_val = process_general_transcription(ctx, audio, params);
}
audio.pause();

examples/main/main.cpp

@@ -59,8 +59,12 @@ struct whisper_params {
int32_t duration_ms = 0;
int32_t max_context = -1;
int32_t max_len = 0;
int32_t best_of = 5;
int32_t beam_size = -1;
float word_thold = 0.01f;
float entropy_thold = 2.4f;
float logprob_thold = -1.0f;
bool speed_up = false;
bool translate = false;
@@ -104,7 +108,11 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
else if (arg == "-d" || arg == "--duration") { params.duration_ms = std::stoi(argv[++i]); }
else if (arg == "-mc" || arg == "--max-context") { params.max_context = std::stoi(argv[++i]); }
else if (arg == "-ml" || arg == "--max-len") { params.max_len = std::stoi(argv[++i]); }
else if (arg == "-bo" || arg == "--best-of") { params.best_of = std::stoi(argv[++i]); }
else if (arg == "-bs" || arg == "--beam-size") { params.beam_size = std::stoi(argv[++i]); }
else if (arg == "-wt" || arg == "--word-thold") { params.word_thold = std::stof(argv[++i]); }
else if (arg == "-et" || arg == "--entropy-thold") { params.entropy_thold = std::stof(argv[++i]); }
else if (arg == "-lpt" || arg == "--logprob-thold") { params.logprob_thold = std::stof(argv[++i]); }
else if (arg == "-su" || arg == "--speed-up") { params.speed_up = true; }
else if (arg == "-tr" || arg == "--translate") { params.translate = true; }
else if (arg == "-di" || arg == "--diarize") { params.diarize = true; }
@@ -136,31 +144,35 @@ void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & params) {
fprintf(stderr, "usage: %s [options] file0.wav file1.wav ...\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "options:\n");
fprintf(stderr, " -h, --help [default] show this help message and exit\n");
fprintf(stderr, " -t N, --threads N [%-7d] number of threads to use during computation\n", params.n_threads);
fprintf(stderr, " -p N, --processors N [%-7d] number of processors to use during computation\n", params.n_processors);
fprintf(stderr, " -ot N, --offset-t N [%-7d] time offset in milliseconds\n", params.offset_t_ms);
fprintf(stderr, " -on N, --offset-n N [%-7d] segment index offset\n", params.offset_n);
fprintf(stderr, " -d N, --duration N [%-7d] duration of audio to process in milliseconds\n", params.duration_ms);
fprintf(stderr, " -mc N, --max-context N [%-7d] maximum number of text context tokens to store\n", params.max_context);
fprintf(stderr, " -ml N, --max-len N [%-7d] maximum segment length in characters\n", params.max_len);
fprintf(stderr, " -wt N, --word-thold N [%-7.2f] word timestamp probability threshold\n", params.word_thold);
fprintf(stderr, " -su, --speed-up [%-7s] speed up audio by x2 (reduced accuracy)\n", params.speed_up ? "true" : "false");
fprintf(stderr, " -tr, --translate [%-7s] translate from source language to english\n", params.translate ? "true" : "false");
fprintf(stderr, " -di, --diarize [%-7s] stereo audio diarization\n", params.diarize ? "true" : "false");
fprintf(stderr, " -otxt, --output-txt [%-7s] output result in a text file\n", params.output_txt ? "true" : "false");
fprintf(stderr, " -ovtt, --output-vtt [%-7s] output result in a vtt file\n", params.output_vtt ? "true" : "false");
fprintf(stderr, " -osrt, --output-srt [%-7s] output result in a srt file\n", params.output_srt ? "true" : "false");
fprintf(stderr, " -owts, --output-words [%-7s] output script for generating karaoke video\n", params.output_wts ? "true" : "false");
fprintf(stderr, " -ocsv, --output-csv [%-7s] output result in a CSV file\n", params.output_csv ? "true" : "false");
fprintf(stderr, " -ps, --print-special [%-7s] print special tokens\n", params.print_special ? "true" : "false");
fprintf(stderr, " -pc, --print-colors [%-7s] print colors\n", params.print_colors ? "true" : "false");
fprintf(stderr, " -pp, --print-progress [%-7s] print progress\n", params.print_progress ? "true" : "false");
fprintf(stderr, " -nt, --no-timestamps [%-7s] do not print timestamps\n", params.no_timestamps ? "false" : "true");
fprintf(stderr, " -l LANG, --language LANG [%-7s] spoken language ('auto' for auto-detect)\n", params.language.c_str());
fprintf(stderr, " --prompt PROMPT [%-7s] initial prompt\n", params.prompt.c_str());
fprintf(stderr, " -m FNAME, --model FNAME [%-7s] model path\n", params.model.c_str());
fprintf(stderr, " -f FNAME, --file FNAME [%-7s] input WAV file path\n", "");
fprintf(stderr, " -h, --help [default] show this help message and exit\n");
fprintf(stderr, " -t N, --threads N [%-7d] number of threads to use during computation\n", params.n_threads);
fprintf(stderr, " -p N, --processors N [%-7d] number of processors to use during computation\n", params.n_processors);
fprintf(stderr, " -ot N, --offset-t N [%-7d] time offset in milliseconds\n", params.offset_t_ms);
fprintf(stderr, " -on N, --offset-n N [%-7d] segment index offset\n", params.offset_n);
fprintf(stderr, " -d N, --duration N [%-7d] duration of audio to process in milliseconds\n", params.duration_ms);
fprintf(stderr, " -mc N, --max-context N [%-7d] maximum number of text context tokens to store\n", params.max_context);
fprintf(stderr, " -ml N, --max-len N [%-7d] maximum segment length in characters\n", params.max_len);
fprintf(stderr, " -bo N, --best-of N [%-7d] number of best candidates to keep\n", params.best_of);
fprintf(stderr, " -bs N, --beam-size N [%-7d] beam size for beam search\n", params.beam_size);
fprintf(stderr, " -wt N, --word-thold N [%-7.2f] word timestamp probability threshold\n", params.word_thold);
fprintf(stderr, " -et N, --entropy-thold N [%-7.2f] entropy threshold for decoder fail\n", params.entropy_thold);
fprintf(stderr, " -lpt N, --logprob-thold N [%-7.2f] log probability threshold for decoder fail\n", params.logprob_thold);
fprintf(stderr, " -su, --speed-up [%-7s] speed up audio by x2 (reduced accuracy)\n", params.speed_up ? "true" : "false");
fprintf(stderr, " -tr, --translate [%-7s] translate from source language to english\n", params.translate ? "true" : "false");
fprintf(stderr, " -di, --diarize [%-7s] stereo audio diarization\n", params.diarize ? "true" : "false");
fprintf(stderr, " -otxt, --output-txt [%-7s] output result in a text file\n", params.output_txt ? "true" : "false");
fprintf(stderr, " -ovtt, --output-vtt [%-7s] output result in a vtt file\n", params.output_vtt ? "true" : "false");
fprintf(stderr, " -osrt, --output-srt [%-7s] output result in a srt file\n", params.output_srt ? "true" : "false");
fprintf(stderr, " -owts, --output-words [%-7s] output script for generating karaoke video\n", params.output_wts ? "true" : "false");
fprintf(stderr, " -ocsv, --output-csv [%-7s] output result in a CSV file\n", params.output_csv ? "true" : "false");
fprintf(stderr, " -ps, --print-special [%-7s] print special tokens\n", params.print_special ? "true" : "false");
fprintf(stderr, " -pc, --print-colors [%-7s] print colors\n", params.print_colors ? "true" : "false");
fprintf(stderr, " -pp, --print-progress [%-7s] print progress\n", params.print_progress ? "true" : "false");
fprintf(stderr, " -nt, --no-timestamps [%-7s] do not print timestamps\n", params.no_timestamps ? "false" : "true");
fprintf(stderr, " -l LANG, --language LANG [%-7s] spoken language ('auto' for auto-detect)\n", params.language.c_str());
fprintf(stderr, " --prompt PROMPT [%-7s] initial prompt\n", params.prompt.c_str());
fprintf(stderr, " -m FNAME, --model FNAME [%-7s] model path\n", params.model.c_str());
fprintf(stderr, " -f FNAME, --file FNAME [%-7s] input WAV file path\n", "");
fprintf(stderr, "\n");
}
@@ -235,7 +247,7 @@ void whisper_print_segment_callback(struct whisper_context * ctx, int n_new, voi
const char * text = whisper_full_get_token_text(ctx, i, j);
const float p = whisper_full_get_token_p (ctx, i, j);
const int col = std::max(0, std::min((int) k_colors.size(), (int) (std::pow(p, 3)*float(k_colors.size()))));
const int col = std::max(0, std::min((int) k_colors.size() - 1, (int) (std::pow(p, 3)*float(k_colors.size()))));
printf("%s%s%s%s", speaker.c_str(), k_colors[col].c_str(), text, "\033[0m");
}
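The added - 1 clamps the index to the last color: without it, a token probability of 1.0 yields pow(1.0, 3) * k_colors.size() == k_colors.size(), one past the end of the array. For example, with 10 colors and p = 0.9, pow(0.9, 3) ≈ 0.729, so col = min(9, 7) = 7.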
@@ -331,20 +343,19 @@ bool output_csv(struct whisper_context * ctx, const char * fname) {
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; ++i) {
const char * text = whisper_full_get_segment_text(ctx, i);
if (text[0] == ' ')
text = text + sizeof(char); //whisper_full_get_segment_text() returns a string with leading space, point to the next character.
if (text[0] == ' ') {
text = text + sizeof(char); //whisper_full_get_segment_text() returns a string with leading space, point to the next character.
}
const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
//need to multiply times returned from whisper_full_get_segment_t{0,1}() by 10 to get milliseconds.
fout << 10 * t0 << ", "
<< 10 * t1 << ", \""
<< text << "\"\n";
//need to multiply times returned from whisper_full_get_segment_t{0,1}() by 10 to get milliseconds.
fout << 10 * t0 << ", " << 10 * t1 << ", \"" << text << "\"\n";
}
return true;
}
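For reference, whisper timestamps are in units of 10 ms, so a segment with t0 = 0, t1 = 350 and the hypothetical text " Hello world." is written, after the leading space is stripped, as:

0, 3500, "Hello world."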
// karaoke video generation
// outputs a bash script that uses ffmpeg to generate a video with the subtitles
// TODO: font parameter adjustments
@@ -478,7 +489,7 @@ int main(int argc, char ** argv) {
// whisper init
struct whisper_context * ctx = whisper_init(params.model.c_str());
struct whisper_context * ctx = whisper_init_from_file(params.model.c_str());
if (ctx == nullptr) {
fprintf(stderr, "error: failed to initialize whisper context\n");
@@ -620,6 +631,8 @@ int main(int argc, char ** argv) {
{
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
wparams.strategy = params.beam_size > 1 ? WHISPER_SAMPLING_BEAM_SEARCH : WHISPER_SAMPLING_GREEDY;
wparams.print_realtime = false;
wparams.print_progress = params.print_progress;
wparams.print_timestamps = !params.no_timestamps;
@@ -633,12 +646,18 @@ int main(int argc, char ** argv) {
wparams.token_timestamps = params.output_wts || params.max_len > 0;
wparams.thold_pt = params.word_thold;
wparams.entropy_thold = params.entropy_thold;
wparams.logprob_thold = params.logprob_thold;
wparams.max_len = params.output_wts && params.max_len == 0 ? 60 : params.max_len;
wparams.speed_up = params.speed_up;
wparams.prompt_tokens = prompt_tokens.empty() ? nullptr : prompt_tokens.data();
wparams.prompt_n_tokens = prompt_tokens.empty() ? 0 : prompt_tokens.size();
wparams.greedy.best_of = params.best_of;
wparams.beam_search.beam_size = params.beam_size;
wparams.temperature_inc = -1;
wparams.prompt_tokens = prompt_tokens.empty() ? nullptr : prompt_tokens.data();
wparams.prompt_n_tokens = prompt_tokens.empty() ? 0 : prompt_tokens.size();
whisper_print_user_data user_data = { &params, &pcmf32s };


@@ -49,6 +49,9 @@ void stream_main(size_t index) {
wparams.max_tokens = 32;
wparams.audio_ctx = 768; // partial encoder context for better performance
// disable temperature fallback
wparams.temperature_inc = -1.0f;
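// note: a negative increment disables the fallback entirely, leaving a single decoder at T=0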
wparams.language = "en";
printf("stream: using %d threads\n", wparams.n_threads);
@@ -129,7 +132,7 @@ EMSCRIPTEN_BINDINGS(stream) {
emscripten::function("init", emscripten::optional_override([](const std::string & path_model) {
for (size_t i = 0; i < g_contexts.size(); ++i) {
if (g_contexts[i] == nullptr) {
g_contexts[i] = whisper_init(path_model.c_str());
g_contexts[i] = whisper_init_from_file(path_model.c_str());
if (g_contexts[i] != nullptr) {
g_running = true;
if (g_worker.joinable()) {


@@ -456,10 +456,10 @@ int main(int argc, char ** argv) {
exit(0);
}
struct whisper_context * ctx = whisper_init(params.model.c_str());
struct whisper_context * ctx = whisper_init_from_file(params.model.c_str());
std::vector<float> pcmf32 (n_samples_30s, 0.0f);
std::vector<float> pcmf32_old(n_samples_30s, 0.0f);
std::vector<float> pcmf32_old;
std::vector<float> pcmf32_new(n_samples_30s, 0.0f);
std::vector<whisper_token> prompt_tokens;
@@ -615,6 +615,9 @@ int main(int argc, char ** argv) {
wparams.audio_ctx = params.audio_ctx;
wparams.speed_up = params.speed_up;
// disable temperature fallback
wparams.temperature_inc = -1.0f;
wparams.prompt_tokens = params.no_context ? nullptr : prompt_tokens.data();
wparams.prompt_n_tokens = params.no_context ? 0 : prompt_tokens.size();


@@ -271,7 +271,7 @@ EMSCRIPTEN_BINDINGS(talk) {
emscripten::function("init", emscripten::optional_override([](const std::string & path_model) {
for (size_t i = 0; i < g_contexts.size(); ++i) {
if (g_contexts[i] == nullptr) {
g_contexts[i] = whisper_init(path_model.c_str());
g_contexts[i] = whisper_init_from_file(path_model.c_str());
if (g_contexts[i] != nullptr) {
g_running = true;
if (g_worker.joinable()) {


@@ -498,7 +498,7 @@ int main(int argc, char ** argv) {
// whisper init
struct whisper_context * ctx_wsp = whisper_init(params.model_wsp.c_str());
struct whisper_context * ctx_wsp = whisper_init_from_file(params.model_wsp.c_str());
// gpt init


@@ -64,16 +64,22 @@ class MainScreenViewModel(private val application: Application) : ViewModel() {
private suspend fun copyAssets() = withContext(Dispatchers.IO) {
modelsPath.mkdirs()
samplesPath.mkdirs()
application.copyData("models", modelsPath, ::printMessage)
//application.copyData("models", modelsPath, ::printMessage)
application.copyData("samples", samplesPath, ::printMessage)
printMessage("All data copied to working directory.\n")
}
private suspend fun loadBaseModel() = withContext(Dispatchers.IO) {
printMessage("Loading model...\n")
val firstModel = modelsPath.listFiles()!!.first()
whisperContext = WhisperContext.createContext(firstModel.absolutePath)
printMessage("Loaded model ${firstModel.name}.\n")
val models = application.assets.list("models/")
if (models != null) {
val inputstream = application.assets.open("models/" + models[0])
whisperContext = WhisperContext.createContextFromInputStream(inputstream)
printMessage("Loaded model ${models[0]}.\n")
}
//val firstModel = modelsPath.listFiles()!!.first()
//whisperContext = WhisperContext.createContextFromFile(firstModel.absolutePath)
}
fun transcribeSample() = viewModelScope.launch {


@@ -4,6 +4,7 @@ import android.os.Build
import android.util.Log
import kotlinx.coroutines.*
import java.io.File
import java.io.InputStream
import java.util.concurrent.Executors
private const val LOG_TAG = "LibWhisper"
@@ -39,13 +40,22 @@ class WhisperContext private constructor(private var ptr: Long) {
}
companion object {
fun createContext(filePath: String): WhisperContext {
fun createContextFromFile(filePath: String): WhisperContext {
val ptr = WhisperLib.initContext(filePath)
if (ptr == 0L) {
throw java.lang.RuntimeException("Couldn't create context with path $filePath")
}
return WhisperContext(ptr)
}
fun createContextFromInputStream(stream: InputStream): WhisperContext {
val ptr = WhisperLib.initContextFromInputStream(stream)
if (ptr == 0L) {
throw java.lang.RuntimeException("Couldn't create context from input stream")
}
return WhisperContext(ptr)
}
}
}
@@ -76,6 +86,7 @@ private class WhisperLib {
}
// JNI methods
external fun initContextFromInputStream(inputStream: InputStream): Long
external fun initContext(modelPath: String): Long
external fun freeContext(contextPtr: Long)
external fun fullTranscribe(contextPtr: Long, audioData: FloatArray)


@@ -2,6 +2,7 @@
#include <android/log.h>
#include <stdlib.h>
#include <sys/sysinfo.h>
#include <string.h>
#include "whisper.h"
#define UNUSED(x) (void)(x)
@@ -17,13 +18,86 @@ static inline int max(int a, int b) {
return (a > b) ? a : b;
}
struct input_stream_context {
size_t offset;
JNIEnv * env;
jobject thiz;
jobject input_stream;
jmethodID mid_available;
jmethodID mid_read;
};
size_t inputStreamRead(void * ctx, void * output, size_t read_size) {
struct input_stream_context* is = (struct input_stream_context*)ctx;
jint avail_size = (*is->env)->CallIntMethod(is->env, is->input_stream, is->mid_available);
jint size_to_copy = read_size < avail_size ? (jint)read_size : avail_size;
jbyteArray byte_array = (*is->env)->NewByteArray(is->env, size_to_copy);
jint n_read = (*is->env)->CallIntMethod(is->env, is->input_stream, is->mid_read, byte_array, 0, size_to_copy);
if (size_to_copy != read_size || size_to_copy != n_read) {
LOGI("Insufficient Read: Req=%zu, ToCopy=%d, Available=%d", read_size, size_to_copy, n_read);
}
jbyte* byte_array_elements = (*is->env)->GetByteArrayElements(is->env, byte_array, NULL);
memcpy(output, byte_array_elements, size_to_copy);
(*is->env)->ReleaseByteArrayElements(is->env, byte_array, byte_array_elements, JNI_ABORT);
(*is->env)->DeleteLocalRef(is->env, byte_array);
is->offset += size_to_copy;
return size_to_copy;
}
bool inputStreamEof(void * ctx) {
struct input_stream_context* is = (struct input_stream_context*)ctx;
jint result = (*is->env)->CallIntMethod(is->env, is->input_stream, is->mid_available);
return result <= 0;
}
void inputStreamClose(void * ctx) {
}
JNIEXPORT jlong JNICALL
Java_com_whispercppdemo_whisper_WhisperLib_00024Companion_initContextFromInputStream(
JNIEnv *env, jobject thiz, jobject input_stream) {
UNUSED(thiz);
struct whisper_context *context = NULL;
struct whisper_model_loader loader = {};
struct input_stream_context inp_ctx = {};
inp_ctx.offset = 0;
inp_ctx.env = env;
inp_ctx.thiz = thiz;
inp_ctx.input_stream = input_stream;
jclass cls = (*env)->GetObjectClass(env, input_stream);
inp_ctx.mid_available = (*env)->GetMethodID(env, cls, "available", "()I");
inp_ctx.mid_read = (*env)->GetMethodID(env, cls, "read", "([BII)I");
loader.context = &inp_ctx;
loader.read = inputStreamRead;
loader.eof = inputStreamEof;
loader.close = inputStreamClose;
loader.eof(loader.context);
context = whisper_init(&loader);
return (jlong) context;
}
JNIEXPORT jlong JNICALL
Java_com_whispercppdemo_whisper_WhisperLib_00024Companion_initContext(
JNIEnv *env, jobject thiz, jstring model_path_str) {
UNUSED(thiz);
struct whisper_context *context = NULL;
const char *model_path_chars = (*env)->GetStringUTFChars(env, model_path_str, NULL);
context = whisper_init(model_path_chars);
context = whisper_init_from_file(model_path_chars);
(*env)->ReleaseStringUTFChars(env, model_path_str, model_path_chars);
return (jlong) context;
}


@@ -1,10 +0,0 @@
## This file is automatically generated by Android Studio.
# Do not modify this file -- YOUR CHANGES WILL BE ERASED!
#
# This file should *NOT* be checked into Version Control Systems,
# as it contains information specific to your local configuration.
#
# Location of the SDK. This is only used by Gradle.
# For customization when using a Version Control System, please read the
# header note.
sdk.dir=/Users/kevin/Library/Android/sdk


@@ -61,7 +61,7 @@ void AudioInputCallback(void * inUserData,
NSLog(@"Loading model from %@", modelPath);
// create ggml context
stateInp.ctx = whisper_init([modelPath UTF8String]);
stateInp.ctx = whisper_init_from_file([modelPath UTF8String]);
// check if the model was loaded successfully
if (stateInp.ctx == NULL) {


@@ -10,3 +10,5 @@ To use:
5. Select the "release" build configuration under "Run", then deploy and run to your device.
[^1]: I recommend the tiny, base or small models for running on an iOS device.
![image](https://user-images.githubusercontent.com/1991296/212539216-0aef65e4-f882-480a-8358-0f816838fd52.png)


@@ -55,7 +55,7 @@ actor WhisperContext {
}
static func createContext(path: String) throws -> WhisperContext {
let context = whisper_init(path)
let context = whisper_init_from_file(path)
if let context {
return WhisperContext(context: context)
} else {


@@ -35,10 +35,10 @@
0AAC5DA029539CD0003032C3 /* WhisperCppDemo.entitlements */ = {isa = PBXFileReference; lastKnownFileType = text.plist.entitlements; path = WhisperCppDemo.entitlements; sourceTree = "<group>"; };
0AAC5DA229539CD0003032C3 /* Preview Assets.xcassets */ = {isa = PBXFileReference; lastKnownFileType = folder.assetcatalog; path = "Preview Assets.xcassets"; sourceTree = "<group>"; };
0AAC5DC629539EAF003032C3 /* WhisperCppDemo-Bridging-Header.h */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.c.h; path = "WhisperCppDemo-Bridging-Header.h"; sourceTree = "<group>"; };
0AAC5DC729539EB0003032C3 /* whisper.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; name = whisper.cpp; path = ../../../whisper.cpp; sourceTree = "<group>"; };
0AAC5DC829539EB0003032C3 /* whisper.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = whisper.h; path = ../../../whisper.h; sourceTree = "<group>"; };
0AAC5DC929539EB0003032C3 /* ggml.c */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.c; name = ggml.c; path = ../../../ggml.c; sourceTree = "<group>"; };
0AAC5DCA29539EB0003032C3 /* ggml.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = ggml.h; path = ../../../ggml.h; sourceTree = "<group>"; };
0AAC5DC729539EB0003032C3 /* whisper.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = whisper.cpp; sourceTree = "<group>"; };
0AAC5DC829539EB0003032C3 /* whisper.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = whisper.h; sourceTree = "<group>"; };
0AAC5DC929539EB0003032C3 /* ggml.c */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.c; path = ggml.c; sourceTree = "<group>"; };
0AAC5DCA29539EB0003032C3 /* ggml.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = ggml.h; sourceTree = "<group>"; };
0AAC5DCD2953A05C003032C3 /* WhisperState.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = WhisperState.swift; sourceTree = "<group>"; };
0AAC5DD02953A394003032C3 /* LibWhisper.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = LibWhisper.swift; sourceTree = "<group>"; };
/* End PBXFileReference section */
@@ -129,7 +129,8 @@
0AAC5DC729539EB0003032C3 /* whisper.cpp */,
0AAC5DC829539EB0003032C3 /* whisper.h */,
);
path = whisper.cpp;
name = whisper.cpp;
path = ../..;
sourceTree = "<group>";
};
0AAC5DCF2953A36C003032C3 /* whisper.cpp.swift */ = {


@@ -18,7 +18,7 @@ EMSCRIPTEN_BINDINGS(whisper) {
for (size_t i = 0; i < g_contexts.size(); ++i) {
if (g_contexts[i] == nullptr) {
g_contexts[i] = whisper_init(path_model.c_str());
g_contexts[i] = whisper_init_from_file(path_model.c_str());
if (g_contexts[i] != nullptr) {
return i + 1;
} else {

ggml.c

@@ -84,7 +84,7 @@ typedef void* thread_ret_t;
#define GGML_GELU_FP16
#define GGML_SOFT_MAX_UNROLL 4
#define GGML_VEC_DOT_UNROLL 4
#define GGML_VEC_DOT_UNROLL 2
#ifdef GGML_USE_ACCELERATE
// uncomment to use vDSP for soft max computation
@@ -923,9 +923,9 @@ inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t
inline static void ggml_vec_dot_f16_unroll(const int n, const int xs, float * restrict s, void * restrict xv, ggml_fp16_t * restrict y) {
ggml_float sumf[GGML_VEC_DOT_UNROLL] = { 0.0 };
const ggml_fp16_t * restrict x[GGML_VEC_DOT_UNROLL] = { xv };
ggml_fp16_t * restrict x[GGML_VEC_DOT_UNROLL];
for (int i = 1; i < GGML_VEC_DOT_UNROLL; ++i) {
for (int i = 0; i < GGML_VEC_DOT_UNROLL; ++i) {
x[i] = (ggml_fp16_t *) ((char *) xv + i*xs);
}
@@ -1109,8 +1109,8 @@ inline static void ggml_vec_sum_f32(const int n, float * s, const float * x) {
ggml_float sum = 0.0;
for (int i = 0; i < n; ++i) {
sum += x[i];
*s += sum;
}
*s = sum;
#else
vDSP_sve(x, 1, s, n);
#endif
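Because ggml_vec_sum_f32 now assigns the total (*s = sum) instead of accumulating into *s, the explicit zeroing of the destination in ggml_compute_forward_sum_f32 and ggml_compute_forward_mean_f32 is no longer needed, which is presumably why the 0.0f stores are dropped in the next two hunks.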
@@ -3724,8 +3724,6 @@ static void ggml_compute_forward_sum_f32(
assert(ggml_is_scalar(dst));
assert(src0->nb[0] == sizeof(float));
*(float *) (dst->data) = 0.0f;
const int ne00 = src0->ne[0];
const int ne01 = src0->ne[1];
const int ne02 = src0->ne[2];
@@ -3811,8 +3809,6 @@ static void ggml_compute_forward_mean_f32(
for (int i03 = 0; i03 < ne03; i03++) {
for (int i02 = 0; i02 < ne02; i02++) {
for (int i01 = 0; i01 < ne01; i01++) {
*(float *) ((char *) dst->data + i01*nb1 + i02*nb2 + i03*nb3) = 0.0f;
ggml_vec_sum_f32(ne00,
(float *) ((char *) dst->data + i01*nb1 + i02*nb2 + i03*nb3),
(float *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03));
@@ -4791,7 +4787,7 @@ static void ggml_compute_forward_mul_mat_f16_f32(
}
}
} else {
// parallelize by src1 columns using ggml_vec_mad_f32
// parallelize by src1 columns using ggml_vec_mad_f16
// each thread has its own work data
// during FINALIZE we accumulate all work data into dst
@@ -6158,40 +6154,37 @@ static void ggml_compute_forward_flash_attn_f16(
S[i] = -INFINITY;
}
// looks like unrolling here does not help
#if 1
for (int ic = 0; ic < nek1; ++ic) {
// k indices
const int ik3 = iq3;
const int ik2 = iq2;
const int ik1 = ic;
if (GGML_VEC_DOT_UNROLL > 2 || nek1 % GGML_VEC_DOT_UNROLL != 0) {
for (int ic = 0; ic < nek1; ++ic) {
// k indices
const int ik3 = iq3;
const int ik2 = iq2;
const int ik1 = ic;
// S indices
const int i1 = ik1;
// S indices
const int i1 = ik1;
ggml_vec_dot_f16(neq0,
S + i1,
(ggml_fp16_t *) ((char *) k->data + (ik1*nbk1 + ik2*nbk2 + ik3*nbk3)),
(ggml_fp16_t *) ((char *) q->data + (iq1*nbq1 + iq2*nbq2 + iq3*nbq3)));
ggml_vec_dot_f16(neq0,
S + i1,
(ggml_fp16_t *) ((char *) k->data + (ik1*nbk1 + ik2*nbk2 + ik3*nbk3)),
(ggml_fp16_t *) ((char *) q->data + (iq1*nbq1 + iq2*nbq2 + iq3*nbq3)));
}
} else {
for (int ic = 0; ic < nek1; ic += GGML_VEC_DOT_UNROLL) {
// k indices
const int ik3 = iq3;
const int ik2 = iq2;
const int ik1 = ic;
// S indices
const int i1 = ik1;
ggml_vec_dot_f16_unroll(neq0, nbk1,
S + i1,
((char *) k->data + (ik1*nbk1 + ik2*nbk2 + ik3*nbk3)),
(ggml_fp16_t *) ((char *) q->data + (iq1*nbq1 + iq2*nbq2 + iq3*nbq3)));
}
}
#else
GGML_ASSERT(nek1 % GGML_VEC_DOT_UNROLL == 0);
for (int ic = 0; ic < nek1; ic += GGML_VEC_DOT_UNROLL) {
// k indices
const int ik3 = iq3;
const int ik2 = iq2;
const int ik1 = ic;
// S indices
const int i1 = ik1;
ggml_vec_dot_f16_unroll(neq0, nbk1,
S + i1,
((char *) k->data + (ik1*nbk1 + ik2*nbk2 + ik3*nbk3)),
(ggml_fp16_t *) ((char *) q->data + (iq1*nbq1 + iq2*nbq2 + iq3*nbq3)));
}
#endif
// scale
ggml_vec_scale_f32(nek1, S, scale);
@@ -6261,18 +6254,30 @@ static void ggml_compute_forward_flash_attn_f16(
S16[i] = GGML_FP32_TO_FP16(S[i]);
}
GGML_ASSERT(nev1 % GGML_VEC_DOT_UNROLL == 0);
if (GGML_VEC_DOT_UNROLL == 1 || (nev1 % GGML_VEC_DOT_UNROLL != 0)) {
for (int ic = 0; ic < nev1; ++ic) {
// dst indices
const int i1 = iq1;
const int i2 = iq2;
const int i3 = iq3;
for (int ic = 0; ic < nev1; ic += GGML_VEC_DOT_UNROLL) {
// dst indices
const int i1 = iq1;
const int i2 = iq2;
const int i3 = iq3;
ggml_vec_dot_f16(nek1,
(float *) ((char *) dst->data + (ic*nb0 + i1*nb1 + i2*nb2 + i3*nb3)),
(ggml_fp16_t *) ((char *) v->data + ( ic*nbv1 + i2*nbv2 + i3*nbv3)),
S16);
}
} else {
for (int ic = 0; ic < nev1; ic += GGML_VEC_DOT_UNROLL) {
// dst indices
const int i1 = iq1;
const int i2 = iq2;
const int i3 = iq3;
ggml_vec_dot_f16_unroll(nek1, nbv1,
(float *) ((char *) dst->data + (ic*nb0 + i1*nb1 + i2*nb2 + i3*nb3)),
((char *) v->data + ( ic*nbv1 + i2*nbv2 + i3*nbv3)),
S16);
ggml_vec_dot_f16_unroll(nek1, nbv1,
(float *) ((char *) dst->data + (ic*nb0 + i1*nb1 + i2*nb2 + i3*nb3)),
((char *) v->data + ( ic*nbv1 + i2*nbv2 + i3*nbv3)),
S16);
}
}
}
}
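Net effect of the two flash-attention hunks above: the compile-time #if / GGML_ASSERT gating is replaced by a runtime divisibility check, so when the row count is not a multiple of GGML_VEC_DOT_UNROLL (now 2) the code falls back to the scalar ggml_vec_dot_f16 loop instead of asserting.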

File diff suppressed because it is too large


@@ -1,6 +1,7 @@
#ifndef WHISPER_H
#define WHISPER_H
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
@@ -40,7 +41,7 @@ extern "C" {
//
// ...
//
// struct whisper_context * ctx = whisper_init("/path/to/ggml-base.en.bin");
// struct whisper_context * ctx = whisper_init_from_file("/path/to/ggml-base.en.bin");
//
// if (whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) != 0) {
// fprintf(stderr, "failed to process audio\n");
@@ -73,6 +74,7 @@ extern "C" {
whisper_token tid; // forced timestamp token id
float p; // probability of the token
float plog; // log probability of the token
float pt; // probability of the timestamp token
float ptsum; // sum of probabilities of all timestamp tokens
@@ -84,9 +86,20 @@ extern "C" {
float vlen; // voice length of the token
} whisper_token_data;
// Allocates all memory needed for the model and loads the model from the given file.
// Returns NULL on failure.
WHISPER_API struct whisper_context * whisper_init(const char * path_model);
typedef struct whisper_model_loader {
void * context;
size_t (*read)(void * ctx, void * output, size_t read_size);
bool (*eof)(void * ctx);
void (*close)(void * ctx);
} whisper_model_loader;
// Various functions for loading a ggml whisper model.
// Allocate (almost) all memory needed for the model.
// Return NULL on failure
WHISPER_API struct whisper_context * whisper_init_from_file(const char * path_model);
WHISPER_API struct whisper_context * whisper_init_from_buffer(void * buffer, size_t buffer_size);
WHISPER_API struct whisper_context * whisper_init(struct whisper_model_loader * loader);
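As a minimal sketch of the new loader API (the model path and the minimal error handling are assumptions, not part of this diff), whisper_init() can be driven from a plain FILE*:

#include <cstdio>
#include "whisper.h"

static size_t file_read (void * ctx, void * output, size_t read_size) {
    return fread(output, 1, read_size, (FILE *) ctx); // hand the next chunk of the model to whisper
}
static bool file_eof (void * ctx) {
    return feof((FILE *) ctx) != 0; // true once the model file is exhausted
}
static void file_close(void * ctx) {
    fclose((FILE *) ctx); // expected to be invoked by the loader machinery when loading finishes
}

int main() {
    FILE * f = fopen("ggml-base.en.bin", "rb"); // hypothetical model path
    if (f == NULL) {
        return 1;
    }
    whisper_model_loader loader = { f, file_read, file_eof, file_close };
    struct whisper_context * ctx = whisper_init(&loader);
    // ... run whisper_full() as usual, then:
    whisper_free(ctx);
    return 0;
}

The Android JNI changes above implement exactly this interface on top of a java.io.InputStream.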
// Frees all memory allocated by the model.
WHISPER_API void whisper_free(struct whisper_context * ctx);
@@ -124,6 +137,7 @@ extern "C" {
// tokens + n_tokens is the provided context for the decoder.
// n_past is the number of tokens to use from previous decoder calls.
// Returns 0 on success
// TODO: add support for multiple decoders
WHISPER_API int whisper_decode(
struct whisper_context * ctx,
const whisper_token * tokens,
@@ -131,14 +145,6 @@ extern "C" {
int n_past,
int n_threads);
// Token sampling methods.
// These are provided for convenience and can be used after each call to whisper_decode().
// You can also implement your own sampling method using the whisper_get_probs() function.
// whisper_sample_best() returns the token with the highest probability
// whisper_sample_timestamp() returns the most probable timestamp token
WHISPER_API whisper_token_data whisper_sample_best(struct whisper_context * ctx);
WHISPER_API whisper_token_data whisper_sample_timestamp(struct whisper_context * ctx, bool is_initial);
// Convert the provided text into tokens.
// The tokens pointer must be large enough to hold the resulting tokens.
// Returns the number of tokens on success, no more than n_max_tokens
@@ -180,8 +186,11 @@ extern "C" {
WHISPER_API int whisper_n_audio_ctx (struct whisper_context * ctx);
WHISPER_API int whisper_is_multilingual(struct whisper_context * ctx);
// The probabilities for the next token
WHISPER_API float * whisper_get_probs(struct whisper_context * ctx);
// Token logits obtained from the last call to whisper_decode()
// The logits for the last token are stored in the last row
// Rows: n_tokens
// Cols: n_vocab
WHISPER_API float * whisper_get_logits(struct whisper_context * ctx);
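A short sketch of greedy sampling on top of the logits API (assumes a preceding whisper_decode() call that evaluated n_tokens tokens; whisper_n_vocab() is part of the existing API):

const int     n_vocab = whisper_n_vocab(ctx);
const float * logits  = whisper_get_logits(ctx) + (n_tokens - 1)*n_vocab; // last row

whisper_token best = 0;
for (whisper_token t = 1; t < n_vocab; ++t) {
    if (logits[t] > logits[best]) {
        best = t; // the highest logit is the most probable next token
    }
}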
// Token Id -> String. Uses the vocabulary in the provided context
WHISPER_API const char * whisper_token_to_str(struct whisper_context * ctx, whisper_token token);
@@ -210,8 +219,8 @@ extern "C" {
// Available sampling strategies
enum whisper_sampling_strategy {
WHISPER_SAMPLING_GREEDY, // Always select the most probable token
WHISPER_SAMPLING_BEAM_SEARCH, // TODO: not implemented yet!
WHISPER_SAMPLING_GREEDY, // similar to OpenAI's GreedyDecoder
WHISPER_SAMPLING_BEAM_SEARCH, // similar to OpenAI's BeamSearchDecoder
};
// Text segment callback
@@ -231,17 +240,17 @@ extern "C" {
enum whisper_sampling_strategy strategy;
int n_threads;
int n_max_text_ctx;
int n_max_text_ctx; // max tokens to use from past text as prompt for the decoder
int offset_ms; // start offset in ms
int duration_ms; // audio duration to process in ms
bool translate;
bool no_context;
bool no_context; // do not use initial prompt for the decoder (if any)
bool single_segment; // force single segment output (useful for streaming)
bool print_special;
bool print_progress;
bool print_realtime;
bool print_timestamps;
bool print_special; // print special tokens (e.g. <SOT>, <EOT>, <BEG>, etc.)
bool print_progress; // print progress information
bool print_realtime; // print results from within whisper.cpp (avoid it, use callback instead)
bool print_timestamps; // print timestamps for each text segment when printing realtime
// [EXPERIMENTAL] token-level timestamps
bool token_timestamps; // enable token-level timestamps
@@ -251,10 +260,11 @@ extern "C" {
int max_tokens; // max tokens per segment (0 = no limit)
// [EXPERIMENTAL] speed-up techniques
// note: these can significantly reduce the quality of the output
bool speed_up; // speed-up the audio by 2x using Phase Vocoder
int audio_ctx; // overwrite the audio context size (0 = use default)
// tokens to provide the whisper model as initial prompt
// tokens to provide to the whisper decoder as initial prompt
// these are prepended to any existing text context from a previous call
const whisper_token * prompt_tokens;
int prompt_n_tokens;
@@ -262,19 +272,35 @@ extern "C" {
// for auto-detection, set to nullptr, "" or "auto"
const char * language;
// common decoding parameters:
bool suppress_blank; // ref: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/decoding.py#L89
float temperature; // initial decoding temperature, ref: https://ai.stackexchange.com/a/32478
float max_initial_ts; // ref: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/decoding.py#L97
float length_penalty; // ref: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/transcribe.py#L267
// fallback parameters
// ref: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/transcribe.py#L274-L278
float temperature_inc;
float entropy_thold; // similar to OpenAI's "compression_ratio_threshold"
float logprob_thold;
float no_speech_thold; // TODO: not implemented
struct {
int n_past;
int best_of; // ref: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/transcribe.py#L264
} greedy;
struct {
int n_past;
int beam_width;
int n_best;
int beam_size; // ref: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/transcribe.py#L265
float patience; // TODO: not implemented, ref: https://arxiv.org/pdf/2204.05424.pdf
} beam_search;
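A hedged sketch of filling in these fields; the numeric values are illustrative assumptions, not defaults taken from this diff:

whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);

wparams.beam_search.beam_size = 5;     // keep 5 candidate sequences per step
wparams.temperature_inc       = 0.2f;  // retry at a higher temperature when decoding fails
wparams.entropy_thold         = 2.4f;  // fail on highly repetitive output
wparams.logprob_thold         = -1.0f; // fail on low average token log probability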
// called for every newly generated text segment
whisper_new_segment_callback new_segment_callback;
void * new_segment_callback_user_data;
// called each time before the encoder starts
whisper_encoder_begin_callback encoder_begin_callback;
void * encoder_begin_callback_user_data;
};