whispercpp
==========
![whisper.cpp](https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg)
Ruby bindings for [whisper.cpp][], an interface to an automatic speech recognition model.
Installation
------------
Install the gem and add it to the application's Gemfile by executing:
$ bundle add whispercpp
If bundler is not being used to manage dependencies, install the gem by executing:
$ gem install whispercpp
Usage
-----
```ruby
require "whisper"

whisper = Whisper::Context.new(Whisper::Model["base"])

params = Whisper::Params.new
params.language = "en"
params.offset = 10_000     # start offset, in milliseconds
params.duration = 60_000   # length of audio to process, in milliseconds
params.max_text_tokens = 300
params.translate = true    # translate the result into English
params.print_timestamps = false
params.initial_prompt = "Initial prompt here."

whisper.transcribe("path/to/audio.wav", params) do |whole_text|
  puts whole_text
end
```
### Preparing model ###
Some models are prepared up-front:
```ruby
base_en = Whisper::Model["base.en"]
whisper = Whisper::Context.new(base_en)
```
The first time you use a model, it is downloaded automatically. After that, the cached file is used. To clear the cache, call `#clear_cache`:
```ruby
Whisper::Model["base"].clear_cache
```
You can list the prepared model names with `Whisper::Model.preconverted_model_names`:
```ruby
puts Whisper::Model.preconverted_model_names
# tiny
# tiny.en
# tiny-q5_1
# tiny.en-q5_1
# tiny-q8_0
# base
# base.en
# base-q5_1
# base.en-q5_1
# base-q8_0
# :
# :
```
You can also use local model files you prepared:
```ruby
whisper = Whisper::Context.new("path/to/your/model.bin")
```
Or you can have a model file downloaded from a URI:
```ruby
model_uri = Whisper::Model::URI.new("http://example.net/uri/of/your/model.bin")
whisper = Whisper::Context.new(model_uri)
```
See the [models][] page for details.
### Preparing audio file ###
Currently, whisper.cpp accepts only 16-bit WAV files.
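As a rough pre-flight check for the 16-bit requirement above, the following sketch reads the `fmt ` chunk of a RIFF/WAVE header in plain Ruby. The helper names `wav_format` and `pcm16?` are hypothetical and are not part of whispercpp; the check covers only the sample format, not anything else whisper.cpp may require.

```ruby
require "stringio"

# Hypothetical helper: reads a RIFF/WAVE header from an IO and returns the
# fields of the "fmt " chunk, or nil if the stream is not a WAVE file.
def wav_format(io)
  riff, _size, wave = io.read(12)&.unpack("a4Va4")
  return nil unless riff == "RIFF" && wave == "WAVE"

  # Walk the chunks until the "fmt " chunk is found.
  while (header = io.read(8)) && header.bytesize == 8
    id, size = header.unpack("a4V")
    if id == "fmt "
      body = io.read(16)
      return nil unless body && body.bytesize == 16
      audio_format, channels, sample_rate, _byte_rate, _block_align, bits =
        body.unpack("vvVVvv")
      return { audio_format: audio_format, channels: channels,
               sample_rate: sample_rate, bits_per_sample: bits }
    end
    io.seek(size + (size & 1), IO::SEEK_CUR) # chunks are padded to even sizes
  end
  nil
end

# Hypothetical helper: true for uncompressed PCM (format tag 1), 16 bits per sample.
def pcm16?(info)
  !info.nil? && info[:audio_format] == 1 && info[:bits_per_sample] == 16
end

# Demo with an in-memory header: mono, 16 kHz, 16-bit PCM, empty data chunk.
header = ["RIFF", 36, "WAVE", "fmt ", 16, 1, 1, 16_000, 32_000, 2, 16, "data", 0]
         .pack("a4Va4a4VvvVVvva4V")
info = wav_format(StringIO.new(header))
puts pcm16?(info)
```

With a real file, the same check would be `File.open("path/to/audio.wav", "rb") { |f| pcm16?(wav_format(f)) }`.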
API
---
### Segments ###
Once `Whisper::Context#transcribe` has been called, you can retrieve segments with `#each_segment`:
```ruby
def format_time(time_ms)
  sec, decimal_part = time_ms.divmod(1000)
  min, sec = sec.divmod(60)
  hour, min = min.divmod(60)
  "%02d:%02d:%02d.%03d" % [hour, min, sec, decimal_part]
end

whisper.transcribe("path/to/audio.wav", params)

whisper.each_segment.with_index do |segment, index|
  line = "[%{nth}: %{st} --> %{ed}] %{text}" % {
    nth: index + 1,
    st: format_time(segment.start_time),
    ed: format_time(segment.end_time),
    text: segment.text
  }
  line << " (speaker turned)" if segment.speaker_next_turn?
  puts line
end
```
You can also add a hook to params that is called on each new segment:
```ruby
def format_time(time_ms)
  sec, decimal_part = time_ms.divmod(1000)
  min, sec = sec.divmod(60)
  hour, min = min.divmod(60)
  "%02d:%02d:%02d.%03d" % [hour, min, sec, decimal_part]
end

# Add hook before calling #transcribe
params.on_new_segment do |segment|
  line = "[%{st} --> %{ed}] %{text}" % {
    st: format_time(segment.start_time),
    ed: format_time(segment.end_time),
    text: segment.text
  }
  line << " (speaker turned)" if segment.speaker_next_turn?
  puts line
end

whisper.transcribe("path/to/audio.wav", params)
```
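The segment readers used above (`start_time`, `end_time`, `text`) are enough to produce subtitle files. The following is a minimal sketch that builds SubRip (SRT) text; `Segment` is a stand-in Struct so the example runs without a model, and `srt_time` and `to_srt` are hypothetical helpers, not part of whispercpp.

```ruby
# Stand-in for the segment objects yielded by the binding. Only the three
# readers used in the examples above are assumed (times in milliseconds).
Segment = Struct.new(:start_time, :end_time, :text)

# Formats milliseconds as an SRT timestamp (HH:MM:SS,mmm).
def srt_time(time_ms)
  sec, ms = time_ms.divmod(1000)
  min, sec = sec.divmod(60)
  hour, min = min.divmod(60)
  format("%02d:%02d:%02d,%03d", hour, min, sec, ms)
end

# Builds SRT text from any enumerable of segment-like objects.
def to_srt(segments)
  segments.each_with_index.map { |segment, index|
    "#{index + 1}\n" \
    "#{srt_time(segment.start_time)} --> #{srt_time(segment.end_time)}\n" \
    "#{segment.text.strip}\n"
  }.join("\n")
end

segments = [
  Segment.new(0, 2_500, " Hello."),
  Segment.new(2_500, 3_661_042, " Goodbye.")
]
puts to_srt(segments)
```

After a real `#transcribe` call, passing `whisper.each_segment` (which returns an enumerator when called without a block, as in the first example) in place of the stand-in array should work the same way.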
### Models ###
You can see model information:
```ruby
whisper = Whisper::Context.new(Whisper::Model["base"])
model = whisper.model
model.n_vocab # => 51864
model.n_audio_ctx # => 1500
model.n_audio_state # => 512
model.n_audio_head # => 8
model.n_audio_layer # => 6
model.n_text_ctx # => 448
model.n_text_state # => 512
model.n_text_head # => 8
model.n_text_layer # => 6
model.n_mels # => 80
model.ftype # => 1
model.type # => "base"
```
### Logging ###
You can set a log callback:
```ruby
prefix = "[MyApp] "
log_callback = ->(level, buffer, user_data) {
  case level
  when Whisper::LOG_LEVEL_NONE
    puts "#{user_data}none: #{buffer}"
  when Whisper::LOG_LEVEL_INFO
    puts "#{user_data}info: #{buffer}"
  when Whisper::LOG_LEVEL_WARN
    puts "#{user_data}warn: #{buffer}"
  when Whisper::LOG_LEVEL_ERROR
    puts "#{user_data}error: #{buffer}"
  when Whisper::LOG_LEVEL_DEBUG
    puts "#{user_data}debug: #{buffer}"
  when Whisper::LOG_LEVEL_CONT
    puts "#{user_data}same as previous: #{buffer}"
  end
}
Whisper.log_set log_callback, prefix
```
Using this feature, you can also suppress logging:
```ruby
Whisper.log_set ->(level, buffer, user_data) {
  # do nothing
}, nil
Whisper::Context.new(MODEL)
```
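Between printing and suppressing, a third option is to capture log output for later inspection. The sketch below assumes that the object given as the second argument to `Whisper.log_set` is handed back unchanged as `user_data`, as the prefix example above suggests. The names `capture_log` and `log_buffer` are illustrative, and the callback is invoked by hand here so the example runs without the gem.

```ruby
require "stringio"

# Appends every log message to the object passed as user_data.
# The (level, buffer, user_data) signature follows the examples above.
capture_log = ->(_level, buffer, user_data) {
  user_data << buffer
}

log_buffer = StringIO.new

# Register the callback only when the binding is loaded, so this sketch
# also runs on its own.
if defined?(Whisper) && Whisper.respond_to?(:log_set)
  Whisper.log_set capture_log, log_buffer
end

# Simulate two callback invocations to show what ends up in the buffer.
capture_log.call(nil, "loading model\n", log_buffer)
capture_log.call(nil, "done\n", log_buffer)
puts log_buffer.string
```

Because the callback only relies on `<<`, an `Array` or an open `File` could be passed as `user_data` instead of a `StringIO`.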
### Low-level API to transcribe ###
You can also call `Whisper::Context#full` and `#full_parallel` with a Ruby array as samples. `#transcribe` with an audio file path is recommended because it extracts PCM samples in C++ and is fast, but `#full` and `#full_parallel` give you more flexibility.
```ruby
require "whisper"
require "wavefile"
reader = WaveFile::Reader.new("path/to/audio.wav", WaveFile::Format.new(:mono, :float, 16000))
samples = reader.enum_for(:each_buffer).map(&:samples).flatten

whisper = Whisper::Context.new(Whisper::Model["base"])
whisper.full(Whisper::Params.new, samples)
whisper.each_segment do |segment|
  puts segment.text
end
```
The second argument `samples` may be an array, an object with a `length` method, or a MemoryView. If you can prepare audio data as a C array and export it as a MemoryView, whispercpp accepts it and works with it without copying.
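If the audio is already in memory as raw 16-bit PCM rather than in a WAV file, it needs to be turned into an array first. The following sketch assumes that `#full` expects mono floating-point samples, as the `WaveFile::Format.new(:mono, :float, 16000)` example above implies. Dividing by 32768 is a common normalization convention, not something this README specifies, and `pcm16_to_float` is a hypothetical helper.

```ruby
# Converts raw little-endian signed 16-bit PCM bytes into Float samples
# in the range [-1.0, 1.0).
def pcm16_to_float(bytes)
  bytes.unpack("s<*").map { |sample| sample / 32_768.0 }
end

# Four example samples packed as little-endian 16-bit integers.
raw = [0, 16_384, -32_768, 32_767].pack("s<*")
samples = pcm16_to_float(raw)
p samples.first(3)
```

The resulting array can then be passed as the second argument, as in `whisper.full(Whisper::Params.new, samples)`.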
License
-------
The same as [whisper.cpp][].
[whisper.cpp]: https://github.com/ggerganov/whisper.cpp
[models]: https://github.com/ggerganov/whisper.cpp/tree/master/models