whisper.cpp/tests/earnings21/eval.py
Fujimoto Seiji b9d27b1358 tests : add a new benchmark test for long-form audio (#3185)
* tests : add a new benchmark test for long-form audio

Based on "Earnings-21" corpus by Del Rio et al.

    Earnings-21: A Practical Benchmark for ASR in the Wild (2021)
    https://arxiv.org/abs/2104.11348

This dataset contains 39 hours of long-form speech, sourced from public
earnings calls. Each recording contains roughly 50 minutes of English
dialogue between multiple speakers (2-20 persons).

This benchmark suite should allow us to evaluate the performance of
whisper.cpp on long-form audio data.

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>

* tests : apply PR feedback to 'earnings21/README.md'

Based on feedback from Daniel Bevenius.

 - Simplify how to download & prepare a Silero VAD model.
 - Fix typo: inferece -> inference

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>

* tests : avoid crashing on non-UTF-8 characters

Based on feedback from Daniel Bevenius.

Add 'errors' parameter to open() in order to avoid unhandled
exception on invalid UTF-8 bytes.

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>

* tests : try to interpret the hypothesis as Windows-1252

Based on the discussion in PR#3185.

Evidently whisper.cpp can represent a quotation mark as '0x93', which
implies Windows-1252 (Microsoft's ASCII extension), and cannot be
decoded as UTF-8.

Add an explicit decoding loop to address the issue.
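
A minimal sketch of the problem described above. The byte string is an
illustrative assumption, not actual whisper.cpp output; it only shows that
a lone 0x93 byte is rejected by the UTF-8 decoder but maps to a left double
quotation mark under Windows-1252.

```python
# Hypothetical sample: 0x93 / 0x94 are the curly double quotes in Windows-1252.
raw = b"He said \x93hello\x94"

# 0x93 is a UTF-8 continuation byte, so it cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The same bytes decode cleanly as Windows-1252.
text = raw.decode("windows-1252")

print(utf8_ok)
print(text)
```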

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>

---------

Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>
2025-05-28 07:08:44 +02:00

69 lines
1.9 KiB
Python

import os
import sys
import glob
import jiwer
from normalizers import EnglishTextNormalizer

def decode_hypothesis(b):
    try:
        # Depending on platforms, Whisper can emit a left double quotation
        # mark (0x93), which is Microsoft's extension to ASCII. See #3185
        # for the background.
        return b.decode('windows-1252')
    except UnicodeDecodeError:
        return b.decode('utf-8', errors='ignore')

def get_reference():
    # Map each recording code to its reference transcript.
    ref = {}
    for path in glob.glob("speech-datasets/earnings21/transcripts/nlp_references/*.nlp"):
        code = os.path.basename(path).replace(".nlp", "")
        buf = []
        with open(path) as fp:
            fp.readline()  # Skip the header line.
            for line in fp:
                # The token is the first '|'-separated field.
                token = line.split("|", maxsplit=1)[0]
                buf.append(token)
            ref[code] = " ".join(buf)
    return ref

def get_hypothesis():
    # Map each recording code to the transcript produced by whisper.cpp.
    hyp = {}
    for path in glob.glob("speech-datasets/earnings21/media/*.mp3.txt"):
        with open(path, 'rb') as fp:
            text = decode_hypothesis(fp.read()).strip()
        code = os.path.basename(path).replace(".mp3.txt", "")
        hyp[code] = text
    return hyp

def get_codes(metadata_csv):
    # Collect the first column of the metadata CSV, skipping the header.
    codes = []
    with open(metadata_csv) as fp:
        fp.readline()
        for line in fp:
            codes.append(line.split(",")[0])
    return sorted(codes)

def main():
    if len(sys.argv) < 2:
        print("Usage: %s METADATA_CSV" % sys.argv[0], file=sys.stderr)
        return 1

    metadata_csv = sys.argv[1]
    normalizer = EnglishTextNormalizer()

    ref_orig = get_reference()
    hyp_orig = get_hypothesis()

    ref_clean = []
    hyp_clean = []

    for code in get_codes(metadata_csv):
        ref_clean.append(normalizer(ref_orig[code]))
        hyp_clean.append(normalizer(hyp_orig[code]))

    wer = jiwer.wer(ref_clean, hyp_clean)
    print(f"WER: {wer * 100:.2f}%")

if __name__ == "__main__":
    # Propagate the status returned by main() (1 on a usage error).
    sys.exit(main())
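
For readers unfamiliar with the metric printed by the script, the following is a minimal, standard-library-only sketch of word error rate (WER) for a single reference/hypothesis pair: the word-level edit distance divided by the number of reference words. The function name `word_error_rate` is hypothetical, and this is not jiwer's implementation; jiwer may differ in details such as how it handles lists of sentences.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One word of four is missing from the hypothesis.
print(f"WER: {word_error_rate('the quick brown fox', 'the quick fox') * 100:.2f}%")
```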