WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Resource type

Authors/contributors

Bain, Max (Author)
Huh, Jaesung (Author)
Han, Tengda (Author)
Zisserman, Andrew (Author)

Title

WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Abstract

Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination and repetition, and prohibits batched transcription due to their sequential nature. Further, timestamps corresponding to each utterance are prone to inaccuracies and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut and Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference.

Repository

arXiv

Date

2023

DOI

10.48550/ARXIV.2303.00747

Citation Key

bain.etal_2023

URL

https://arxiv.org/abs/2303.00747

Accessed

17/04/2025, 13:25

Short Title

WhisperX

Library Catalog

DOI.org (Datacite)

License

Creative Commons Attribution 4.0 International

Extra

Version Number: 2

Notes

Other

Accepted to INTERSPEECH 2023

Citation

Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. arXiv. https://doi.org/10.48550/ARXIV.2303.00747

Link to this record

https://catalogue.yorvoice.york.ac.uk/catalogue/U5R4T79U
