Hey r/MachineLearning!

At Hugging Face, we’ve worked hard the last months to create a powerful, but fast distilled version of Whisper. We’re excited to share our work with you now!

Distil-Whisper is 6x faster than Whisper-large-v2 and performs within 1% WER on out-of-distribution datasets. On long-form audio, we even achieve better results thanks to a reduction in hallucinations.

For more information, please have a look:

- GitHub page: https://github.com/huggingface/distil-whisper/tree/main

- Paper: https://github.com/huggingface/distil-whisper/blob/main/Distil_Whisper.pdf

Quick summary:

  1. Distillation Process

We’ve kept the whole encoder, but reduced the decoder to just 2 layers. Encoding takes O(1) forward passes, decoding takes O(N). To improve speed, all that matters is the decoder! The encoder is frozen during distillation while we fine-tune all of the decoder. Both KL loss and pseudo-labeling next word prediction is used.

  1. Data

We use 20,000h of open-sourced audio data coming from 9 diverse audio datasets. A WER-filter is used to make sure low-quality training data is thrown out.

  1. Results

We’ve evaluated the model only on out-of-distribution datasets and are only 1% worse than Whisper-large-v2 on short-form evals (CHiME-4, Earnings-22, FLEURS, SPGISpeech). On long-form evals (Earnings, Meanwhile, Rev 16) we beat Whisper-large-v2 thanks to a reduction in hallucinations.

  1. Robust to noise

Distil-Whisper is very robust to noise (similar to its teacher). We credit this to keeping the original encoder frozen during training.

  1. Pushing for max inference time

Distil-Whisper is 6x faster than Whisper on both short-form and long-form audio. In addition, we employ Flash Attention and chunked decoding which helps us achieve a real-time factor of 0.01!

  1. Checkpoints?!

Checkpoints will be released this Thursday and will be directly integrated into Transformers. All checkpoints will be licensed under MIT.