Hey r/MachineLearning!
At Hugging Face, we’ve spent the last few months building a fast yet powerful distilled version of Whisper, and we’re excited to finally share it with you!
Distil-Whisper is 6x faster than Whisper-large-v2 and performs within 1% WER of it on out-of-distribution data. On long-form audio it even does better, thanks to a reduction in hallucinations.
For more information, please have a look at:
- GitHub page: https://github.com/huggingface/distil-whisper/tree/main
- Paper: https://github.com/huggingface/distil-whisper/blob/main/Distil_Whisper.pdf
Quick summary:
- Distillation Process
We keep the whole encoder but reduce the decoder to just 2 layers. Encoding takes O(1) forward passes, while autoregressive decoding takes O(N), one pass per generated token, so the decoder is where all the speed-up comes from. The encoder is copied from the teacher and frozen during distillation, while the full decoder is fine-tuned. The training objective combines a KL divergence loss against the teacher's distribution with a cross-entropy loss on pseudo-labelled next-token prediction (see the sketch below).
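To make the objective concrete, here is a minimal PyTorch sketch of such a combined distillation loss. The loss weights, temperature, and helper names are illustrative assumptions, not the exact training code:

```python
# Minimal sketch of a KL + cross-entropy distillation objective (illustrative, not the authors' exact code).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_labels,
                      kl_weight=1.0, ce_weight=1.0, temperature=1.0):
    """Match the teacher's next-token distribution (KL term) and fit the
    teacher-generated pseudo-label transcriptions (cross-entropy term)."""
    # KL divergence between student and teacher next-token distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Cross-entropy against the pseudo-labelled token ids
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        pseudo_labels.view(-1),
        ignore_index=-100,  # standard padding/ignore index in Transformers
    )
    return kl_weight * kl + ce_weight * ce

# The encoder is copied from the teacher and kept frozen; only the 2-layer
# decoder receives gradient updates, e.g. (assumed attribute path):
# for p in student.model.encoder.parameters():
#     p.requires_grad = False
```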
- Data
We use 20,000h of open-source audio from 9 diverse datasets. A WER filter compares the teacher's pseudo-labels against the ground-truth transcriptions and throws out low-quality training examples (sketch below).
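As an illustration, such a WER filter could look like the sketch below. The 10% threshold, the lower-casing, and the `keep_sample` helper are assumptions for the example, not the exact recipe:

```python
# Illustrative WER-based filtering of pseudo-labelled training data.
import jiwer

WER_THRESHOLD = 0.10  # assumed cutoff: discard samples whose pseudo-label WER exceeds 10%

def keep_sample(ground_truth: str, pseudo_label: str) -> bool:
    """Keep a training example only if the teacher's pseudo-label is close
    enough to the dataset's ground-truth transcription."""
    wer = jiwer.wer(ground_truth.lower(), pseudo_label.lower())
    return wer <= WER_THRESHOLD

# dataset: iterable of (audio, ground_truth_text, pseudo_label_text) tuples
filtered = [
    (audio, pseudo)
    for audio, truth, pseudo in dataset
    if keep_sample(truth, pseudo)
]
```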
- Results
We evaluate the model exclusively on out-of-distribution datasets. On short-form evals (CHiME-4, Earnings-22, FLEURS, SPGISpeech) we are within 1% WER of Whisper-large-v2, and on long-form evals (Earnings, Meanwhile, Rev 16) we beat Whisper-large-v2 thanks to fewer hallucinations.
- Robust to noise
Distil-Whisper is very robust to noise (similar to its teacher). We credit this to keeping the original encoder frozen during training.
- Pushing for maximum inference speed
Distil-Whisper is 6x faster than Whisper-large-v2 on both short-form and long-form audio. On top of that, Flash Attention and chunked decoding help us reach a real-time factor of 0.01 (see the sketch below).
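For reference, chunked long-form inference with the Transformers ASR pipeline could look roughly like this once the checkpoints are out. The checkpoint name is a placeholder, and the Flash Attention argument depends on your Transformers version and requires the `flash-attn` package:

```python
# Sketch of chunked long-form inference with the Transformers ASR pipeline.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "distil-whisper/distil-large-v2"  # placeholder, assumed repo name

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # argument name may vary by Transformers version
)
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
    chunk_length_s=30,   # split long audio into 30 s chunks and batch them
    batch_size=16,
    device="cuda:0",
)

result = asr("long_audio.wav")
print(result["text"])
```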
- Checkpoints?!
Checkpoints will be released this Thursday and will be directly integrated into Transformers. All checkpoints will be licensed under MIT.