Introducing Distil-Whisper: 6x faster than Whisper, while performing within 1% WER of Whisper on out-of-distribution test data.
Through careful data selection and filtering, Whisper’s robustness to noise is maintained and hallucinations reduced.
For more information, refer to:
- 👨‍💻 The GitHub repo: https://github.com/huggingface/distil-whisper
- 📚 The official paper: https://arxiv.org/abs/2311.00430
Here’s a quick overview of how it works:
1. Distillation
The Whisper encoder performs a single forward pass per audio input, while the decoder performs one forward pass per generated token. As a result, the decoder accounts for >90% of total inference time, so reducing the number of decoder layers is far more effective than reducing encoder layers.
With this in mind, we keep the entire encoder but only 2 decoder layers, making the resulting model 6x faster. The model is trained with a weighted distillation loss while the encoder is kept frozen 🔒. This ensures Distil-Whisper inherits Whisper’s robustness to noise and different audio distributions.
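To make this concrete, here is a minimal sketch of the student initialisation and the weighted loss, assuming the openai/whisper-large-v2 teacher and the Transformers WhisperForConditionalGeneration API. The loss weights, temperature and layer-copying details are illustrative; the actual training code lives in the Distil-Whisper repo.

```python
import torch
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Teacher: the original Whisper checkpoint
teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Student: identical config except for 2 decoder layers
student_config = WhisperConfig.from_pretrained("openai/whisper-large-v2", decoder_layers=2)
student = WhisperForConditionalGeneration(student_config)

# Copy the full encoder from the teacher and freeze it 🔒
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
for param in student.model.encoder.parameters():
    param.requires_grad = False

# Initialise the student's 2 decoder layers from the teacher's first and last decoder layers
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha_ce=1.0, alpha_kl=1.0, temperature=2.0):
    """Weighted sum of cross-entropy on the pseudo-labels and KL divergence to the teacher.
    The weights and temperature here are illustrative, not the paper's exact values."""
    ce = torch.nn.functional.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    kl = torch.nn.functional.kl_div(
        torch.log_softmax(student_logits / temperature, dim=-1),
        torch.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha_ce * ce + alpha_kl * kl
```

Only the decoder parameters receive gradients, which is what lets the distilled model keep the teacher's acoustic representations intact.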
2. Data
Distil-Whisper is trained on a diverse corpus of 22,000 hours of audio from 9 open-source datasets with permissive licenses. The training labels are pseudo-labels generated by Whisper. Importantly, a WER filter is applied: pseudo-labels that deviate from the ground-truth transcription by more than 10% WER are discarded. This filtering step is key to maintaining performance! 🔑
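A rough sketch of that filter is shown below, assuming the jiwer library for WER and illustrative column names ("text" for the human reference, "whisper_transcript" for the pseudo-label); the real pipeline applies the filter to normalised transcriptions.

```python
import jiwer
from datasets import Dataset

WER_THRESHOLD = 0.10  # discard pseudo-labels that deviate >10% WER from the ground truth

def keep_example(example: dict) -> bool:
    # Column names are assumptions for this sketch
    return jiwer.wer(example["text"], example["whisper_transcript"]) <= WER_THRESHOLD

# Toy dataset standing in for the 22,000-hour pseudo-labelled corpus
dataset = Dataset.from_dict({
    "text": ["the cat sat on the mat", "hello world"],
    "whisper_transcript": ["the cat sat on the mat", "jello swirled"],
})
filtered = dataset.filter(keep_example)  # keeps only the accurately pseudo-labelled example
```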
3. Results
Distil-Whisper is 6x faster than Whisper, while sacrificing only 1% WER on short-form evaluation. On long-form evaluation, Distil-Whisper even beats Whisper. We show that this is because Distil-Whisper hallucinates less.
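The speed-up is easy to sanity-check yourself. Below is a quick-and-dirty latency sketch, not the paper's benchmark protocol: the synthetic audio, forced token count and checkpoint names are illustrative, and the measured ratio will vary with hardware and batch size.

```python
import time
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# 30 s of silent audio at 16 kHz as a stand-in for real speech
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
audio = torch.zeros(16000 * 30).numpy()
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to(device)

def time_generate(model_id: str) -> float:
    model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device).eval()
    start = time.perf_counter()
    with torch.no_grad():
        # Force a fixed number of tokens so both models do comparable decoder work
        model.generate(features, min_new_tokens=128, max_new_tokens=128)
    return time.perf_counter() - start

t_teacher = time_generate("openai/whisper-large-v2")
t_student = time_generate("distil-whisper/distil-large-v2")
print(f"observed speed-up: {t_teacher / t_student:.1f}x")
```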
4. Usage
Checkpoints are released under the Distil-Whisper repository with a direct integration in 🤗 Transformers and an MIT license.
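For example, transcription with the 🤗 Transformers pipeline can look like the sketch below; the distil-large-v2 checkpoint name and the audio path are placeholders, so swap in the checkpoint and file you want to run.

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=device,
)

# Transcribe a local audio file (any format ffmpeg can decode); "sample.wav" is a placeholder
result = pipe("sample.wav")
print(result["text"])
```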
5. Training Code
Training code will be released in the Distil-Whisper repository this week, enabling anyone in the community to distill a Whisper model in their choice of language!