Introducing Distil-Whisper: 6x faster than Whisper, while performing within 1% WER of Whisper on out-of-distribution test data.

Through careful data selection and filtering, Whisper’s robustness to noise is maintained and hallucinations reduced.

For more information, refer to the Distil-Whisper paper and the Distil-Whisper repository.

Here’s a quick overview of how it works:

1. Distillation

The Whisper encoder performs a single forward pass, while the decoder performs one forward pass per generated token. As a result, the decoder accounts for >90% of total inference time, so reducing the number of decoder layers is far more effective than reducing encoder layers.
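As a rough illustration of this asymmetry, the sketch below times a single encoder forward pass against full autoregressive generation. It assumes the openai/whisper-large-v2 checkpoint and feeds random log-mel features purely for timing, so the transcription itself is meaningless:

```python
import time

import torch
from transformers import WhisperForConditionalGeneration

# Teacher checkpoint, assumed here purely for illustration.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
model.eval()

# Random 30 s log-mel spectrogram: (batch, n_mels, frames) = (1, 80, 3000).
input_features = torch.randn(1, 80, 3000)

with torch.no_grad():
    # One forward pass through the encoder.
    start = time.time()
    encoder_outputs = model.model.encoder(input_features)
    print(f"encoder forward pass: {time.time() - start:.2f}s")

    # Autoregressive decoding: one decoder forward pass per generated token.
    start = time.time()
    generated_ids = model.generate(input_features, max_new_tokens=128)
    print(f"full generation ({generated_ids.shape[-1]} tokens): {time.time() - start:.2f}s")
```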

With this in mind, we keep the whole encoder but use only 2 decoder layers. The resulting model is 6x faster. The student is trained with a weighted distillation loss, keeping the encoder frozen 🔒. This ensures Distil-Whisper inherits Whisper’s robustness to noise and to different audio distributions.
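A minimal sketch of this setup in 🤗 Transformers, assuming openai/whisper-large-v2 as the teacher; the loss weights and temperature below are illustrative placeholders, not the exact values from the training recipe:

```python
import torch
import torch.nn.functional as F
from transformers import WhisperConfig, WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Student config: identical to the teacher, but with only 2 decoder layers.
student_config = WhisperConfig.from_pretrained("openai/whisper-large-v2")
student_config.decoder_layers = 2
student = WhisperForConditionalGeneration(student_config)

# Copy the full encoder and the decoder embeddings / final layer norm from the teacher.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(teacher.model.decoder.layer_norm.state_dict())

# Initialise the 2 student decoder layers from the first and last teacher decoder layers.
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())

# Freeze the encoder 🔒 so the student inherits Whisper's robustness to noise.
for param in student.model.encoder.parameters():
    param.requires_grad = False


def distillation_loss(student_logits, teacher_logits, labels,
                      alpha_ce=1.0, alpha_kl=0.8, temperature=2.0):
    """Weighted sum of cross-entropy on the pseudo-labels and KL divergence to the
    teacher distribution (weights and temperature are illustrative values)."""
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return alpha_ce * ce + alpha_kl * kl
```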

Figure 1: Architecture of the Distil-Whisper model. We retain all 32 encoder layers, but only 2 decoder layers (the first and the last). This results in 6x faster inference speed.

2. Data

Distil-Whisper is trained on a diverse corpus of 22,000 hours of audio from 9 open-sourced datasets with permissive licenses. The training labels are pseudo-labels generated by Whisper. Importantly, a WER filter is applied so that only pseudo-labels scoring below 10% WER against the ground-truth transcriptions are kept. This is key to maintaining performance! 🔑
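A minimal sketch of such a filter, using the jiwer package to compute WER; the exact text normalisation and threshold handling in the Distil-Whisper training code may differ:

```python
import jiwer

WER_THRESHOLD = 0.10  # discard pseudo-labels with more than 10% WER against the ground truth


def keep_sample(ground_truth: str, pseudo_label: str) -> bool:
    """Return True if the Whisper pseudo-label is close enough to the ground-truth transcription."""
    # In practice, both strings would first be passed through the Whisper text normaliser.
    wer = jiwer.wer(ground_truth, pseudo_label)
    return wer <= WER_THRESHOLD


# Example: one substituted word out of six (~17% WER) is enough for a sample to be discarded.
print(keep_sample("the cat sat on the mat", "the cat sat on the mat"))  # True
print(keep_sample("the cat sat on the mat", "the cat sat on a mat"))    # False
```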

3. Results

Distil-Whisper is 6x faster than Whisper, while sacrificing only 1% WER on short-form evaluation. On long-form evaluation, Distil-Whisper even beats Whisper. We show that this is because Distil-Whisper hallucinates less.

4. Usage

Checkpoints are released under the Distil-Whisper repository with a direct integration in 🤗 Transformers and an MIT license.
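A minimal example of running a Distil-Whisper checkpoint with the 🤗 Transformers pipeline; the distil-whisper/distil-large-v2 checkpoint name and the audio file path are assumptions for illustration:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"  # checkpoint name assumed here

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Build a standard speech-recognition pipeline around the distilled model.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a local audio file (path is a placeholder).
result = pipe("audio.mp3")
print(result["text"])
```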

5. Training Code

Training code will be released in the Distil-Whisper repository this week, enabling anyone in the community to distill a Whisper model in their choice of language!