https://arxiv.org/abs/2311.10770
“UltraFastBERT”, apparently a variant of BERT that uses only 0.3% of its neurons during inference, performs on par with similar BERT models.
I hope that’s going to be available for all kinds of models in the near future!
The future is going to be interesting. With this kind of CPU speedup, we could run blazing-fast LLMs on a toaster, as long as it has enough RAM.
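
For anyone wondering where a number like 0.3% can come from: as far as I understand the paper, the feedforward neurons are arranged in a binary tree, and each input only evaluates the neurons along one root-to-leaf path. Here is a rough toy sketch of that idea in plain Python. It is not the authors' code. The names `make_tree` and `fff_forward`, the random weights, the GELU activation, the sign-based branching rule, and the depth of 11 are all my own assumptions for illustration.

```python
import math
import random


def make_tree(depth, dim, seed=0):
    """Random weights for a complete binary tree of neurons (hypothetical layout)."""
    rng = random.Random(seed)
    n_nodes = 2 ** (depth + 1) - 1  # complete binary tree, levels 0..depth
    w_in = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_nodes)]
    w_out = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_nodes)]
    return w_in, w_out


def gelu(z):
    # Exact GELU via the error function (assumed activation).
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def fff_forward(x, w_in, w_out, depth):
    """Evaluate only the neurons on a single root-to-leaf path."""
    y = [0.0] * len(x)
    node = 0   # start at the root
    used = 0   # how many neurons were actually evaluated
    for _ in range(depth + 1):
        z = sum(wi * xi for wi, xi in zip(w_in[node], x))
        a = gelu(z)
        for j, wo in enumerate(w_out[node]):
            y[j] += a * wo
        used += 1
        # Assumed branching rule: right child if pre-activation is positive.
        node = 2 * node + 1 + (1 if z > 0 else 0)
    return y, used


depth = 11
w_in, w_out = make_tree(depth, dim=8)
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(8)]
y, used = fff_forward(x, w_in, w_out, depth)
total = len(w_in)
print(f"{used} of {total} neurons used ({used / total:.2%})")
# prints "12 of 4095 neurons used (0.29%)"
```

With a tree of depth 11 there are 4095 neurons in total, but any single input touches only 12 of them, which is roughly the 0.3% mentioned above. The work per input grows with the depth of the tree rather than with the number of neurons, which is presumably where the CPU speedup comes from.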