https://arxiv.org/abs/2311.10770
“UltraFastBERT”, apparently a BERT variant that uses only 0.3% of its neurons during inference, performs on par with comparable BERT models.
I hope that’s going to be available for all kinds of models in the near future!
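Going by the abstract, the way it gets there is by swapping the dense feedforward layers for fast feedforward networks: each token is routed down a balanced binary tree, so only a logarithmic number of neurons ever fire. Here's a rough single-token sketch of that conditional routing in NumPy; the names, shapes, and ReLU are my own simplification, not the authors' code:

```python
import numpy as np

def fast_feedforward(x, node_w, node_b, leaf_w_in, leaf_w_out, depth):
    """Toy fast-feedforward inference for a single token vector x.

    Instead of multiplying x by the full intermediate weight matrix,
    walk a balanced binary tree of `depth` decision nodes and only
    evaluate the one leaf "neuron block" the token lands on.
    (Illustrative shapes only; not the UltraFastBERT implementation.)
    """
    node = 0  # root; internal nodes stored in heap order
    for _ in range(depth):
        # Each internal node is a tiny linear classifier deciding left vs. right.
        go_right = (x @ node_w[node] + node_b[node]) > 0
        node = 2 * node + (2 if go_right else 1)
    leaf = node - (2 ** depth - 1)  # index of the selected leaf
    # Only this leaf's weights are touched; the other leaves are never
    # loaded or multiplied.
    hidden = np.maximum(0.0, x @ leaf_w_in[leaf])  # GELU in the real model, ReLU here
    return hidden @ leaf_w_out[leaf]
```

With a depth around 12 and a few thousand leaves, each token touches a dozen tiny decision nodes plus one leaf block out of thousands, which is roughly where a figure like 0.3% comes from.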
The future is going to be interesting. With this kind of CPU speedup, we could run blazing-fast LLMs on a toaster, as long as it has enough RAM.
Would be interesting to see if this can help speed up CPU inference with regular RAM; after all, 128 GB of DDR5 only costs about $300, which is peanuts compared to getting anywhere near that much VRAM.
If it scales linearly, one could run a 100B model at the speed of a 3B one right now.
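Quick back-of-envelope on that (my own numbers, not from the paper): what compute per token a hypothetical 100B dense model would need if the whole model got the treatment, versus only the feedforward blocks (roughly two-thirds of a standard transformer's non-embedding parameters) while attention stays dense:

```python
# Back-of-envelope for a hypothetical 100B-parameter dense transformer (my assumptions).
total = 100e9
active = 0.003            # ~0.3% of FFN neurons used per token, per the paper
ffn_share = 2 / 3         # FFN blocks hold roughly 2/3 of non-embedding params

optimistic = total * active                                    # whole model made conditional
conservative = total * (ffn_share * active + (1 - ffn_share))  # only FFN blocks conditional

print(f"optimistic:   ~{optimistic / 1e9:.1f}B-equivalent compute per token")
print(f"conservative: ~{conservative / 1e9:.1f}B-equivalent (attention untouched)")
# -> ~0.3B vs ~33.5B: the "100B at 3B speed" guess lands somewhere in between,
#    depending on how much of the model the trick actually covers.
```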
Remind Me! 15 Day "40x BERT"
Basically GPT-4 Turbo.