https://arxiv.org/abs/2311.10770
“UltraFastBERT”, apparently a variant of BERT that uses only 0.3% of its neurons during inference, performs on par with similar BERT models.
I hope that’s going to be available for all kinds of models in the near future!
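
If you're curious how that 0.3% works out, here's a rough NumPy sketch of the "fast feedforward" idea as I read the paper: the hidden neurons are arranged as a binary tree, and the sign of each node's activation decides which child to evaluate next, so only one root-to-leaf path is ever computed. Weights are random here just to show the mechanism, and the shapes are illustrative, not the trained model:

```python
# Minimal sketch (my reading of the paper, not its actual code):
# a depth-11 tree has 2**12 - 1 = 4095 neurons, but a forward pass
# touches only the 12 on one root-to-leaf path (~0.3%).
import numpy as np

rng = np.random.default_rng(0)
d_model, depth = 768, 11
n_nodes = 2 ** (depth + 1) - 1          # 4095 neurons in the whole tree

W_in = rng.standard_normal((n_nodes, d_model)) * 0.02   # one neuron per node
W_out = rng.standard_normal((n_nodes, d_model)) * 0.02

def fff_forward(x):
    """Evaluate only the neurons on one root-to-leaf path."""
    y = np.zeros(d_model)
    node = 0
    for _ in range(depth + 1):
        act = W_in[node] @ x                # one dot product, not a matmul
        y += max(act, 0.0) * W_out[node]    # ReLU here; the paper uses GeLU
        node = 2 * node + 1 + (act > 0)     # sign of act picks the child
    return y

out = fff_forward(rng.standard_normal(d_model))
print(out.shape)    # (768,) -- computed with 12 of 4095 neurons
```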
If it works that way, it will only be short-term. The only reason it doesn't run well on a GPU is the lack of support for conditional matrix ops, so the GPU makers will just add them. Then they'll be back on top with the same margins again.
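
To be concrete about what "conditional matrix ops" means here: every sample in a batch takes its own path down the tree, so the dense W @ X that GPUs are built around turns into a per-sample gather plus independent dot products with data-dependent branching. A rough PyTorch sketch (shapes illustrative, nothing like the paper's actual kernels):

```python
# Sketch of "conditional matrix multiplication": each sample selects a
# *different* weight row at every tree level, so there is no single
# dense matmul to hand to the GPU -- just gathers and branchy indexing.
import torch

B, d_model, depth = 32, 768, 11
n_nodes = 2 ** (depth + 1) - 1
W_in = torch.randn(n_nodes, d_model) * 0.02

x = torch.randn(B, d_model)
node = torch.zeros(B, dtype=torch.long)   # each sample sits at its own node

for _ in range(depth + 1):                # one step per tree level
    w = W_in[node]                        # gather: (B, d_model), rows differ per sample
    act = (w * x).sum(dim=-1)             # B independent dot products, not a matmul
    node = 2 * node + 1 + (act > 0).long()  # data-dependent branch per sample
```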
Also, they say the speedup decreases with more layers, so the bigger the model, the less the benefit. A 512B model is much bigger than a 7B model, so the speedup will be much less, possibly none.
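
Back-of-envelope on that (my own Amdahl's-law arithmetic, not a number from the paper): only the feedforward layers get faster, so the overall gain is capped by the share of inference time spent in them. Plugging the paper's reported 78x FFN-level CPU speedup into some hypothetical FFN time shares:

```python
# Amdahl's law: f = fraction of inference time in FFN layers,
# s = speedup of the FFN layers alone. The rest of the model is untouched.
def overall_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

for f in (0.9, 0.6, 0.4):                     # hypothetical FFN time shares
    print(f, round(overall_speedup(f, s=78), 2))
# 0.9 -> ~8.97x, 0.6 -> ~2.45x, 0.4 -> ~1.65x
```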