Has anyone read this new paper on arXiv yet? https://arxiv.org/abs/2311.10770
It looks very promising: a potential inference speedup of 30x in PyTorch, 117x when implemented in native CUDA, and an estimated maximum speedup of 341x.
As far as I understand, this is achieved by replacing the traditional feedforward layers with so-called fast feedforward layers, which evaluate only a small subset of their neurons for each input.
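To make the idea concrete, here is a rough sketch of how I understand such a layer to work at inference time. It is my own reading, not the authors' code: I assume the neurons are arranged in a balanced binary tree, that each visited neuron adds its contribution to the output, and that the sign of its pre-activation picks the child to visit next. The names `make_tree` and `fff_forward`, the GELU activation, and the random weights are assumptions for illustration only.

```python
import math
import random


def gelu(z):
    # Exact GELU via the error function (activation choice is an assumption).
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def make_tree(depth, dim_in, dim_out, seed=0):
    # Hypothetical helper: random weights for a balanced binary tree
    # holding 2**depth - 1 neurons, stored in heap order.
    rng = random.Random(seed)
    n_nodes = 2 ** depth - 1
    w_in = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(n_nodes)]
    w_out = [[rng.gauss(0, 1) for _ in range(dim_out)] for _ in range(n_nodes)]
    return w_in, w_out


def fff_forward(x, w_in, w_out, depth):
    # Walk a single root-to-leaf path: only `depth` neurons are evaluated,
    # no matter how many neurons the tree holds in total.
    y = [0.0] * len(w_out[0])
    node = 0
    visited = []
    for _ in range(depth):
        visited.append(node)
        score = dot(x, w_in[node])  # pre-activation of this neuron
        act = gelu(score)
        y = [yi + act * wi for yi, wi in zip(y, w_out[node])]
        # Heap indexing: left child if score <= 0, right child otherwise.
        node = 2 * node + 1 + (1 if score > 0 else 0)
    return y, visited


if __name__ == "__main__":
    depth = 12
    w_in, w_out = make_tree(depth, dim_in=8, dim_out=4)
    y, visited = fff_forward([1.0] * 8, w_in, w_out, depth)
    print(len(w_in), "neurons in the tree,", len(visited), "evaluated")
```

With a depth of 12 the tree holds 4095 neurons but only 12 are evaluated per input, a ratio of about 341, which seems to match the x341 estimate. The measured speedups are lower, presumably because this kind of per-input branching is harder to run efficiently than one dense matrix multiplication.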
Is there anyone here with real experience contributing to PyTorch or llama.cpp, or releasing open models? What do you make of this?