Has anyone already read this new article on ArXiv? https://arxiv.org/abs/2311.10770

Looks very promising: a claimed 30x inference speedup in PyTorch, 117x with a native CUDA implementation, and an estimated maximum speedup of 341x.

As far as I understand, this is achieved by replacing the traditional feedforward layers with so-called fast feedforward layers.
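For anyone who hasn't read the paper yet, the core idea of a fast feedforward layer is conditional execution: instead of evaluating every neuron in a wide layer, a balanced binary tree of routing neurons sends each input down one path, so inference touches only about log2(width) neurons instead of all of them. Here is a minimal, dependency-free sketch of that inference path; all class and variable names are my own illustration, not the paper's code:

```python
import random

class FastFeedForward:
    """Toy sketch of a fast feedforward (FFF) layer's inference path.

    A full layer of `2**depth` leaf neurons is organized under a binary
    tree of internal routing neurons; a forward pass evaluates only
    `depth` routing dot products plus one leaf neuron.
    """

    def __init__(self, in_dim, depth, seed=0):
        rng = random.Random(seed)
        self.depth = depth                       # tree depth; 2**depth leaves
        n_nodes = 2 ** depth - 1                 # internal routing neurons
        n_leaves = 2 ** depth
        # each internal node holds a weight vector deciding left vs. right
        self.node_w = [[rng.gauss(0, 1) for _ in range(in_dim)]
                       for _ in range(n_nodes)]
        # each leaf is a tiny linear neuron: weight vector + bias
        self.leaf_w = [[rng.gauss(0, 1) for _ in range(in_dim)]
                       for _ in range(n_leaves)]
        self.leaf_b = [rng.gauss(0, 1) for _ in range(n_leaves)]

    def forward(self, x):
        # descend the tree: `depth` dot products instead of 2**depth
        idx = 0
        for _ in range(self.depth):
            score = sum(w * xi for w, xi in zip(self.node_w[idx], x))
            idx = 2 * idx + (1 if score > 0 else 2)  # heap-style children
        leaf = idx - (2 ** self.depth - 1)           # index among the leaves
        pre = sum(w * xi for w, xi in zip(self.leaf_w[leaf], x))
        return max(0.0, pre + self.leaf_b[leaf])     # ReLU-style activation
```

So a layer with, say, 4096 leaf neurons (depth 12) costs only 12 routing dot products plus one leaf evaluation per token, which is where the large claimed speedups come from; the real implementation of course trains the whole tree end to end, which this sketch does not cover.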

Is there anyone here with real experience contributing to PyTorch or llama.cpp, or releasing open models? What do you make of this?

  • BalorNGB
    11 months ago

    I say:

    1. It takes a performance hit, but it remains to be seen whether scaling up to a much larger model can compensate for that.
    2. The model needs to be trained from scratch; apparently you cannot finetune an existing model into this architecture…