Found out about air_llm, https://github.com/lyogavin/Anima/tree/main/air_llm, which loads the model one layer at a time, so a 70B model with 80 layers needs only about 1.6 GB of VRAM per layer (the ~140 GB of fp16 weights spread over 80 layers works out to roughly that). There's about 30 MB for the KV cache, and I'm not sure where the rest goes.
Apparently it works with HF out of the box too. The weaknesses appear to be context length and speed, but anyway, anyone want to try Goliath 120B unquantized?
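For anyone curious, usage looks roughly like this, going by the repo's README; the class name `AirLLMLlama2` and the exact arguments are taken from there, so treat the details as assumptions if the API has changed since:

```python
# Minimal sketch of layer-by-layer inference with airllm (pip install airllm).
# Class/argument names follow the repo README and may differ across versions.
from airllm import AirLLMLlama2

MAX_LENGTH = 128

# Weights are pulled from the HF hub and loaded one transformer layer at a
# time during the forward pass, so peak VRAM stays near one layer's size.
model = AirLLMLlama2("garage-bAInd/Platypus2-70B-instruct")

input_text = ["What is the capital of the United States?"]
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(generation_output.sequences[0]))
```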
Hey there! I think this is doing offloading?
If so, it’s not a new thing. Check out https://huggingface.co/docs/accelerate/usage_guides/big_modeling for a guide with code and videos about it
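The standard Accelerate route is `device_map="auto"` with CPU/disk offload through transformers. A minimal sketch, assuming a GPU is available and using an example model id (swap in whatever checkpoint you actually want):

```python
# Minimal sketch of Accelerate's big-model offloading via transformers.
# device_map="auto" splits layers across GPU, CPU RAM, and disk as needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "meta-llama/Llama-2-70b-hf"  # example id; gated on the hub

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",         # let Accelerate place weights automatically
    offload_folder="offload",  # spill layers that don't fit to disk
)

inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```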
One of those cases where proving something can be done doesn't make it useful. This has to be one of the least efficient ways to do inference. Like the people who got Doom running on an HP printer: great, you did it, but it's the worst possible version.