Found out about air_llm (https://github.com/lyogavin/Anima/tree/main/air_llm), which loads one layer at a time, so each layer only needs about 1.6GB for a 70B with 80 layers. There's about 30MB for KV cache, and I'm not sure where the rest goes.
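
For a rough sanity check on the per-layer figure, here is a back-of-the-envelope sketch. It assumes fp16 weights (2 bytes per parameter) and that all parameters are spread evenly over the layers, which ignores embeddings and the output head; `per_layer_bytes` is just an illustrative helper, not something from air_llm.

```python
def per_layer_bytes(n_params, n_layers, bytes_per_param=2):
    """Approximate weight size of one layer, assuming an even split.

    Assumption: fp16 weights (2 bytes each) and no separate accounting
    for embeddings or the output head.
    """
    return n_params * bytes_per_param / n_layers


size = per_layer_bytes(70e9, 80)
print(f"{size / 1e9:.2f} GB, {size / 2**30:.2f} GiB")
```

Under those assumptions a layer comes out to roughly 1.75 GB, or about 1.63 GiB, which is in the same ballpark as the 1.6GB figure.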

Works with HF out of the box too, apparently. The weaknesses appear to be ctxlen and that it's gonna be slow, but anyway, anyone want to try Goliath 120B unquant?

  • fallingdowndizzyvrB
    10 months ago

    You can run a model of any size even without much RAM, as long as you have it on disk, which you'd need anyway. Use mmap. That maps the file as if it were RAM and runs directly off disk. It'll be slow as hell since it's now bound by disk I/O. But unless you have a ton of system RAM, the method described here is also bound by disk I/O.
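
To make the mmap idea concrete, here is a minimal sketch using only Python's standard library. The file layout is made up for illustration (a few fixed-size blocks of float32 values standing in for layers), and the names `write_toy_weights` and `layer_sums` are hypothetical. It is not how any real model format or loader works, just the general pattern of mapping a file and touching one slice at a time so the OS pages data in from disk on demand.

```python
import mmap
import struct

# Hypothetical toy sizes, not a real model.
LAYERS = 4
FLOATS_PER_LAYER = 1024
LAYER_BYTES = FLOATS_PER_LAYER * 4  # float32 is 4 bytes


def write_toy_weights(path):
    """Write LAYERS blocks of float32 values; block i is filled with float(i)."""
    with open(path, "wb") as f:
        for i in range(LAYERS):
            values = [float(i)] * FLOATS_PER_LAYER
            f.write(struct.pack(f"<{FLOATS_PER_LAYER}f", *values))


def layer_sums(path):
    """Map the file read-only and read one 'layer' at a time.

    Only the pages that are actually touched get read from disk;
    the whole file is never copied into memory up front.
    """
    sums = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for i in range(LAYERS):
                offset = i * LAYER_BYTES
                layer = struct.unpack_from(f"<{FLOATS_PER_LAYER}f", mm, offset)
                sums.append(sum(layer))
    return sums
```

Calling `write_toy_weights("toy.bin")` and then `layer_sums("toy.bin")` walks the file block by block. With a real model the per-layer work would be a forward pass rather than a sum, and throughput would be limited by how fast the disk can serve the pages.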