Fitting 70B models in a 4GB GPU. The whole model, no quants or distillation or anything!

vatsadev · 1 year ago


fallingdowndizzyvr · 1 year ago

You can run a model of any size even without much RAM, as long as you have it on disk, which you'd need anyway. Use mmap. That maps the file into memory as if it were RAM, so the model runs directly off disk. It'll be slow as hell since it's now bound by disk I/O. But unless you have a ton of system RAM, the method described here is also bound by disk I/O.
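
For anyone who hasn't used mmap, here's a rough sketch of the idea in Python. The file layout (a flat run of little-endian float32 values) and the helper names `write_dummy_weights`, `read_weight_slice`, and `demo` are made up for illustration; real model files have their own formats, and this is not how any particular inference library does it. The point is only that you can read a slice from the middle of a file without loading the whole thing into RAM.

```python
import mmap
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per float32


def write_dummy_weights(path, n_floats):
    """Write n_floats little-endian float32 values (0.0, 1.0, 2.0, ...).

    Hypothetical stand-in for a weights file; real formats differ.
    """
    with open(path, "wb") as f:
        for i in range(n_floats):
            f.write(struct.pack("<f", float(i)))


def read_weight_slice(path, start, count):
    """Return `count` float32 values starting at index `start`.

    The file is memory-mapped read-only, so the OS pages in only the
    parts that are actually touched instead of reading the whole file.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = start * FLOAT_SIZE
            return list(struct.unpack_from(f"<{count}f", mm, offset))


def demo():
    """Create a small temporary file and read four values from the middle."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_dummy_weights(path, 1024)
        return read_weight_slice(path, 512, 4)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

With a file much bigger than RAM, every access to a page that isn't cached turns into a disk read, which is where the disk I/O bottleneck comes from.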