Macs with 32GB of memory can run 70B models with the GPU.

fallingdowndizzyvr · 1 year ago

Macs with 32GB of memory can run 70B models with the GPU.

fallingdowndizzyvr · 1 year ago

That’s where context shifting comes into play. You don’t re-evaluate the entire context. You just process the additions.

“Previously, we had to re-evaluate the context when it becomes full and this could take a lot of time, especially on the CPU. Now, this is avoided by correctly updating the KV cache on-the-fly:”

https://github.com/ggerganov/llama.cpp/pull/3228