Macs with 32GB of memory can run 70B models with the GPU.

fallingdowndizzyvr · 2 years ago

Macs with 32GB of memory can run 70B models with the GPU.

a_beautiful_rhind · 2 years ago

Pretty cool hack. Beats CPU inference at those speeds for sure.

Aaaaaaaaaeeeee · 2 years ago

The bandwidth utilization isn't the best yet on the GPU; it's only about a third of the potential 400 GB/s.

The CPU RAM bandwidth utilization in llama.cpp, on the other hand, is nearly 100%. With my 32 GB of DDR4, I get 1.5 t/s with the 70B Q3_K_S model.

There will hopefully be more optimizations to speed this up.
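
For anyone who wants to sanity-check those numbers, here is a rough back-of-the-envelope sketch. It assumes token generation is purely memory-bandwidth bound, so every generated token reads all the weights once. The 30 GB model size and the 51.2 GB/s DDR4 figure are illustrative assumptions, not measurements from this thread, and `tokens_per_second` is just a hypothetical helper name.

```python
def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float,
                      utilization: float = 1.0) -> float:
    """Rough ceiling on generation speed when inference is bandwidth bound.

    Assumes each generated token requires reading all model weights once,
    so speed = (usable bandwidth) / (model size).
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s * utilization / model_size_gb


# Assumption: a 70B Q3_K_S file is roughly 30 GB (illustrative figure).
MODEL_GB = 30.0

# GPU case from the comment: about 1/3 of a 400 GB/s peak.
gpu_now = tokens_per_second(400.0, MODEL_GB, utilization=1 / 3)
gpu_peak = tokens_per_second(400.0, MODEL_GB)

# CPU case. Assumption: dual-channel DDR4-3200 at about 51.2 GB/s.
cpu_peak = tokens_per_second(51.2, MODEL_GB)

print(f"GPU at 1/3 utilization: {gpu_now:.1f} t/s")
print(f"GPU at full bandwidth:  {gpu_peak:.1f} t/s")
print(f"CPU at full bandwidth:  {cpu_peak:.1f} t/s")
```

Under those assumptions the CPU ceiling lands in the same ballpark as the 1.5 t/s reported above, and the GPU would have room to roughly triple if it ever got close to its full bandwidth.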

fallingdowndizzyvr · 2 years ago

I can’t wait for UltraFastBERT. If it delivers on its promise, it’s a game changer that will propel CPU inference to the front of the pack, with up to a 78x speedup for 7B models. The speedup decreases as the number of layers increases, but I’m hoping it’ll still be pretty significant at 70B.