I recently got a 32GB M1 Mac Studio. I was excited to see how big a model it could run. It turns out that's 70B. It's a Q3_K_S model, so the second-smallest 70B quantization in GGUF format, but it's still a 70B model.
As many people know, the Mac shouldn’t be able to dedicate that much RAM to the GPU. Apple limits it to 67%, which is about 21GB. This model is 28GB. So it shouldn’t fit. But there’s a solution to that thanks to these smart people here.
https://github.com/ggerganov/llama.cpp/discussions/2182
They wrote a program that patches that limit in the kernel, so you can set it to whatever you want. I cranked mine up to 92%, which on 32GB is about 29.5GB, enough to hold the 28GB model. I also do a couple of things to save RAM:
- I don't use the GUI. Simply logging in and doing nothing uses a fair amount of RAM, so I run my Mac headless and ssh in.
- I stopped the mds_stores process from running. It was using between 500MB and 1GB of RAM. It's the process that indexes the drives for faster search, and considering my drive is 97% empty, I don't know what it was doing with 1GB of RAM. I normally turn off indexing on all my machines anyway.
With all that set, the highest in-use memory I've seen is 31.02GB while running the 70B Q3_K_S model, so there's headroom. There may be a lot more, but since my goal is to avoid swapping, I haven't pushed it further. I noticed that when I log into the GUI while it's running a model, compressed RAM goes up to around 750MB, but it still doesn't swap. So I wonder how far memory compression would let me stretch it. I do notice that it's not as snappy: with no GUI login, the model starts right away once it's cached after the first run; with a GUI login, it pauses for a few seconds.
As for performance, it's 14 t/s for prompt processing and 4 t/s for generation using the GPU; it's 2 and 2 using the CPU. Power consumption is remarkably low. Using the GPU, powermetrics reports 39 watts for the entire machine, but my wall monitor says it's drawing 79 watts from the wall. Using the CPU, powermetrics reports 36 watts and the wall monitor says 63 watts. I don't know why the gap between the machine and the wall is so much bigger on the GPU: it's only a 3 watt difference inside the machine but 16 watts at the wall.
All in all, I'm super impressed. The M1 32GB Studio may be the runt of the Mac Studio lineup but considering that I paid about what a used 3090 costs on ebay for a new one, I think it's the best performance per dollar I can get for running LLMs. Since I plan on running this all out 24/7/365, the power savings alone compared to anything else with a GPU will be several hundred dollars a year.
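(Rough numbers on that last point, using my own assumptions: if a 3090 box pulls something like 300 watts more at the wall under sustained load, that's about 300W × 8,760 hours ≈ 2,600 kWh a year, which at a typical $0.15/kWh works out to roughly $400.)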
As so often happens, the real LPT is in the comments. Using sysctl to change vram allocation is amazing. Thanks for this post.
Absolutely, that's a much better way to do it. But that was a recent development, more recent than this thread; the GitHub post about doing it that way went up about 3 hours after I posted this.
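For anyone who wants to script it, here's a minimal C sketch using sysctlbyname(). I'm assuming the key is iogpu.wired_limit_mb, which is what newer macOS releases on Apple Silicon reportedly use (older ones exposed it under a different debug.iogpu name), so double check against the GitHub discussion above. You need root to change it, and the value resets on reboot.

```c
// Minimal sketch: read (and optionally raise) the GPU wired-memory limit via
// sysctlbyname(). The key name iogpu.wired_limit_mb is an assumption based on
// newer macOS releases; check the llama.cpp discussion before relying on it.
#include <stdio.h>
#include <stdlib.h>
#include <sys/sysctl.h>

int main(int argc, char **argv) {
    const char *key = "iogpu.wired_limit_mb";   // assumed sysctl key, value in MB
    long long current = 0;
    size_t len = sizeof(current);

    if (sysctlbyname(key, &current, &len, NULL, 0) != 0) {
        perror("sysctlbyname (read)");
        return 1;
    }
    printf("%s = %lld MB (%zu-byte value)\n", key, current, len);

    if (argc > 1) {   // e.g. ./wired_limit 29000 to allow ~29GB on a 32GB machine
        long long v64 = atoll(argv[1]);
        int v32 = (int)v64;
        // Match whatever width the kernel reported for the read.
        void *newp   = (len == sizeof(int)) ? (void *)&v32 : (void *)&v64;
        size_t newlen = (len == sizeof(int)) ? sizeof(v32) : sizeof(v64);

        if (sysctlbyname(key, NULL, NULL, newp, newlen) != 0) {
            perror("sysctlbyname (write), needs root");
            return 1;
        }
        printf("%s set to %lld MB (until reboot)\n", key, v64);
    }
    return 0;
}
```

The stock sysctl command-line tool does the same thing, which is why the comment above calls it the better approach.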
There are no new 3090s, so comparing the cost to a new 3090 is pointless; basically all that's left are scalped, overpriced "new" 3090s.
I’m not comparing it to the cost of a new 3090. I clearly said I was comparing it to the price of a used 3090.
“The M1 32GB Studio may be the runt of the Mac Studio lineup but considering that I paid about what a used 3090 costs on ebay for a new one”
Awesome! I think I remember us talking about this at some point, but I didn't have the courage to try it on my own machine. You're the first person I've seen actually do the deed, and now I want to as well =D The 192GB Mac Studio stops at 147GB… I also run headless, so I can't fathom that this stupid brick really needs 45GB of RAM to do normal stuff lol.
I am inspired. I’ll give it a go this weekend! Great work =D
Pretty cool hack. Beats CPU inference at those speeds for sure.
The bandwidth utilization on the GPU isn't the best yet; it's only about a third of the potential 400GB/s.
The CPU RAM bandwidth utilization in llama.cpp, on the other hand, is nearly 100%. With my 32GB of DDR4, I get 1.5 t/s on the 70B Q3_K_S model.
There will hopefully be more optimizations to speed this up.
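To see where those utilization estimates come from, here's a quick back-of-envelope sketch. It assumes generation is memory-bandwidth bound and that every token streams roughly the full 28GB of weights, which is a simplification; the ~45GB/s figure for dual-channel DDR4 is also my assumption and depends on the actual memory speed.

```c
// Back-of-envelope: if token generation is memory-bandwidth bound and each token
// streams roughly all of the weights, then tokens/sec ~= bandwidth / model size.
// The 28GB model size and the 4 / 1.5 t/s figures are from this thread; the
// 400GB/s and ~45GB/s bandwidth numbers are theoretical peaks.
#include <stdio.h>

static double tokens_per_sec(double bandwidth_gb_s, double model_gb) {
    return bandwidth_gb_s / model_gb;
}

int main(void) {
    const double model_gb = 28.0;   // 70B Q3_K_S

    // GPU: full 400 GB/s would allow ~14 t/s; observed 4 t/s implies ~112 GB/s,
    // i.e. roughly a third of the peak.
    printf("GPU ceiling at 400 GB/s: %.1f t/s\n", tokens_per_sec(400.0, model_gb));
    printf("Implied bandwidth at 4 t/s: %.0f GB/s\n", 4.0 * model_gb);

    // CPU: ~45 GB/s of dual-channel DDR4 would allow ~1.6 t/s, so 1.5 t/s is
    // close to saturating it.
    printf("DDR4 ceiling at ~45 GB/s: %.1f t/s\n", tokens_per_sec(45.0, model_gb));
    return 0;
}
```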
I can't wait for UltraFastBERT. If that delivers on its promise, it's a game changer that will propel CPU inference to the front of the pack: for 7B models, up to a 78x speedup. The speedup decreases as the number of layers increases, but I'm hoping that at 70B it'll still be pretty significant.