Cranking the performance on M3 Max

blackstonewine · 1 year ago

Cranking the performance on M3 Max

SomeOddCodeGuy · 1 year ago

I’ll be interested to see what responses you get, but I’m gonna come out and say that the Mac’s power is NOT its speed. Pound for pound, a CUDA video card is going to absolutely leave our machines in the dust.

So, with that said, I actually think your 20 tokens a second is kind of great. I mean, my M2 Ultra is two M2 Max processors stacked on top of each other, and I get the following for Mythomax-l2-13b:

llama.cpp directly:
- Prompt eval: 17.79 ms per token, 56.22 tokens per second
- Eval: 28.27 ms per token, 35.38 tokens per second
- 565 tokens in 15.86 seconds: 35.6 tokens per second

llama-cpp-python in Oobabooga:
- Prompt eval: 44.27 ms per token, 22.59 tokens per second
- Eval: 27.92 ms per token, 35.82 tokens per second
- 150 tokens in 5.18 seconds: 28.95 tokens per second

So you’re actually doing better than I’d expect an M2 Max to do.
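For anyone comparing their own numbers against the figures above, here is a minimal sketch of the arithmetic behind them. The function names are made up for illustration and are not part of llama.cpp or Oobabooga.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to a rate in tokens/s."""
    return 1000.0 / ms_per_token


def overall_rate(tokens: int, seconds: float) -> float:
    """End-to-end throughput: tokens produced divided by wall-clock seconds."""
    return tokens / seconds


if __name__ == "__main__":
    # Eval latency reported for llama.cpp directly (28.27 ms per token).
    print(f"{tokens_per_second(28.27):.2f} tokens/s")
    # Whole-run figure for the same test (565 tokens in 15.86 seconds).
    print(f"{overall_rate(565, 15.86):.2f} tokens/s")
```

The per-token conversion can differ from the logged rate in the last digit, because the log rounds the millisecond figure before printing it.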

fallingdowndizzyvr · 1 year ago

You don’t say what quant you are using, if any. But on Q4_K_M I get this on my M1 Max using pure llama.cpp.

llama_print_timings: prompt eval time = 246.97 ms / 10 tokens ( 24.70 ms per token, 40.49 tokens per second)

llama_print_timings: eval time = 28466.45 ms / 683 runs ( 41.68 ms per token, 23.99 tokens per second)
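If you want to compare runs without reading the log by eye, a rough sketch like the one below pulls the numbers out of lines in the format shown above. The regular expression is written against those two lines only and is an assumption about the format; other llama.cpp versions may print timings differently.

```python
import re

# Matches lines such as:
# llama_print_timings: eval time = 28466.45 ms / 683 runs ( ... )
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<stage>[a-z ]+?) time\s*=\s*"
    r"(?P<ms>[\d.]+) ms\s*/\s*(?P<n>\d+) (?:tokens|runs)"
)


def parse_timing(line: str):
    """Return stage name and recomputed rates, or None if the line does not match."""
    m = TIMING_RE.search(line)
    if m is None:
        return None
    total_ms = float(m.group("ms"))
    count = int(m.group("n"))
    return {
        "stage": m.group("stage"),
        "ms_per_token": total_ms / count,
        "tokens_per_second": count * 1000.0 / total_ms,
    }


if __name__ == "__main__":
    sample = (
        "llama_print_timings: eval time = 28466.45 ms / 683 runs "
        "( 41.68 ms per token, 23.99 tokens per second)"
    )
    print(parse_timing(sample))
```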

Your M3 has lower memory bandwidth than my M1. It’s the 300 GB/s version versus my 400 GB/s.
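To show why bandwidth matters here, this is a back-of-the-envelope sketch using the common rule of thumb that generating each token means reading roughly all the model weights from memory once, so bandwidth divided by model size gives a ceiling on tokens per second. The 7.9 GB model size is an assumed approximate figure for a 13B Q4_K_M file, not something stated in this thread, and real speeds will land below the ceiling.

```python
def bandwidth_bound_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on generation speed, assuming every token
    requires one full pass over the weights in memory."""
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    MODEL_SIZE_GB = 7.9  # assumption: approximate size of a 13B Q4_K_M file
    for name, bandwidth in [("400 GB/s", 400.0), ("300 GB/s", 300.0)]:
        ceiling = bandwidth_bound_tps(bandwidth, MODEL_SIZE_GB)
        print(f"{name}: at most about {ceiling:.1f} tokens/s")
```

Whatever the exact model size, the ratio between the two ceilings is 300/400, so under this rule of thumb the 300 GB/s part tops out at about three quarters of the 400 GB/s part.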