Cranking the performance on M3 Max

blackstonewine · 1 year ago

fallingdowndizzyvr · 1 year ago

You don’t say what quant you are using, if any. But with Q4_K_M I get this on my M1 Max using pure llama.cpp:

llama_print_timings: prompt eval time = 246.97 ms / 10 tokens ( 24.70 ms per token, 40.49 tokens per second)

llama_print_timings: eval time = 28466.45 ms / 683 runs ( 41.68 ms per token, 23.99 tokens per second)

Your M3 Max has lower memory bandwidth than my M1 Max. Yours is the 300GB/s version, versus 400GB/s on mine.
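For a rough sense of what that bandwidth gap could mean, here is a back-of-the-envelope sketch in Python. It recomputes the eval speed from the timings above and scales it by the bandwidth ratio. The big assumption is that token generation is purely memory-bandwidth-bound, which is only an approximation. The function names are made up for illustration and are not part of llama.cpp.

```python
def tokens_per_second(total_ms: float, n_tokens: int) -> float:
    """Convert a llama.cpp-style timing (total ms, token count) to tokens/s."""
    return n_tokens / (total_ms / 1000.0)


def scale_by_bandwidth(tps: float, src_gb_s: float, dst_gb_s: float) -> float:
    """Estimate tokens/s on another machine, ASSUMING speed scales linearly
    with memory bandwidth (a simplification, not a measured result)."""
    return tps * dst_gb_s / src_gb_s


# Eval numbers from the M1 Max log above: 28466.45 ms for 683 runs.
m1_eval_tps = tokens_per_second(28466.45, 683)

# Hypothetical estimate for a 300GB/s M3 Max, given 400GB/s on the M1 Max.
m3_estimate_tps = scale_by_bandwidth(m1_eval_tps, src_gb_s=400, dst_gb_s=300)

print(f"M1 Max eval: {m1_eval_tps:.2f} tokens/s")
print(f"M3 Max (300GB/s) estimate: {m3_estimate_tps:.2f} tokens/s")
```

Under that assumption the 300GB/s machine would land at roughly three quarters of the M1 Max eval speed, so about 18 tokens per second for the same model and quant. Real numbers will vary with the model, context length, and llama.cpp build.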