I’m looking to get an M3 Max for reasons other than inference, but would like to factor in local LLM use, and would really appreciate some thoughts on a few configurations. The models I’m looking at are:

- a) M3 Max 16" 64GB with 16/CPU, 40/GPU, 16/NE, 400GBs memory bandwidth

- b) M3 Max 16" 96GB with 14/CPU, 40/GPU, 16/NE, 300GBs memory bandwidth

- c) M3 Max 14" 96GB with 14/CPU, 40/GPU, 16/NE, 300GBs memory bandwidth

(I want to keep the 14" and 16" 128GB options with 15/40/16 off the table, price-wise.)

My sense is the main LLM tradeoff comes down to RAM and bandwidth, with RAM dictating what models can effectively be loaded and bandwidth dictating speed in tokens/s. My intuition, very possibly wrong, is that if responsiveness or good interactivity isn’t the main factor, I can prefer RAM over bandwidth. That said, I would like to use 70B models if possible, but I’m also unclear whether 64GB of RAM can handle 70B models, only heavily quantized ones, or none at all; some rough sizing arithmetic is below.
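A quick back-of-envelope on the 64GB question. The bits-per-weight values here are rough assumptions for common llama.cpp-style GGUF quants, not exact file sizes:

```python
# Rough weight-only sizing for a 70B model at common quantization levels.
# Bits-per-weight are approximate (assumed values for GGUF-style quants).
QUANT_BITS = {"q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7, "q4_K_M": 4.8, "q3_K_M": 3.9}

def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB for a model with the given parameter count."""
    return params_billion * 1e9 * QUANT_BITS[quant] / 8 / 1e9

for quant in QUANT_BITS:
    print(f"70B {quant}: ~{weight_gb(70, quant):.0f} GB of weights")
# q8_0:   ~74 GB -> does not fit in 64GB RAM at all
# q4_K_M: ~42 GB -> squeezes under a ~48GB GPU memory cap, before context
```

So 64GB rules out 70B at q8 entirely; a ~4-bit quant fits, but with little headroom once context is added.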

I did see a few posts suggesting that’s possible, though not specifically for the M3 configs above (apologies if I missed them):

- M2 Max

- M3 Max, 128GB

- M3 vs M2 Max

- LLM Performance on M3 Max

The main non-LLM factor is a larger screen, and the default choice absent LLMs is option a, since 64GB covers my other workloads. But wanting headroom for 70B-class LLMs leans me toward option b, which trades up on RAM and down on bandwidth, since interactivity is (probably) less important at the moment than model size. I’m aware I might have built up some bad assumptions, though.

  • SomeOddCodeGuyB · 1 year ago

    I have a Mac Studio as my main inference machine.

    My opinion? RAM and bandwidth > all. Personally, I would pick a, as it’s the perfect in-between: at 64GB of RAM you should have around 48GB of usable VRAM without any kernel/sudo shenanigans (I’m excited to try some of the recommendations folks have given here lately to change that), and you get the 400GB/s bandwidth.
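    For rough numbers on where that 48GB comes from, here’s a sketch; the ~75% ratio is an assumption (the exact cap varies by machine and OS version):

    ```python
    # Rough estimate of how much unified memory Metal will let the GPU wire.
    # macOS caps GPU "wired" memory at roughly 75% of total RAM on
    # higher-RAM machines (assumed ratio; the exact cap varies by OS/model).
    def usable_vram_gb(total_ram_gb: float, ratio: float = 0.75) -> float:
        return total_ram_gb * ratio

    for ram in (64, 96, 128):
        print(f"{ram}GB RAM -> ~{usable_vram_gb(ram):.0f}GB usable by the GPU")
    # 64GB -> ~48GB, matching the figure above. The sudo shenanigans:
    # on recent macOS the cap can reportedly be raised with something like
    # `sudo sysctl iogpu.wired_limit_mb=57344` (leave headroom for the OS).
    ```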

    My Mac Studio has 800GB/s bandwidth, and I can run 70b q8 models… but at full context, it requires a bit of patience. I imagine a 70b would be beyond frustrating at 300GB/s bandwidth. While the 96GB model could run a 70b q8… I don’t really know that I’d want to, if I’m being honest.
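    The frustration is easy to ballpark: token generation is memory-bandwidth bound, since every new token streams all the active weights through the memory bus once. A rough upper bound (real throughput comes in lower once KV-cache reads and overhead are counted):

    ```python
    # Bandwidth-bound ceiling on generation speed: each token reads every
    # weight byte once, so tok/s <= bandwidth / model size in bytes.
    def max_tok_per_s(model_gb: float, bandwidth_gbs: float) -> float:
        return bandwidth_gbs / model_gb

    model_gb = 74  # ~70b at q8
    for bw in (800, 400, 300):
        print(f"{bw} GB/s: <= {max_tok_per_s(model_gb, bw):.1f} tok/s")
    # 800 GB/s: <= 10.8 tok/s  (the Studio: usable, with patience)
    # 400 GB/s: <=  5.4 tok/s
    # 300 GB/s: <=  4.1 tok/s  (why 70b at 300GB/s sounds painful)
    ```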

    My personal view is that on a laptop like that, I’d want to max out on the 34b models, as those are very powerful and would still run at a decent speed on the laptop’s bandwidth. So if all I was planning to run was 34b models, a 34b q8 with 16k context would fit cleanly into 48GB and I’d earn an extra 100GB/s of bandwidth for the choice.
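    To sanity-check that fit, here’s the arithmetic. The architecture numbers assume a Yi-34B-style model (60 layers, GQA with 8 KV heads, head dim 128); that’s an assumption, not a spec sheet:

    ```python
    # Does a 34b q8 with 16k of fp16 KV cache fit under ~48GB? K and V each
    # store layers * kv_heads * head_dim values per token position.
    def kv_cache_gb(layers=60, kv_heads=8, head_dim=128, ctx=16384, bytes_per=2):
        return 2 * layers * kv_heads * head_dim * ctx * bytes_per / 1e9

    weights_gb = 34 * 8.5 / 8              # ~36 GB at q8 (~8.5 bits/weight)
    total_gb = weights_gb + kv_cache_gb()  # ~36 + ~4 = ~40 GB
    print(f"~{total_gb:.0f} GB total -> fits under a ~48GB cap")
    ```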