Hi folks,
I’m currently torn between the MacBook Pro with the M3 Max and 64GB of memory and the 128GB version, primarily for running large-model inference locally.
Considering practicality and inference speed, is the 64GB variant the most cost-effective option for deploying most large-scale local models?
Additionally, for those who have experience in this domain, how essential is it to pursue local deployment of models at the scale of 70B fp16 or 180B 4-bit, or even larger? Is it overkill for most applications, or a worthy venture for future-proofing?
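For context, here’s my rough back-of-envelope math for the weight memory alone. It’s a minimal sketch that ignores KV cache, activations, and macOS’s own memory use, so real requirements run somewhat higher:

```python
# Rough weight-memory estimate: parameter count * bits per weight.
# Ignores KV cache, activations, and macOS overhead, so the real
# requirement is somewhat higher than these numbers.

def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"70B  fp16 : {weight_gb(70, 16):6.1f} GB")  # ~140 GB: over even 128GB
print(f"70B  4-bit: {weight_gb(70, 4):6.1f} GB")   # ~35 GB: fits in 64GB
print(f"180B 4-bit: {weight_gb(180, 4):6.1f} GB")  # ~90 GB: needs the 128GB
```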
I genuinely appreciate all insights and recommendations. Thanks in advance!
If you’re willing to spend a bit more and don’t need a laptop, I would get an M2 Ultra. I have a 96GB M2 MBP and inference can be a little poky, whereas it screams on the Ultra with 192GB. I would say 70B Q6 is a good place to be in terms of quality; according to perplexity measurements, the gain above that is pretty minuscule. I haven’t run the 180B-parameter models on the Ultra yet, as I’d like to try Airoboros and the GGUF isn’t out yet.
If you want something faster than an A6000, or that can effectively run more than one model at a time, you’re going to be disappointed here: I’ve found that llama.cpp completely maxes out the memory bandwidth of either of my machines with just one slot running continuous jobs.
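To put a rough number on that bandwidth ceiling: during single-stream decoding, every generated token has to read all the weights once, so tokens/sec can’t exceed memory bandwidth divided by model size. Here’s a minimal sketch; the bandwidth figures are Apple’s published specs for my two machines, and the Q6_K bits-per-weight is an approximation:

```python
# Rough decode-speed ceiling: each generated token reads all the
# weights once, so tok/s <= memory bandwidth / model size in bytes.
# Real throughput lands below this due to compute and other overhead.

def model_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def ceiling_tok_s(bandwidth_gb_s: float, size_gb: float) -> float:
    return bandwidth_gb_s / size_gb

q6_70b = model_gb(70, 6.56)  # Q6_K is roughly 6.56 bits/weight -> ~57 GB

for name, bw in [("M2 Max MBP (400 GB/s)", 400), ("M2 Ultra (800 GB/s)", 800)]:
    print(f"{name}: ~{ceiling_tok_s(bw, q6_70b):.0f} tok/s ceiling")

# Two models decoding at once split the same bandwidth, which is why
# a second slot doesn't buy extra throughput once it's saturated.
```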