If you’re willing to spend a bit more and don’t need a laptop, I would get an M2 Ultra. I have a 96GB M2 MBP and inference can be a little poky on it, whereas it screams on the Ultra with 192GB. I’d say 70B at Q6 is a good place to be in terms of quality; according to perplexity measurements, the improvement beyond that is pretty minuscule. I haven’t run the 180B models on the Ultra yet, as I’d like to try Airoboros and the GGUF isn’t out yet.
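For anyone curious what that looks like in practice, here’s a rough sketch using the llama-cpp-python bindings (the model path is just a placeholder for whatever Q6 quant you grab):

```python
# Minimal sketch: load a 70B Q6_K GGUF via llama-cpp-python with Metal offload.
# The model path is a placeholder for whatever quant you actually download.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q6_K.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU (Metal on Apple Silicon)
    n_ctx=4096,       # context window
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```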
If you want something faster than an A6000, or something that can effectively run more than one model at a time, you’re going to be disappointed here: I’ve found that llama.cpp completely maxes out the memory bandwidth of either of my machines with just one slot running continuous jobs.
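If you want to sanity-check that on your own machine, a quick-and-dirty timing like this (same llama-cpp-python setup, placeholder path) shows your single-stream tokens/sec; since decoding is memory-bandwidth bound, that number is roughly the ceiling whether you run one stream or several:

```python
# Quick-and-dirty single-stream throughput check with llama-cpp-python.
# Decode speed here is gated by memory bandwidth, so a second concurrent
# stream mostly splits this number rather than adding to it.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q6_K.gguf",  # placeholder path
    n_gpu_layers=-1,
)

start = time.time()
out = llm("Write a short poem about memory bandwidth.", max_tokens=128)
elapsed = time.time() - start

n = out["usage"]["completion_tokens"]
print(f"{n} tokens in {elapsed:.1f}s -> {n / elapsed:.1f} tok/s")
```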