So. My rig (Ryzen 7 3700X, 64 GB RAM, RTX 3070, Intel Arc 380) can run up to 70B parameter models… but they run at a snail's pace. Furthermore, I honestly don't see that big of an improvement for regular chat tasks from a 70B parameter model vs. a 13B parameter model. Don't get me wrong… there is sometimes an improvement in adherence, it's just not the GIANT leap forward I expected. That's especially true of the 30B-ish models: I see basically no difference between 30B and 70B. I run everything at Q5.

Here is my question… Would running a 70B at Q2 give better output than a 7B or 13B at Q5? And would the 70B run faster at Q2 than it does at Q5?
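
For a rough sense of the sizes involved, here is a back-of-the-envelope sketch, not a benchmark. It assumes the nominal bits per weight (real quant formats usually store somewhat more than the nominal figure), counts weights only (no context/KV cache or runtime overhead), and uses the RTX 3070's 8 GB of VRAM as the cutoff. The function name and the list of configurations are just for illustration.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Assumption: every weight takes exactly `bits_per_weight` bits.
    Real quantized files are usually a bit larger than this.
    """
    return params_billion * bits_per_weight / 8


VRAM_GB = 8  # RTX 3070

# (label, parameters in billions, nominal bits per weight)
configs = [
    ("7B  @ Q5", 7, 5),
    ("13B @ Q5", 13, 5),
    ("70B @ Q2", 70, 2),
    ("70B @ Q5", 70, 5),
]

for label, params, bits in configs:
    size = weight_size_gb(params, bits)
    where = "fits in VRAM" if size <= VRAM_GB else "spills into system RAM"
    print(f"{label}: ~{size:.1f} GB of weights, {where}")
```

By this estimate a 70B at Q2 is well under half the size of the same model at Q5, so it should have less data to move per token, but it is still about twice the size of a 13B at Q5 and far too big for the 3070 alone. The sketch says nothing about output quality at Q2.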

Also, I notice that Mistral models run faster on my machine than Llama models, even at the same parameter count… anyone know why?

I know I could theoretically run all these tests myself, but there is just so much to test and so little time. I figured I'd ask around and see if someone else has already done it.