The title, pretty much.
I’m wondering whether a 70b model quantized to 4bit would perform better than a 7b/13b/34b model at fp16. Would be great to get some insights from the community.
Early research suggested there was an inflection point below 4 bits, where quality got markedly worse. In my personal use, I find that accuracy definitely suffers below that point, though modern quantization methods may handle it a bit better.
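If you want to test the comparison yourself, here's a rough sketch (assuming the Hugging Face transformers + bitsandbytes stack; the model IDs are placeholders) of loading a 70B in 4-bit next to a 13B in fp16. The back-of-the-envelope math: 4-bit weights are ~0.5 bytes/param (~35 GB for 70B), fp16 is ~2 bytes/param (~26 GB for 13B).

```python
# Sketch, not a definitive setup: assumes transformers + bitsandbytes installed.
# Model IDs below are placeholders; substitute whatever you're actually running.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 70B quantized to 4-bit (NF4): roughly 0.5 bytes/param -> ~35 GB of weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_70b = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)

# 13B at fp16: roughly 2 bytes/param -> ~26 GB of weights.
model_13b = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",  # placeholder model ID
    torch_dtype=torch.float16,
    device_map="auto",
)
```

Running the same prompts through both is the most direct way to see whether the bigger quantized model wins for your use case.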
34B Yi does seem like a sweet spot, though I'm starting to suspect we need fine-tunes that include longer stories in the training data, because it doesn't seem able to maintain quality across the full context length. Still, being able to include callbacks to events from thousands of tokens earlier is impressively practical. I've been alternating between a fast 13B (for specific scenes), 34B Yi (for general writing), and a 70B (for when it needs to be smart and varied). And, of course, just switching models can sometimes help with repetition.