I recently found out about Chronos-Hermes 13B and have been trying to play around with it.
I’ve tried three formats of the model: GPTQ, GGML, and GGUF. It’s my understanding that GGML is the older, more CPU-oriented format, so I don’t use it much. Whenever I use the GGUF (Q5 quant) with KoboldCpp as the backend, I get incredible responses, but generation is extremely slow. Even offloading 32 layers to my GPU (and confirming it isn’t overrunning VRAM), it’s still slow. The GPTQ model, on the other hand, is way faster, but the response quality is worse.
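For reference, this is roughly what my GGUF setup looks like if you drive the same model through llama-cpp-python instead of the KoboldCpp launcher. The `n_gpu_layers` parameter is the same idea as KoboldCpp's GPU layer offload setting; the model filename and prompt are just placeholders:

```python
# Sketch of GPU-offloaded GGUF inference via llama-cpp-python.
# Assumes a CUDA (or Metal) enabled build of llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="chronos-hermes-13b.Q5_K_M.gguf",  # placeholder filename
    n_gpu_layers=32,  # equivalent of KoboldCpp's GPU layer offload
    n_ctx=4096,       # context window
)

out = llm("### Instruction: Say hello.\n### Response:", max_tokens=64)
print(out["choices"][0]["text"])
```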
My question is, are there any tricks to loading GPTQ models I might not be aware of?
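In case it helps to compare notes, here's roughly how I'm loading the GPTQ version right now, using AutoGPTQ. This is a minimal sketch, not a known-good recipe: the model directory is a placeholder, and `inject_fused_attention` is just one of the knobs I've been toggling.

```python
# Rough sketch of loading a GPTQ model with AutoGPTQ.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "TheBloke/chronos-hermes-13B-GPTQ"  # placeholder repo/dir

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
    inject_fused_attention=False,  # one of the toggles I've experimented with
)

prompt = "### Instruction: Say hello.\n### Response:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```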
exl2 has a similar problem. For some reason, even a lower-bpw GGUF seems to blow away exl2 in terms of quality.
https://www.reddit.com/r/LocalLLaMA/comments/17w57eu/llm_format_comparisonbenchmark_70b_gguf_vs_exl2/