I recently found out about Chronos-Hermes 13B and have been trying to play around with it.
I’ve tried three formats of the model: GPTQ, GGML, and GGUF. It’s my understanding that GGML is the older, more CPU-oriented format, so I don’t use it much. Whenever I use the GGUF (Q5 quant) with KoboldCpp as the backend, I get incredible responses, but generation is extremely slow. Even offloading 32 layers to my GPU (and confirming it isn’t overrunning VRAM), it’s still slow. The GPTQ model, on the other hand, is way faster, but the response quality is worse.
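For reference, this is roughly what my GGUF setup looks like if you drive the same model through llama-cpp-python instead of the KoboldCpp launcher. The `n_gpu_layers` parameter is the same idea as KoboldCpp's GPU layer offload setting; the model filename and prompt are just placeholders:

```python
# Sketch of GPU-offloaded GGUF inference via llama-cpp-python.
# Assumes a CUDA (or Metal) enabled build of llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="chronos-hermes-13b.Q5_K_M.gguf",  # placeholder filename
    n_gpu_layers=32,  # equivalent of KoboldCpp's GPU layer offload
    n_ctx=4096,       # context window
)

out = llm("### Instruction: Say hello.\n### Response:", max_tokens=64)
print(out["choices"][0]["text"])
```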
My question is, are there any tricks to loading GPTQ models I might not be aware of?
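In case it helps to compare notes, here's roughly how I'm loading the GPTQ version right now, using AutoGPTQ. This is a minimal sketch, not a known-good recipe: the model directory is a placeholder, and `inject_fused_attention` is just one of the knobs I've been toggling.

```python
# Rough sketch of loading a GPTQ model with AutoGPTQ.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "TheBloke/chronos-hermes-13B-GPTQ"  # placeholder repo/dir

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
    inject_fused_attention=False,  # one of the toggles I've experimented with
)

prompt = "### Instruction: Say hello.\n### Response:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```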
exl2 has a similar problem. For some reason, even a lower-bpw GGUF seems to blow away exl2 in terms of quality.
https://www.reddit.com/r/LocalLLaMA/comments/17w57eu/llm_format_comparisonbenchmark_70b_gguf_vs_exl2/