The title, pretty much.

I’m wondering whether a 70b model quantized to 4bit would perform better than a 7b/13b/34b model at fp16. Would be great to get some insights from the community.
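For a rough sense of the tradeoff, here's a back-of-the-envelope sketch of weight-only memory footprints (this ignores KV cache and runtime overhead, so treat the numbers as approximate lower bounds, not exact VRAM requirements):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory footprint in GB.

    Ignores KV cache, activations, and framework overhead.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# The comparison from the question: a 70b model at 4-bit vs a 13b model at fp16.
print(model_size_gb(70, 4))   # → 35.0 (GB)
print(model_size_gb(13, 16))  # → 26.0 (GB)
```

So a 4-bit 70b actually needs *more* memory than an fp16 13b, while keeping far more parameters — which is the heart of the question.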

  • semicausalB
    10 months ago

    In my experience, the lower the quantization bit-width, the model:

    - hallucinates more (one time I asked Llama2 what made the sky blue and it freaked out and generated thousands of similar questions line by line)

    - is more likely to give you an inaccurate response when it doesn’t hallucinate

    - is significantly less reliable and more non-deterministic (seriously, the same prompt can produce different answers!)

    At the bottom of this post, I compare the 2-bit and 8-bit extremes of the Code Llama Instruct model with the same prompt, and you can see how it played out: https://about.xethub.com/blog/comparing-code-llama-models-locally-macbook
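    To build intuition for why lower bit-widths degrade quality, here's a toy sketch of symmetric uniform quantization applied to random "weights" (this is a simplification, not the scheme any real quantizer like GPTQ or GGUF uses; it just shows how rounding error grows as bits shrink):

    ```python
    import numpy as np

    def quantize(x: np.ndarray, bits: int) -> np.ndarray:
        # Symmetric uniform quantization: snap each value to one of
        # 2**bits - 1 evenly spaced levels, then map back to floats.
        levels = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(x)) / levels
        q = np.round(x / scale).clip(-levels, levels)
        return q * scale

    rng = np.random.default_rng(0)
    weights = rng.normal(0, 1, 10_000)  # toy stand-in for a weight tensor

    errs = {}
    for bits in (2, 4, 8):
        errs[bits] = np.mean(np.abs(weights - quantize(weights, bits)))
        print(f"{bits}-bit mean abs error: {errs[bits]:.4f}")
    ```

    At 2 bits there are only three representable values, so most weights collapse to zero and the error is enormous; at 8 bits the rounding error is tiny. That gap is roughly what you're seeing in the comparison linked above.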