I am talking about this particular model:

https://huggingface.co/TheBloke/goliath-120b-GGUF

I specifically use: goliath-120b.Q4_K_M.gguf

I can run it on runpod.io on this A100 instance at a “humane” speed, but it is way too slow for generating long-form text.

https://preview.redd.it/fz28iycv860c1.png?width=350&format=png&auto=webp&s=cd034b6fb6fe80f209f5e6d5278206fd714a1b10

These are my settings in text-generation-webui:

https://preview.redd.it/vw53pc33960c1.png?width=833&format=png&auto=webp&s=0fccbeac0994447cf7b7462f65d79f2e8f8f1969

Any advice? Thanks

    • nero10578B
      10 months ago

      Wait, what? I am getting 2-3 t/s on 3x P40 running Goliath GGUF Q4_K_S.

    • abandonedexplorerOPB
      10 months ago

      Thanks, I will try this. I have no idea how these really work, which is why I am asking :)

    • Worldly-Mistake-8147B
      10 months ago

      Sorry for the slight side-track, but how much context are you able to squeeze into your 3 GPUs with Goliath’s 4-bit quant?
      I’m considering adding another 3090 to my own dual-GPU setup just to run this model.

      • panchovixB
        10 months ago

        I tested 4K context and it worked fine at 4.5 bpw. The max will probably be about 6K. I didn’t use the 8-bit cache.

        That said, 4.5 bpw is kind of overkill; ~4.12 bpw is roughly equivalent to 4-bit 128g GPTQ, and that would let you use a lot more context.
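
To put rough numbers on the bpw trade-off above, here is a minimal back-of-the-envelope sketch. It only estimates the size of the quantized weights; the parameter count (about 118 billion) and the 3x 24 GB VRAM total are assumptions for illustration, and real usage will be higher because of the KV cache, activations, and per-GPU overhead.

```python
def weight_gb(n_params: float, bpw: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bpw / 8 / 1e9


N_PARAMS = 118e9          # assumption: rough parameter count of a "120B" model
TOTAL_VRAM_GB = 3 * 24    # assumption: three 24 GB cards

for bpw in (4.5, 4.12):
    size = weight_gb(N_PARAMS, bpw)
    spare = TOTAL_VRAM_GB - size
    # "spare" is what is left for context (KV cache) and overhead
    print(f"{bpw} bpw: ~{size:.1f} GB weights, ~{spare:.1f} GB spare")
```

Under these assumptions, dropping from 4.5 bpw to about 4.12 bpw frees several GB, which is where the extra context would come from.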

          • panchovixB
            10 months ago

            Basically, for now, an X670E mobo + Ryzen 7 7800X3D.

            But if you want full speed, a server mobo with a ton of PCIe lanes will work better.