I’ve used most of the high-end models unquantized at some point or another (Xwin, Euryale, etc.) and found them generally pretty good experiences, but they always seem to lack the ability to “show, not tell” the way a strong writer does, even when prompted to. At the same time, I’ve always been rather dissatisfied with a lot of quantizations, as I’ve found the degradation in quality to be quite noticeable. So up until now, I’ve been running unquantized models on 2x A100s and extending the context as far as I can get away with.

Tried Goliath-120b the other day, and it absolutely turned everything on its head. Not only is it capable of stunning writing, implying far more than it directly states in a way I’m not sure I’ve seen in a model to date, but the EXL2 quants from Panchovix let it run on a single A100 at 9-10k extended context (about where RoPE scaling seems to universally start breaking down, in my experience). Best part is, if there is a quality drop (I’m using 4.85 bpw), I’m not seeing it - at all. So not only is it giving a better experience than an unquantized 70b model, it’s doing so at about half the cost of my usual way of running these models.
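
For anyone wanting to replicate the setup, this is roughly what loading an EXL2 quant with extended context looks like through ExLlamaV2’s Python API. Treat it as a sketch: the model path and alpha value are placeholders of mine, and attribute names can shift between exllamav2 versions.

    from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    config = ExLlamaV2Config()
    config.model_dir = "/models/goliath-120b-exl2-4.85bpw"  # placeholder path
    config.prepare()

    # Stretch the context window with NTK-aware RoPE scaling (alpha).
    config.max_seq_len = 10240
    config.scale_alpha_value = 3.0  # illustrative value; tune until coherence drops

    model = ExLlamaV2(config)
    model.load()  # a 4.85 bpw 120b fits on a single 80 GB A100

    tokenizer = ExLlamaV2Tokenizer(config)
    cache = ExLlamaV2Cache(model)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.9
    settings.top_p = 0.9

    print(generator.generate_simple("The harbor was quiet when she arrived.", settings, 200))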

Benchmarks be damned - for those willing to rent an A100 for their writing, however this model was put together, I think this might be the actual way to challenge the big closed-source/censored LLMs for roleplay.

  • a_beautiful_rhind · 1 year ago

    Hopefully someone makes a bigger GGUF than Q2. My rig is half P40s and half 3090s, so I can’t use EXL2 for a model this big.
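
    For reference, spreading a GGUF across a mixed multi-GPU rig looks something like this with llama-cpp-python; the path, quant, and split ratios are illustrative guesses, not tested values:

        from llama_cpp import Llama

        llm = Llama(
            model_path="/models/goliath-120b.Q2_K.gguf",  # hypothetical quant
            n_ctx=4096,
            n_gpu_layers=-1,                    # offload every layer to GPU
            tensor_split=[1.0, 1.0, 1.0, 1.0],  # even split across four cards
        )

        out = llm("The inn had one room left.", max_tokens=200, temperature=0.9)
        print(out["choices"][0]["text"])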

  • ArtifartX · 1 year ago

    What service do you use for GPU rental and inference?

  • Monkey_1505 · 1 year ago

    Unfortunately this is beyond the edge of what can reasonably be run on consumer hardware, so it’s unlikely to be easily available to most people. Hell, a 70b already really requires two graphics cards or a high-end Mac Mini. If it can’t run on that kind of gear, it’s probably not going to be on AI Horde or any API either, which means you have to use RunPod or something - and most people are not going to do that.

    • ttkciar · 1 year ago

      Nah, if you’re willing to tolerate CPU inference, this is achievable downright cheap.
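
      Something like this with llama-cpp-python runs a Q2 GGUF entirely on CPU; the path and thread count are assumptions, and expect low single-digit tokens per second on a 120b:

          from llama_cpp import Llama

          llm = Llama(
              model_path="/models/goliath-120b.Q2_K.gguf",  # hypothetical quant
              n_ctx=4096,
              n_threads=16,    # match your physical core count
              n_gpu_layers=0,  # pure CPU inference
          )

          out = llm("She counted the coins twice.", max_tokens=128, temperature=0.8)
          print(out["choices"][0]["text"])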

  • BalorNG · 1 year ago

    Can we have some non-cherry-picked examples of writing?

    It doesn’t have to be highly NSFW/whatever, but a comparison of Goliath’s writing against output from its constituent models at the same settings and the same (well-crafted) prompts would be very interesting to see - preferably at least 3 examples per model, given the inherent randomness of model output (a rough harness for this is sketched below)…

    If this really is a “night and day” difference, it should be apparent… I’m not sceptical per se, but “writing quality” is highly subjective, and the model’s style may simply mesh better with your personal preferences?
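
    A minimal harness for that kind of side-by-side, assuming GGUF builds of each model and llama-cpp-python (the model paths, prompt, and sampling settings are all placeholders):

        from llama_cpp import Llama

        MODELS = {  # hypothetical local paths
            "goliath-120b": "/models/goliath-120b.Q4_K_M.gguf",
            "xwin-70b": "/models/xwin-lm-70b.Q4_K_M.gguf",
            "euryale-70b": "/models/euryale-70b.Q4_K_M.gguf",
        }
        PROMPT = "Write the opening scene of a quiet mystery set in a lighthouse."

        for name, path in MODELS.items():
            llm = Llama(model_path=path, n_ctx=4096, n_gpu_layers=-1, verbose=False)
            for i in range(3):  # 3 samples per model to smooth out sampling noise
                out = llm(PROMPT, max_tokens=400, temperature=0.9, top_p=0.9)
                print(f"--- {name}, sample {i + 1} ---")
                print(out["choices"][0]["text"])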

  • multiverse_fan · 1 year ago

    Cool, sounds like a good model to download and store for later, when I can get access to better hardware.