• mcmoose1900B
    10 months ago

     I don’t know of a model that fits in a 3090 and takes that much time to inference on

    Yi-34B-200K is the base model I’m using. Specifically the Capybara/Tess tunes.

    I can squeeze 63K context on it at 3.5bpw. It's actually surprisingly good at continuing a full-context story, referencing details throughout and such.
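For anyone wondering how that could fit in 24 GB, here is a rough back-of-the-envelope sketch. The architecture numbers (60 layers, 8 KV heads, head dimension 128) and the 8-bit KV cache are my assumptions for Yi-34B, not something stated above, and the estimate ignores activations and framework overhead, so treat it as a ballpark only.

```python
# Rough VRAM estimate for a quantized model plus its KV cache.
# All architecture values below are ASSUMPTIONS for illustration.

GIB = 2**30

def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights, in GiB."""
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_gib(n_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: float) -> float:
    """Memory for the key and value caches (hence the factor 2), in GiB."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return values_per_token * bytes_per_value * n_tokens / GIB

# Assumed figures: ~34e9 parameters at 3.5 bits per weight,
# 63,000 tokens of context, 8-bit (1 byte) cache entries.
weights = weight_gib(34e9, 3.5)
cache_8bit = kv_cache_gib(63_000, n_layers=60, n_kv_heads=8,
                          head_dim=128, bytes_per_value=1)
cache_fp16 = kv_cache_gib(63_000, n_layers=60, n_kv_heads=8,
                          head_dim=128, bytes_per_value=2)

if __name__ == "__main__":
    print(f"weights:        {weights:.1f} GiB")
    print(f"8-bit KV cache: {cache_8bit:.1f} GiB")
    print(f"FP16 KV cache:  {cache_fp16:.1f} GiB")
```

Under those assumptions the weights come to roughly 13.9 GiB and an 8-bit cache to roughly 7.2 GiB, which leaves a little headroom on a 24 GB card, while an FP16 cache at the same context length would not fit.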

    Anyway, I am on Linux, so no GPU swap like on Windows. I am indeed using it for chat/novel-style writing, so the context does scroll and get cached in ooba.