I get 20 t/s with a 70B 2.5bpw model, but that is only 47% of the theoretical maximum for a 3090.

In comparison, the benchmarks on the exl2 GitHub page show 35 t/s, which is 76% of the theoretical maximum for a 4090.

The bandwidth difference between the two GPUs isn’t huge: the 4090’s is only 7-8% higher.
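For anyone who wants to check the percentages, here is a rough sketch of the arithmetic. It assumes token generation is purely memory-bandwidth bound (every weight is read once per token) and uses the published peak bandwidths of 936 GB/s for the 3090 and 1008 GB/s for the 4090. Those figures and the function names are my own assumptions for illustration, not something taken from the exl2 benchmarks.

```python
# Rough upper bound on tokens/s when generation is limited by memory bandwidth.
# Assumption: each generated token requires reading all model weights once.

# Published peak memory bandwidth in GB/s (spec-sheet values, not measured).
BANDWIDTH_GBPS = {"RTX 3090": 936.0, "RTX 4090": 1008.0}


def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a quantized model."""
    return params_billion * bits_per_weight / 8.0


def max_tokens_per_s(gpu: str, params_billion: float, bits_per_weight: float) -> float:
    """Theoretical ceiling: bandwidth divided by bytes read per token."""
    return BANDWIDTH_GBPS[gpu] / model_size_gb(params_billion, bits_per_weight)


def utilization(measured_tps: float, gpu: str, params_billion: float,
                bits_per_weight: float) -> float:
    """Fraction of the theoretical ceiling actually achieved."""
    return measured_tps / max_tokens_per_s(gpu, params_billion, bits_per_weight)


if __name__ == "__main__":
    for gpu, tps in (("RTX 3090", 20.0), ("RTX 4090", 35.0)):
        ceiling = max_tokens_per_s(gpu, 70, 2.5)
        share = utilization(tps, gpu, 70, 2.5)
        print(f"{gpu}: ceiling {ceiling:.1f} t/s, measured {tps:.0f} t/s, {share:.0%}")
```

With these inputs the 3090 lands at about 47% and the 4090 at about 76%, matching the numbers above. The estimate ignores KV cache reads and kernel overhead, so real ceilings are somewhat lower.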

Why? Does anyone else get a similar 20 t/s? I don’t think my CPU performance is the issue.

The benchmarks also show ~85% utilization for 34B at 4bpw (normal models).

  • AaaaaaaaaeeeeeOPB
    1 year ago

    8k context with 2.4bpw gives 20 t/s; VRAM usage shows 23.85/24.00 GB.

    16k context with 2.4bpw also gives 20 t/s with the fp8 cache.
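A quick sketch of why 16k with the fp8 cache can fit in roughly the same VRAM as 8k without it. It assumes a Llama-2-70B-style layout (80 layers, 8 KV heads, head dimension 128) and that the default cache is fp16; the thread does not say which 70B model is in use, so treat these numbers and the function name as illustrative only.

```python
# Estimate KV cache size for a given context length and cache precision.
# Assumed architecture (Llama-2-70B-style, not confirmed by the thread):
N_LAYERS = 80
N_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128


def kv_cache_bytes(context_len: int, bytes_per_element: int) -> int:
    """Bytes needed to store keys and values for every layer and position."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_element  # 2 = K and V
    return per_token * context_len


if __name__ == "__main__":
    fp16_8k = kv_cache_bytes(8192, 2)    # fp16 cache, 8k context
    fp8_16k = kv_cache_bytes(16384, 1)   # fp8 cache, 16k context
    print(f"8k fp16:  {fp16_8k / 2**30:.2f} GiB")
    print(f"16k fp8:  {fp8_16k / 2**30:.2f} GiB")
```

Under these assumptions both settings need the same cache size, since halving the bytes per element exactly offsets doubling the context.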

    I have 0.5-0.6 GB used for driving the monitor on Ubuntu.

    Did you disable the Nvidia system memory fallback that they pushed on Windows users? That’s probably what you need.