rihard7854B to LocalLLaMA@poweruser.forum · English · 1 year ago
NVidia H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM (github.com) · 24 comments
a_beautiful_rhind · English · 1 year ago
70B with 2048 context and a 128-token reply is about 303 t/s. That sounds more reasonable, assuming they aren't quantized. The batch size is just the theoretical batch, I think.
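As a rough sanity check on numbers like these, aggregate benchmark throughput is usually per-request decode speed multiplied by the concurrent batch size. The figures below are illustrative assumptions, not NVidia's published configuration:

```python
# Sketch of how batched decoding turns per-request speed into the
# aggregate tokens/sec a benchmark reports. The per-request rate and
# batch size here are hypothetical, chosen only to match the ~303 t/s
# figure mentioned in the comment above.

def aggregate_tps(per_request_tps: float, batch_size: int) -> float:
    """Aggregate decode throughput when `batch_size` requests stream
    concurrently at `per_request_tps` tokens/sec each."""
    return per_request_tps * batch_size

# e.g. a hypothetical ~37.9 t/s per request at batch 8 yields ~303 t/s
# aggregate for the 70B run discussed above.
print(aggregate_tps(37.9, 8))  # prints 303.2
```

This is also why headline numbers like "12,000 tokens/sec" depend heavily on the batch size used: a large theoretical batch inflates aggregate throughput even when each individual request streams much more slowly.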