I’m using an A100 PCIe 80GB, CUDA 11.8 toolkit, driver 525.x.
But when I run inference on CodeLlama 13B with oobabooga (web UI),
it only gets 5 tokens/s.
That is so slow.
Is there any config or setting I’m missing for the A100?
That sounds like CPU speed. What do you see from `watch -d -n 0.1 nvidia-smi` while you’re running inference?
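If you’d rather log it from Python, here’s a minimal sketch using pynvml (the `nvidia-ml-py` package); the device index and sampling interval are my assumptions:

```python
# Sketch: sample GPU utilization while inference runs in another process.
# Assumes the A100 is device 0 and nvidia-ml-py (pynvml) is installed.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumed device index

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {util.gpu}% | VRAM {mem.used / 2**30:.1f} GiB used")
        time.sleep(0.5)
finally:
    pynvml.nvmlShutdown()
```

If GPU utilization sits near 0% while tokens are streaming out, the model is actually running on the CPU.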
Something is wrong with your environment. Even P40s give more than that.
The other possibility is that the generation was too short to measure a proper t/s figure. What was the total inference time?
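To get a trustworthy number, time a longer generation yourself. A rough sketch with `transformers` (the model ID and prompt are placeholders, not from your setup):

```python
# Sketch: measure tokens/s over a longer generation window.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-13b-hf"  # assumed HF repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

inputs = tok("def quicksort(arr):", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=512)
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} t/s")
```

Averaging over 512 new tokens smooths out the one-time prompt-processing cost that skews short runs.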
Try GGUF; that format works well on a single GPU, especially since you have 80GB of VRAM. I think you could run a ~70GB GGUF with all layers on the GPU.
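For reference, a minimal llama-cpp-python sketch with full GPU offload (requires a CUDA build of llama-cpp-python; the model filename is a placeholder):

```python
# Sketch: load a GGUF with every layer offloaded to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-13b.Q8_0.gguf",  # placeholder filename
    n_gpu_layers=-1,  # -1 = offload all layers to the GPU
    n_ctx=4096,
)
print(llm("def fibonacci(n):", max_tokens=128)["choices"][0]["text"])
```

The key setting is `n_gpu_layers=-1`; if any layers stay on the CPU, throughput drops sharply.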