Hi everyone,
We’ve recently experimented with deploying the CodeLlama 34B model and wanted to share our key findings for those interested:
- Best Performance: 4-bit GPTQ-quantized CodeLlama-Python-34B model served with vLLM.
- Results: Lowest average latency of 3.51 sec, average generation rate of 58.40 tokens/sec, and a cold start time of 21.8 sec on our platform, using an NVIDIA A100 GPU.
- Other Libraries Tested: Hugging Face Transformers pipeline, AutoGPTQ, Text Generation Inference.
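
For anyone who wants to compare their own numbers, here is a rough sketch of how average latency and tokens/sec could be measured. It is not our exact benchmarking harness. `benchmark`, `generate`, and `fake_generate` are hypothetical names used only for illustration; `generate` stands in for whatever call runs one request against your serving library and returns the number of tokens produced.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(generate: Callable[[str], int], prompts: List[str]) -> Dict[str, float]:
    """Time each request and report average latency and average tokens/sec.

    `generate` is a hypothetical callable (an assumption, not a library API):
    it runs one request and returns the number of tokens it generated.
    """
    latencies = []
    rates = []
    for prompt in prompts:
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)          # seconds for this request
        rates.append(n_tokens / elapsed)   # tokens per second for this request
    return {
        "avg_latency_s": mean(latencies),
        "avg_tokens_per_s": mean(rates),
    }


def fake_generate(prompt: str) -> int:
    """Stand-in for a real model call: sleeps briefly, pretends to emit 32 tokens."""
    time.sleep(0.01)
    return 32


if __name__ == "__main__":
    print(benchmark(fake_generate, ["def fib(n):", "def quicksort(xs):"]))
```

To benchmark a real deployment, swap `fake_generate` for a function that calls your backend and counts the output tokens. Cold start time would need to be measured separately, since it covers model loading rather than per-request generation.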
Keen to hear your experiences and learnings in similar deployments!
Not yet, made a note. Will add it when I update the tutorial.