I’m using vLLM because its server is a drop-in replacement for the ChatGPT (OpenAI) API. If there is something else compatible with that API, let me know.

Problem 1: I cannot get anything larger than a 7B model to run in vLLM. I’m sure my parameters are wrong, but I cannot find any documentation. This is the command I’m using:

python3 -m vllm.entrypoints.openai.api_server --model /home/h/Mistral-7B-finetuned-orca-dpo-v2-AWQ --quantization awq --dtype auto --max-model-len 5000
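For anyone trying to judge whether a larger model can fit at all, here is a rough back-of-the-envelope sketch. It is only an estimate under stated assumptions: 4-bit AWQ weights with no overhead for scales, an fp16 KV cache, and the published layer/head counts for Llama-2-style 13B models (40 layers, 40 KV heads, head size 128). The function names are my own, not part of vLLM.

```python
# Rough memory estimate for a quantized model plus its KV cache.
# Assumptions (not measured): 4-bit weights without scale/zero-point
# overhead, fp16 KV cache, no activation or framework overhead.

GIB = 2**30


def weight_gib(n_params, bits_per_weight):
    """Approximate size of the model weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in GiB for n_tokens of context.

    The factor 2 accounts for storing both keys and values.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token / GIB


if __name__ == "__main__":
    # Assumed 13B architecture: 40 layers, 40 KV heads, head_dim 128.
    weights = weight_gib(13e9, 4)
    cache = kv_cache_gib(5000, n_layers=40, n_kv_heads=40, head_dim=128)
    print(f"weights ~{weights:.1f} GiB, KV cache for 5000 tokens ~{cache:.1f} GiB")
```

If the total comes out close to the memory vLLM is allowed to use, lowering --max-model-len is the first thing to try, since the cache term scales linearly with context length.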

Problem 2: Mistral-7B-finetuned-orca-dpo-v2-AWQ is the only one I got up and running with responses that make sense. However, there is a prompt being appended to everything I send to it:

### Human: Got any creative ideas for a 10 year old’s birthday?
### Assistant: Of course! Here are some creative ideas for a 10-year-old's birthday party: ... [It goes on quite a bit.]

Either because of that or for other reasons, it is not answering very basic questions. There are several threads about this on GitHub, but I was able to find zero actionable information in them.
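One way to take control of the prompt is to build it yourself and send it to the plain completions endpoint with a stop sequence, so the model does not keep going into another "### Human:" turn. This is a minimal sketch, assuming the server is on vLLM's default http://localhost:8000 and that this fine-tune really expects the "### Human: / ### Assistant:" format seen in the output above. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumptions: default vLLM host/port; model name equals the --model path.
BASE_URL = "http://localhost:8000/v1"
MODEL = "/home/h/Mistral-7B-finetuned-orca-dpo-v2-AWQ"


def format_prompt(user_text):
    """Wrap the user text in the template the model appears to use."""
    return f"### Human: {user_text}\n### Assistant:"


def build_payload(user_text, max_tokens=256):
    """Build an OpenAI-style /v1/completions request body."""
    return {
        "model": MODEL,
        "prompt": format_prompt(user_text),
        "max_tokens": max_tokens,
        "temperature": 0.7,
        # Stop before the model starts writing the next human turn.
        "stop": ["### Human:"],
    }


def complete(user_text):
    """Send the request and return only the generated text."""
    req = urllib.request.Request(
        BASE_URL + "/completions",
        data=json.dumps(build_payload(user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"].strip()
```

Calling complete("Got any creative ideas for a 10 year old’s birthday?") needs the server to be running. If the model card documents a different template, format_prompt is the only part that should need changing.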

Problem 3: CodeLlama-13B-Python-AWQ just blasted a bunch of hashtags and gobbledygook back at me. It has the same problem with the prompt, too.
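A possible explanation, offered as a guess: the Python variants of Code Llama are code-completion models, not instruction-tuned chat models, so a chat-style prompt may simply be out of distribution for them. A quick sanity check is to send a raw code prefix to the completions endpoint and see whether the continuation is sensible. The sketch below assumes vLLM's default port, and the model name is a placeholder for whatever was passed to --model.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: default vLLM host/port
MODEL = "CodeLlama-13B-Python-AWQ"      # placeholder: use the --model value

# A plain code prefix, with no chat template around it.
CODE_PREFIX = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'


def build_code_payload(code_prefix, max_tokens=128):
    """Build a /v1/completions body that asks the model to continue code."""
    return {
        "model": MODEL,
        "prompt": code_prefix,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for a repeatable check
        # Stop when the model begins a new top-level definition.
        "stop": ["\ndef ", "\nclass "],
    }


def complete_code(code_prefix):
    """Send the prefix to the server and return the generated continuation."""
    req = urllib.request.Request(
        BASE_URL + "/completions",
        data=json.dumps(build_code_payload(code_prefix)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

If complete_code(CODE_PREFIX) returns a reasonable function body, the model and quantization are probably fine and the garbage comes from the prompt format. If it still returns noise, the weights or the quantization settings are the more likely suspect.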

I am running this on an Ubuntu Server VM (16 cores, 48 GB RAM) right now so I don’t take up any VRAM, but I can switch to Windows if necessary.