Need help estimating if my speed is expected. Llama_index

Noxusequal · 1 year ago

harrro · 1 year ago

After the document/PDF is already indexed, generating a 256 token answer should take a few seconds (assuming you’re using a 7-13B model).

Check that CUDA is being used (check your video card’s RAM usage to see if the model is loaded into VRAM).
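One way to run that VRAM check from a script is to ask `nvidia-smi` for used and total memory. This is only a rough sketch: it assumes an NVIDIA driver with `nvidia-smi` on the PATH, and the helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (MiB) from nvidia-smi's headerless CSV output."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...] per GPU, or [] if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA tooling found on this machine
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        return []  # nvidia-smi exists but could not talk to the driver
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for index, (used, total) in enumerate(query_gpu_memory()):
        print(f"GPU {index}: {used} / {total} MiB used")
```

Run it once before loading the model and once after. If usage barely moves, the weights are probably sitting in system RAM and generation is running on the CPU.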

Noxusequal · 1 year ago

I know that CUDA is being used: VRAM is full and I get the message at startup. What is your hardware setup?

Do you also use llama_index and then langchain, or did you build it more or less from llama_cpp and langchain, without llama_index?

harrro · 1 year ago

I’m using langchain with qdrant as the vector store.

"VRAM is full"

How is a 7B model maxing out your VRAM? A 7B model at 4-bit and 4k context should not fill the 12 GB of VRAM on a 3060.
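A back-of-the-envelope version of that estimate, as a rough sketch only. It assumes a Llama-2-7B-like shape (32 layers, hidden size 4096, full multi-head attention) and an fp16 KV cache. It ignores activation buffers, the embedding model, and the per-block overhead of real 4-bit formats, so actual usage will be somewhat higher. `estimate_vram_gib` is a hypothetical helper, not part of any library.

```python
def estimate_vram_gib(n_params, bits_per_weight, n_layers, hidden_size,
                      context_len, kv_bytes_per_value=2):
    """Return (weights_gib, kv_cache_gib) for a decoder-only transformer.

    Assumes full multi-head attention, so every layer stores one key vector
    and one value vector of size hidden_size for each token in the context.
    """
    weight_bytes = n_params * bits_per_weight / 8
    # Factor 2 covers keys and values.
    kv_bytes = 2 * n_layers * context_len * hidden_size * kv_bytes_per_value
    gib = 2 ** 30
    return weight_bytes / gib, kv_bytes / gib


if __name__ == "__main__":
    weights, kv_cache = estimate_vram_gib(
        n_params=7e9,        # assumed: "7B" taken as 7 billion parameters
        bits_per_weight=4,   # 4-bit quantization, overhead ignored
        n_layers=32,         # assumed Llama-2-7B-like shape
        hidden_size=4096,    # assumed Llama-2-7B-like shape
        context_len=4096,    # 4k context
    )
    print(f"weights  ~{weights:.2f} GiB")
    print(f"KV cache ~{kv_cache:.2f} GiB")
    print(f"total    ~{weights + kv_cache:.2f} GiB")
```

Under these assumptions the weights come to roughly 3.3 GiB and a full 4k fp16 KV cache to 2 GiB, so about 5.3 GiB in total. That is well under 12 GB, but it would nearly fill a 6 GB card.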

Noxusequal · 1 year ago

It's a 3060 laptop GPU, so only 6 GB, and the model plus embeddings etc. come to about 5.8 GB.