Hello, I'm currently trying to set up a RAG pipeline and I've noticed that as my prompts get longer and filled with context, the tokens per second drop drastically, from roughly 10-20 down to 2 or less. I'm running a Q4-quantized Llama 2 7B model with llama.cpp (CUDA) on my laptop RTX 3060 with 6 GB of VRAM.
I don't understand why this is happening, and it makes the responses painfully slow. Of course I expect the total time needed to process a longer prompt to increase, but why does the time per token go up as well?
I'd love to hear whether this is normal and, if not, what I might do about it.
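For reference, here is roughly how I'm loading the model. This is a minimal sketch assuming the LlamaIndex LlamaCPP wrapper (since I'm calling llm.complete()); the model path and parameter values are placeholders, not my exact settings:

from llama_index.llms.llama_cpp import LlamaCPP  # import path may differ depending on the llama-index version

llm = LlamaCPP(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path to the Q4 GGUF file
    temperature=0.1,
    max_new_tokens=256,
    context_window=2048,
    model_kwargs={"n_gpu_layers": 20},  # number of layers offloaded to the 6 GB GPU (placeholder value)
    verbose=True,  # verbose llama.cpp output is where the timing lines below come from
)

response = llm.complete("tell me what is a cat")
print(response.text)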
Here is an exact example:
llm.complete("tell me what is a cat using the following context, here is a bit of context about cats: Female domestic cats can have kittens from spring to late autumn in temperate zones and throughout the year in equatorial regions, with litter sizes often ranging from two to five kittens. Domestic cats are bred and shown at events as registered pedigreed cats, a hobby known as cat fancy. Animal population control of cats may be achieved by spaying and neutering, but their proliferation and the abandonment of pets has resulted in large numbers of feral cats worldwide, contributing to the extinction of bird, mammal and reptile species. ")
llama_print_timings: prompt eval time = 80447.41 ms / 153 tokens ( 525.80 ms per token, 1.90 tokens per second)
Compared to:
llm.complete("tell me what is a cat")
llama_print_timings: prompt eval time = 319.16 ms / 4 tokens ( 79.79 ms per token, 12.53 tokens per second)