Hello everyone, I am currently trying to set up a small 7B Llama 2 chat model. The unquantized full-precision version runs, but only very slowly in PyTorch with CUDA. I have an RTX 3060 laptop with 16 GB of RAM. The model takes about 5-8 minutes to reply to the example prompt:

I liked “Breaking Bad” and “Band of Brothers”. Do you have any recommendations of other shows I might like?
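
For context, my loading code looks roughly like this (the repo id and generation settings here are illustrative, not my exact script):

```python
# Rough sketch of how I'm loading the unquantized model in PyTorch.
# The repo id is an example; needs `accelerate` installed for device_map.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights for 7B are still ~13-14 GB
    device_map="auto",          # spills layers to CPU RAM when VRAM runs out
)

prompt = ('I liked "Breaking Bad" and "Band of Brothers". '
          'Do you have any recommendations of other shows I might like?')
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```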

With kobold.cpp running llama-2-7b-chat.Q5_K_M.gguf, the same prompt takes literally seconds. But I have found no way to load those quantized models in PyTorch under Windows, where AutoGPTQ doesn't work for me. Also, is PyTorch just a lot slower than kobold.cpp?
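
For reference, this is the kind of thing I was trying with AutoGPTQ before giving up on getting it installed under Windows (the repo id is just an example GPTQ checkpoint, not necessarily the exact one I used):

```python
# What I attempted for a quantized model in PyTorch via AutoGPTQ.
# This is where I'm stuck on Windows; the code itself should be the
# standard AutoGPTQ loading pattern.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # example 4-bit GPTQ repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
)
```

Is there an alternative to this that actually works on Windows, or is the GGUF-plus-kobold.cpp route simply the faster option on this kind of hardware?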