Hello everyone, I am currently trying to set up a small 7B Llama 2 chat model. The unquantized full-precision version runs, but only very slowly in PyTorch with CUDA. I have an RTX 3060 laptop with 16 GB of RAM. The model takes about 5-8 minutes to reply to the example prompt:

I liked “Breaking Bad” and “Band of Brothers”. Do you have any recommendations of other shows I might like?
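
For context, my loading code looks roughly like this (the repo id and generation settings here are illustrative, not my exact script):

```python
# Rough sketch of how I'm loading the unquantized model in PyTorch.
# The repo id is an example; needs `accelerate` installed for device_map.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights for 7B are still ~13-14 GB
    device_map="auto",          # spills layers to CPU RAM when VRAM runs out
)

prompt = ('I liked "Breaking Bad" and "Band of Brothers". '
          'Do you have any recommendations of other shows I might like?')
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```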

With kobold.cpp running llama-2-7b-chat.Q5_K_M.gguf, the same prompt takes literally seconds. But I have found no way to load those quantized models in PyTorch under Windows, where AutoGPTQ doesn't work for me. Also, is PyTorch just a lot slower than kobold.cpp?
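
For reference, this is the kind of thing I was trying with AutoGPTQ before giving up on getting it installed under Windows (the repo id is just an example GPTQ checkpoint, not necessarily the exact one I used):

```python
# What I attempted for a quantized model in PyTorch via AutoGPTQ.
# This is where I'm stuck on Windows; the code itself should be the
# standard AutoGPTQ loading pattern.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # example 4-bit GPTQ repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
)
```

Is there an alternative to this that actually works on Windows, or is the GGUF-plus-kobold.cpp route simply the faster option on this kind of hardware?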