The question is: is that still faster than system memory or not?
It's a 3060 laptop, so only 6 GB, and the model plus embeddings etc. is at like 5.8 GB.
I know that CUDA is used, VRAM is full, and I get the message at the beginning. What is your hardware setup?
Do you also use llama_index and then langchain, or did you build it more or less from llama_cpp and langchain without llama_index?
Okay, it's working now. I needed to install nvcc separately and change the CUDA_HOME environment variable. Also, to install nvcc I had to get the symlinks working manually, but with 15 minutes of Google searching I got it to work :D Thank you all :)
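In case it helps anyone hitting the same thing, here is a rough Python sketch of what I checked and how the rebuild looked. The paths and the exact CMAKE_ARGS flag name are assumptions (the flag has changed between llama-cpp-python versions), so adjust for your setup:

```python
import os, shutil, subprocess, sys

# What I checked before the rebuild (values are just what a typical Linux install looks like):
print("CUDA_HOME:", os.environ.get("CUDA_HOME"))  # should point at the CUDA toolkit root
print("nvcc:", shutil.which("nvcc"))              # None means nvcc still isn't on PATH

# Rebuild llama-cpp-python against CUDA. LLAMA_CUBLAS is the flag on older versions;
# check the docs for the version you actually have.
env = dict(os.environ, CMAKE_ARGS="-DLLAMA_CUBLAS=on", FORCE_CMAKE="1")
subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "--force-reinstall", "--no-cache-dir", "llama-cpp-python"],
    env=env,
    check=True,
)
```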
Yup, that was part of it :) It's working now, thank you.
I'm looking to do something similar. Using RAG pipelines might be useful, as far as I understand, to give the model extra context about the sites you want to summarize.
https://agi-sphere.com/retrieval-augmented-generation-llama2/
Maybe you already know all this, but I'm also new and just recently stumbled upon it :)
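Not sure how you have things set up, but a minimal sketch of that kind of pipeline in plain langchain (no llama_index) could look roughly like this; the model paths, chunk sizes, input file and embedding model are just placeholders:

```python
from langchain.llms import LlamaCpp
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# Split the page text into chunks so the retriever can pull only the relevant parts.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(open("page.txt").read())  # placeholder input file

# Embed the chunks and build a local vector store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_texts(chunks, embeddings)

# Local Llama 2 via llama.cpp; n_gpu_layers depends on how much VRAM you have.
llm = LlamaCpp(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048, n_gpu_layers=32)

# Retrieval + generation: the retrieved chunks get stuffed into the prompt as extra context.
qa = RetrievalQA.from_chain_type(llm=llm, retriever=store.as_retriever())
print(qa.run("Summarize the main points of this page."))
```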
I did this :) I should have specified, but when reinstalling I set both flags as env variables again.
Okay, thank you guys. So this only really makes sense if I want to run different models on the different GPUs, or if I have something so big that I need the 48 GB of VRAM and can deal with the slower speeds :) Thanks for the feedback.
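For the second case (one big model split across two cards), llama-cpp-python exposes a tensor_split parameter for this as far as I can tell; a rough sketch, where the model path and the 50/50 split are just placeholders to tune for your cards:

```python
from llama_cpp import Llama

# Load one large model split across two GPUs.
# tensor_split gives the proportion of the weights placed on each device.
llm = Llama(
    model_path="llama-2-70b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,          # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # roughly half the weights on each card
    n_ctx=2048,
)

print(llm("Q: Why split a model across GPUs? A:", max_tokens=64)["choices"][0]["text"])
```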