Hi all, I am new to LLMs, especially to running them locally. I have done some basics to learn the ropes, like building RAG with LangChain on Colab and locally on my CPU machine using quantised models from TheBloke. Now I want to move on to development and production work for some potential clients. I will have lots of questions along the way, but I will start with learning about GPUs.
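For context, here is roughly what my current CPU-only setup looks like. This is just a minimal sketch: the model file is whatever GGUF quant I grabbed from TheBloke, the path is a placeholder, and the import location may differ depending on the LangChain version (older versions use `from langchain.llms import LlamaCpp`):

```python
# Rough sketch of my current CPU-only setup with a quantised GGUF model.
# The model path is a placeholder for whichever TheBloke quant I downloaded.
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # quantised GGUF file
    n_ctx=8192,       # context window I would like to keep
    n_gpu_layers=0,   # 0 = pure CPU; this is the part I want to move onto a GPU
    temperature=0.1,
)

print(llm.invoke("What is retrieval-augmented generation?"))
```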

What is the minimum GPU server required to run a model like Mistral-7B or LLaMA-13B for inference, in order to build a simple RAG application with an 8K context length? Basically, I have no idea what type of GPU someone should look for for different LLM operations, or what to consider when building such LLM apps in production for a small to mid-size company.
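To show where my understanding currently is, here is my rough back-of-envelope attempt at estimating VRAM for Mistral-7B at 8K context. The architecture numbers are from memory and the whole sizing approach is my own assumption, so please correct me if this is the wrong way to think about it:

```python
# My rough back-of-envelope VRAM estimate for Mistral-7B at 8K context.
# Architecture numbers are from memory; please correct anything that is off.

params_b     = 7.2   # ~7.2B parameters
bytes_per_wt = 0.5   # ~4-bit quantisation (e.g. Q4 / GPTQ / AWQ)
weights_gb   = params_b * bytes_per_wt                 # ~3.6 GB for the weights

# KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
layers, kv_heads, head_dim = 32, 8, 128                # Mistral-7B uses GQA (8 KV heads)
kv_per_token = 2 * layers * kv_heads * head_dim * 2    # fp16 cache -> ~131 KB per token
kv_cache_gb  = kv_per_token * 8192 / 1024**3           # ~1 GB at 8K context

print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_cache_gb:.1f} GB "
      f"+ activations/overhead")
```

Based on this I would guess something in the 12-16 GB VRAM range for a quantised 7B at 8K context, but I have no idea if that reasoning holds up in practice, especially for a 13B model or with concurrent users.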

A quick Google search landed me on https://www.gpu-mart.com/gpu-dedicated-server, but I don't have enough background to make sense of that information. I would also appreciate it if someone could point me to an up-to-date guide on choosing a server configuration (GPU plus everything else) for these kinds of LLM apps.