Right now it seems we are once again on the cusp of another round of LLM size upgrades. From what I can tell, 24GB of VRAM gets you access to a lot of really great models, but 48GB really opens the door to the impressive 70B models and lets you run the 30B models comfortably. However, I'm seeing more and more 100B+ models being created that push 48GB setups down into lower quants, if they can run the model at all.
This is, in my opinion, a big deal, because 48GB is currently the magic number for consumer-level cards: 2x 3090s or 2x 4090s. Adding an extra 24GB to a build via consumer GPUs turns into a monumental task due to either space in the tower or the capabilities of the hardware, AND it would only put you at 72GB of VRAM, which is the very edge of the recommended VRAM for the 120B models at Q4_K_M.
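As a rough sanity check of why 48GB vs. 72GB matters, here is a minimal sketch of the usual back-of-the-envelope estimate: weights take roughly parameters × bits-per-weight / 8, plus some overhead. The 10% overhead factor and the ~4.5 bits/weight figure for a Q4_K_M-style quant are assumptions, and this ignores KV cache and context entirely, so real usage is higher.

```python
def model_vram_gb(params_b: float, bits_per_weight: float,
                  overhead: float = 1.1) -> float:
    """Rough weight-only VRAM estimate in decimal GB.

    params_b: parameter count in billions.
    bits_per_weight: effective bits per weight for the quant.
    overhead: fudge factor (~10%) for buffers; ignores KV cache.
    """
    bytes_total = params_b * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9


# ~4.5 bits/weight is a Q4_K_M-ish assumption, not an exact figure.
print(round(model_vram_gb(70, 4.5)))   # ~43 GB -> squeezes into 48GB
print(round(model_vram_gb(120, 4.5)))  # ~74 GB -> needs the 72GB+ tier
```

Plugging in different bit widths shows why the same 120B model can be "runnable" or not depending on how hard you quantize it.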
I genuinely don't know what I'm talking about and I'm just rambling, because I'm trying to wrap my head around HOW to upgrade my VRAM to load the larger models without buying a massively overpriced workstation card. Should I stuff 4x 3090s into a large tower? Set up 3x 4090s in a rig?
How can the average hobbyist make the jump from 48GB to 72GB+?
Is taking a wait-and-see approach toward Nvidia dropping new, scalper-priced high-VRAM cards feasible? Or hoping and praying for some kind of technical magic that drops the required VRAM while keeping quality?
The reason I'm stressing about this and asking for advice is that the quality difference between smaller models and 70B models is astronomical, and the difference between the 70B models and the 100B+ models is a HUGE jump too. From my testing, the 100B+ models really turn the "humanization" of the LLM up to the next level, leaving the 70B models sounding like… well… AI.
I am very curious to see where this gets to by the end of 2024, but one thing is for sure: I won't be seeing it on a 48GB VRAM setup.
This post is an automated archive of a submission made on /r/LocalLLaMA, powered by Fediverser software running on alien.top. Responses to this submission will not be seen by the original author until they claim ownership of their alien.top account.
Or 2x A6000s. But yeah, $$$ matters.
I think it's worth remembering that while the really big models take a lot of VRAM at full precision, they also quantize down to much smaller sizes, so the headline numbers are slightly misleading.
The easiest thing to do is to get a Mac Studio. It also happens to be the best value. 3x 4090s at $1,600 each is $4,800, and that's just for the cards; a machine to put them in costs another few hundred dollars. The cost of the 3x 4090s alone puts you into Mac Ultra 128GB range, and adding the host machine puts you into Mac Ultra 192GB range. With those 3x 4090s you only get 72GB of VRAM; both Mac options give you much more memory.
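The value argument above is easy to sanity-check as dollars per GB of model-capable memory. These are the prices quoted in this thread, used as assumptions, not live market data, and the comparison ignores the Macs' lower memory bandwidth relative to discrete GPUs:

```python
# $-per-GB comparison using the thread's quoted prices (assumptions).
def dollars_per_gb(price_usd: float, memory_gb: float) -> float:
    return price_usd / memory_gb


print(f"3x 4090 (cards only): ${dollars_per_gb(4800, 72):.0f}/GB")
print(f"Mac Ultra 128GB:      ${dollars_per_gb(4800, 128):.1f}/GB")
```

Even before adding a host machine for the GPUs, the Mac comes out well ahead on capacity per dollar; whether that wins overall depends on how much you value raw GPU throughput.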
I'm not getting a super huge jump with the bigger models yet, just a mild bump. I got a P100 to load the low-100B models and have exllama work. That's 64GB of VRAM at FP16.
For bigger models I can use FP32 and put the 2 extra P40s back in. That's 120GB of VRAM. Also 6 vidya cards :P
It required building for this type of system from the start. I’m not made of money either, I just upgrade it over time.
If Nvidia isn't upgrading GPUs past 24GB for the RTX 50 series, then that will probably factor into the open-source community keeping models below ~40B parameters. I don't know the exact cutoff point. A lot of people with 12GB of VRAM can run 13B models, but you could also run a 7B at 8-bit with 16k context. It will get increasingly difficult to run larger contexts with larger models.
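To see why larger contexts get expensive on top of the weights, here is a rough sketch of fp16 KV-cache size. It assumes classic multi-head attention; the layer/hidden numbers below are a Llama-2-13B-style assumption, and models using grouped-query attention need considerably less:

```python
def kv_cache_gb(n_layers: int, hidden_dim: int, context_len: int,
                bytes_per_elem: int = 2) -> float:
    """Rough KV-cache size in decimal GB.

    Per token we store 2 tensors (K and V) per layer, each hidden_dim
    elements, at bytes_per_elem (2 = fp16). Assumes batch size 1 and
    full multi-head attention (no GQA).
    """
    per_token_bytes = 2 * n_layers * hidden_dim * bytes_per_elem
    return per_token_bytes * context_len / 1e9


# Llama-2-13B-ish shape: 40 layers, hidden size 5120 (assumed figures).
print(round(kv_cache_gb(40, 5120, 4096), 1))   # ~3.4 GB at 4k context
print(round(kv_cache_gb(40, 5120, 16384), 1))  # ~13.4 GB at 16k context
```

That extra ~13GB at 16k context is on top of the model weights, which is why a quantized 7B with a huge context can end up needing as much VRAM as a 13B with a short one.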
Some larger open models are being released, but there won't be much of a community able to train the huge models on a bunch of datasets and nail the ideal finetune.
You don't NEED 3090s/4090s. A 3x Tesla P40 setup still streams at reading speed running 120B models.
I have two questions:
What's this going to look like in six months, with new Intel, AMD, and ARM/RISC UMA hybrid designs well supported and 7200MT/s+ DDR5 common?
Are the high-memory models that much better? My impression is that you get a lot of reliable utility out of good smaller models, and from there it's diminishing returns.
I had a honking system with two 3090s, but it felt a bit boondoggle-ish, so I sold it. My current plan is to get something like a 4060 Ti 16GB and also use OpenAI's API, so I can wait to see what develops rather than spending it all now while it's still early days. I can see how someone who is really developing LLMs would want more, but as a "consumer" this seems reasonable.
Even for the "just get a Mac Studio" option, it seems the M3 can use more of its memory as VRAM and is better optimized, so it may be worth waiting until the M3 Ultra comes out, unless you can get a bargain-bin previous model.
Parts-wise, a Threadripper plus an ASUS Pro WS WRX80E-SAGE SE WiFi II is already a $2k price floor.
Each 4090 is $2-2.3k.
Each 3090 is $1-1.5k.
So building a machine from scratch will easily run you $8-10k with 4090s, or $6-8k with 3090s. If you already have some GPUs or parts, you would still probably need 2 or more extra GPUs, plus the space and power to run them.
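A quick tally of the component ranges above (my own price assumptions from this thread, not quotes, and before case, PSU, RAM, and storage):

```python
# Sum the quoted component ranges into a from-scratch build estimate.
def build_range(platform_usd: int, gpu_lo: int, gpu_hi: int,
                n_gpus: int) -> tuple[int, int]:
    """(low, high) total: CPU+motherboard floor plus n GPUs."""
    return (platform_usd + gpu_lo * n_gpus,
            platform_usd + gpu_hi * n_gpus)


print(build_range(2000, 2000, 2300, 3))  # 3x 4090 -> (8000, 8900)
print(build_range(2000, 1000, 1500, 3))  # 3x 3090 -> (5000, 6500)
```

The remaining components and power delivery are what push those floors up toward the $8-10k and $6-8k figures.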
In my specific situation, I would have to grab the Threadripper, mobo, a case, RAM, and 2 more cards, so I'm looking at potentially $5-7k worth of damage. OR… pay $8.6k for a Mac Pro M2 and get an entire extra machine to play with.
There's definitely an entire Mac Pro M3 series on the way considering they just released the laptops; it's only a matter of time before the announcements. So I would definitely feel a bit peeved if I bought the M2 tower only for Apple to release the M3 versions a month or two later.