It’s a new foundational model, so some teething pains are to be expected. Yi is heavily based on (directly copied, for the most part) llama2, but there are just enough differences in the training parameters that default llama2 settings don’t get good results. KCPP has already addressed the rope scaling, and I’m sure it’s only a matter of time before the other issues are hashed out.
70b models will be extremely slow on pure CPU, but you’re welcome to try. There’s no point in looking on “torrent sites” for LLMs - literally everything is hosted on huggingface.
Yes, your GPU is too old to be useful for offloading, but you could still use it for prompt processing acceleration at least.
With your hardware, you want to use koboldCPP. This uses models in GGML/GGUF format. You should have no issue running models up to 120b with that much RAM, but large models will be incredibly slow (like 10+ minutes per response) running on CPU only. Recommend sticking to 13b models unless you’re incredibly patient.
All yi models are extremely picky when it comes to things like prompt format, end string, and rope parameters. You’ll get gibberish from any of them unless you get everything set up just right, at which point they perform very well.
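For concreteness, here's a minimal sketch of what "set up just right" might look like once a Yi chat finetune is loaded in koboldcpp. The ChatML-style prompt markers, the `<|im_end|>` end string, the default port, and the request fields are assumptions based on typical Yi model cards and koboldcpp's KoboldAI-compatible API, so check the card for your exact quant before copying them; rope parameters are set when the server is launched rather than per request, so they're not shown here.

```python
# Sketch only: querying a Yi chat finetune served by koboldcpp through its
# KoboldAI-compatible HTTP API (default port 5001). The ChatML markers, the
# <|im_end|> stop string, and the sampler values are assumptions pulled from
# typical Yi model cards -- verify against the card for your exact model.
import requests

KOBOLD_URL = "http://localhost:5001/api/v1/generate"

def ask_yi(user_message: str) -> str:
    # Yi chat finetunes generally expect a ChatML-style template.
    prompt = (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    payload = {
        "prompt": prompt,
        "max_length": 300,
        "temperature": 0.7,
        # Without the right end string, generation runs past the reply
        # and degenerates into the gibberish described above.
        "stop_sequence": ["<|im_end|>"],
    }
    resp = requests.post(KOBOLD_URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]

if __name__ == "__main__":
    print(ask_yi("Explain rope scaling in one paragraph."))
```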
It’s adorable that you think any 13b model is anywhere close to a 70b llama2 model.
Re: "It is time keep hoarding AI models as Chinese censorship hits NYC based Huggingface the biggest AI library" (Data Hoarder@selfhosted.forum)
Anywhere from 1 to several hundred GB. Quantized (compressed), the most popular models are 8-40 GB each. LoRAs are a lot smaller, but full models take up a lot of space.
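Those size ranges follow straight from parameter count times bits per weight. A back-of-the-envelope sketch (ignoring tokenizer files and format overhead; the ~4.5 bits/weight figure is a rough stand-in for a popular 4-bit quant, not an exact number):

```python
# Back-of-the-envelope only: weight storage = parameters x bits per weight.
# Ignores tokenizer files and format overhead; 4.5 bits/weight is a rough
# stand-in for a typical 4-bit quant, not an exact figure.

def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # decimal GB

print(f"70B fp16:        ~{model_size_gb(70, 16):.0f} GB")   # ~140 GB unquantized
print(f"70B 4-bit quant: ~{model_size_gb(70, 4.5):.0f} GB")  # ~39 GB
print(f"13B 4-bit quant: ~{model_size_gb(13, 4.5):.0f} GB")  # ~7 GB
```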
Re: "Hardware Q's: Best model performance with 75+ 30 series GPU's?" (LocalLLaMA@poweruser.forum)
No idea why you would need ~1800 GB of VRAM.
Homeboy’s waifu is gonna be THICC.
Re: "Tesla P40 cards - what cooling solutions work well?" (LocalLLaMA@poweruser.forum)
Extremely effective and definitely the quietest option, but requires a lot of space: https://www.printables.com/model/484282-nvidia-tesla-p40-120mm-blower-fan-adapter-straight
Re: "Is it worth using a bunch of old GTX 10 series cards (like 1060 1070 1080) for running local LLM?" (LocalLLaMA@poweruser.forum)
The ONLY Pascal card worth bothering with is the P40. It's not fast, but it's the cheapest way to get a whole bunch of usable VRAM. Nothing else from that generation is worth the effort.
Re: "Sam Altman out as CEO of OpenAI. Mira Murati is the new CEO." (LocalLLaMA@poweruser.forum)
And Brockman just quit. Hell of a shakeup over there.
Re: "Microsoft announced the Maia 100 AI Accelerator Chip. It's also expanding the use of the AMD MI300 in it's datacenters. Is this the beginning of the end of CUDA dominance?" (LocalLLaMA@poweruser.forum)
> Is this the beginning of the end of CUDA dominance?
Not unless intel/AMD/MS/whoever ramps up their software API to the level of efficiency and just-works-edness that cuda provides.
I don’t like nvidia/cuda any more than the next guy, but it’s far and away the best thing going right now. If you have an nvidia card, you can get the best possible AI performance from it with basically zero effort on either windows or linux.
Meanwhile, AMD is either unbearably slow with OpenCL, or an arduous slog to get ROCm working (unless you're using specific cards on specific Linux distros). Intel is limited to OpenCL at best.
Until some other manufacturer provides something that can legitimately compete with cuda, cuda ain’t going anywhere.
Re: "🐺🐦⬛ LLM Format Comparison/Benchmark: 70B GGUF vs. EXL2 (and AWQ)" (LocalLLaMA@poweruser.forum)
> GGUF I get like tops 4-5 t/s.
You’re doing something very wrong. I get better speeds than that on P40s with low context. Are you not using cublas?
Re: "Volvo CEO Jim Rowan thinks dropping Apple CarPlay is a mistake" (Cars@gearhead.town)
Nobody under 40 is ever going to buy a new car that doesn't interface with their phone properly. It's no longer a luxury or an optional extra, it's a bare fucking minimum requirement.
Re: "🗺️ Well maintained guide to current state of AI and LLMs, for beginners/non-tech professionals?" (LocalLLaMA@poweruser.forum)
The best noob-accessible explanation of LLMs I've found so far: https://blog.rfox.eu/en/Programming/How_to_run_your_own_LLM_GPT.html
The most entertaining (IMHO) explanation, which is (at best) 60% accurate: https://www.reddit.com/r/LocalLLaMA/comments/12ld62s/the_state_of_llm_ais_as_explained_by_somebody_who/
Re: "Comparing 4060 Ti 16GB + DDR5 6000 vs 3090 24GB: looking for 34B model benchmarks" (LocalLLaMA@poweruser.forum)
The 3090 will outperform the 4060 several times over. It's not even a competition - it's a slaughter.
As soon as you have to offload even a single layer to system memory (regardless of the speed), you cut your performance by an order of magnitude. I don’t care if you have screaming fast DDR5 in 8 channels and a pair of the beefiest xeons money can buy, your performance will fall off a cliff the minute you start offloading. If a 3090 is within your budget, that is the unambiguous answer.
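To make that cliff concrete, here's a toy calculation, not a benchmark: generation walks through every layer in sequence, so the handful of layers left in system RAM dominate the per-token time even when most of the model sits in VRAM. The per-layer timings below are invented purely for illustration, and the model ignores transfer overhead and context effects.

```python
# Illustrative arithmetic only: why spilling layers to system RAM tanks
# generation speed. Per-layer timings are invented for the example,
# not measured values.

def tokens_per_second(total_layers: int, gpu_layers: int,
                      gpu_ms_per_layer: float, cpu_ms_per_layer: float) -> float:
    # Generation is roughly serial across layers, so per-token latency is
    # the sum of the fast (GPU) and slow (CPU) layer times.
    cpu_layers = total_layers - gpu_layers
    ms_per_token = gpu_layers * gpu_ms_per_layer + cpu_layers * cpu_ms_per_layer
    return 1000.0 / ms_per_token

LAYERS = 60  # roughly a 34B-class model

print(tokens_per_second(LAYERS, 60, 0.5, 10.0))  # everything in VRAM:  ~33 t/s
print(tokens_per_second(LAYERS, 48, 0.5, 10.0))  # 12 layers offloaded: ~7 t/s
```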
Re: "SSD shopping - how to find drives with DRAM cache?" (Data Hoarder@selfhosted.forum)
Thanks, but I don't think PCPartPicker is terribly accurate when it comes to cache. Narrowing it down to 2.5" drives with at least 1TB and at least 8MB of cache only returns 13 results: just a bunch of Samsung drives and the Crucial MX500. I know there's more than that out there.
Kinda buried the lede here. This is far and away the biggest feature of this model. Here's hoping it's actually decent as well!