It’s a new foundational model, so some teething pains are to be expected. Yi is heavily based on (directly copied, for the most part) llama2, but there are just enough differences in the training parameters that default llama2 settings don’t get good results. KCPP has already addressed the rope scaling, and I’m sure it’s only a matter of time before the other issues are hashed out.
70b models will be extremely slow on pure CPU, but you’re welcome to try. There’s no point in looking on “torrent sites” for LLMs - literally everything is hosted on huggingface.
Yes, your GPU is too old to be useful for offloading, but you could still use it for prompt processing acceleration at least.
With your hardware, you want to use koboldCPP. This uses models in GGML/GGUF format. You should have no issue running models up to 120b with that much RAM, but large models will be incredibly slow (like 10+ minutes per response) running on CPU only. Recommend sticking to 13b models unless you’re incredibly patient.
All yi models are extremely picky when it comes to things like prompt format, end string, and rope parameters. You’ll get gibberish from any of them unless you get everything set up just right, at which point they perform very well.
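For concreteness, here's a minimal sketch of what "set up just right" might look like once a Yi chat finetune is loaded in koboldcpp. The ChatML-style prompt markers, the `<|im_end|>` end string, the default port, and the request fields are assumptions based on typical Yi model cards and koboldcpp's KoboldAI-compatible API, so check the card for your exact quant before copying them; rope parameters are set when the server is launched rather than per request, so they're not shown here.

```python
# Sketch only: querying a Yi chat finetune served by koboldcpp through its
# KoboldAI-compatible HTTP API (default port 5001). The ChatML markers, the
# <|im_end|> stop string, and the sampler values are assumptions pulled from
# typical Yi model cards -- verify against the card for your exact model.
import requests

KOBOLD_URL = "http://localhost:5001/api/v1/generate"

def ask_yi(user_message: str) -> str:
    # Yi chat finetunes generally expect a ChatML-style template.
    prompt = (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    payload = {
        "prompt": prompt,
        "max_length": 300,
        "temperature": 0.7,
        # Without the right end string, generation runs past the reply
        # and degenerates into the gibberish described above.
        "stop_sequence": ["<|im_end|>"],
    }
    resp = requests.post(KOBOLD_URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]

if __name__ == "__main__":
    print(ask_yi("Explain rope scaling in one paragraph."))
```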
It’s adorable that you think any 13b model is anywhere close to a 70b llama2 model.
Re: "It is time keep hoarding AI models as Chinese censorship hits NYC based Huggingface the biggest AI library" (Data Hoarder@selfhosted.forum)
Anywhere from 1 to several hundred GB. Quantized (compressed), the most popular models are 8-40 GB each. LoRAs are a lot smaller, but full models take up a lot of space.
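Those size ranges follow straight from parameter count times bits per weight. A back-of-the-envelope sketch (ignoring tokenizer files and format overhead; the ~4.5 bits/weight figure is a rough stand-in for a popular 4-bit quant, not an exact number):

```python
# Back-of-the-envelope only: weight storage = parameters x bits per weight.
# Ignores tokenizer files and format overhead; 4.5 bits/weight is a rough
# stand-in for a typical 4-bit quant, not an exact figure.

def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # decimal GB

print(f"70B fp16:        ~{model_size_gb(70, 16):.0f} GB")   # ~140 GB unquantized
print(f"70B 4-bit quant: ~{model_size_gb(70, 4.5):.0f} GB")  # ~39 GB
print(f"13B 4-bit quant: ~{model_size_gb(13, 4.5):.0f} GB")  # ~7 GB
```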
Re: "Hardware Q's: Best model performance with 75+ 30 series GPU's?" (LocalLLaMA@poweruser.forum)
No idea why you would need ~1800 GB of VRAM.
Homeboy’s waifu is gonna be THICC.
Re: "Tesla P40 cards - what cooling solutions work well?" (LocalLLaMA@poweruser.forum)
Extremely effective and definitely the quietest option, but requires a lot of space: https://www.printables.com/model/484282-nvidia-tesla-p40-120mm-blower-fan-adapter-straight
Re: "Is it worth using a bunch of old GTX 10 series cards (like 1060 1070 1080) for running local LLM?" (LocalLLaMA@poweruser.forum)
The ONLY Pascal card worth bothering with is the P40. It's not fast, but it's the cheapest way to get a whole bunch of usable VRAM. Nothing else from that generation is worth the effort.
Re: "Sam Altman out as CEO of OpenAI. Mira Murati is the new CEO." (LocalLLaMA@poweruser.forum)
And Brockman just quit. Hell of a shakeup over there.
Re: "Microsoft announced the Maia 100 AI Accelerator Chip. It's also expanding the use of the AMD MI300 in it's datacenters. Is this the beginning of the end of CUDA dominance?" (LocalLLaMA@poweruser.forum)
> Is this the beginning of the end of CUDA dominance?
Not unless intel/AMD/MS/whoever ramps up their software API to the level of efficiency and just-works-edness that cuda provides.
I don’t like nvidia/cuda any more than the next guy, but it’s far and away the best thing going right now. If you have an nvidia card, you can get the best possible AI performance from it with basically zero effort on either windows or linux.
Meanwhile, AMD is either unbearably slow with OpenCL, or an arduous slog to get ROCm working (unless you're using specific cards on specific Linux distros). Intel is limited to OpenCL at best.
Until some other manufacturer provides something that can legitimately compete with cuda, cuda ain’t going anywhere.
Re: "🐺🐦⬛ LLM Format Comparison/Benchmark: 70B GGUF vs. EXL2 (and AWQ)" (LocalLLaMA@poweruser.forum)
> GGUF I get like tops 4-5 t/s.
You’re doing something very wrong. I get better speeds than that on P40s with low context. Are you not using cublas?
Re: "Volvo CEO Jim Rowan thinks dropping Apple CarPlay is a mistake" (Cars@gearhead.town)
Nobody under 40 is ever going to buy a new car that doesn't interface with their phone properly. It's no longer a luxury or an optional extra, it's a bare fucking minimum requirement.
Re: "🗺️ Well maintained guide to current state of AI and LLMs, for beginners/non-tech professionals?" (LocalLLaMA@poweruser.forum)
The best noob-accessible explanation of LLMs I've found so far: https://blog.rfox.eu/en/Programming/How_to_run_your_own_LLM_GPT.html
The most entertaining (IMHO) explanation, which is (at best) 60% accurate: https://www.reddit.com/r/LocalLLaMA/comments/12ld62s/the_state_of_llm_ais_as_explained_by_somebody_who/
Re: "Comparing 4060 Ti 16GB + DDR5 6000 vs 3090 24GB: looking for 34B model benchmarks" (LocalLLaMA@poweruser.forum)
The 3090 will outperform the 4060 several times over. It's not even a competition - it's a slaughter.
As soon as you have to offload even a single layer to system memory (regardless of the speed), you cut your performance by an order of magnitude. I don’t care if you have screaming fast DDR5 in 8 channels and a pair of the beefiest xeons money can buy, your performance will fall off a cliff the minute you start offloading. If a 3090 is within your budget, that is the unambiguous answer.
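To make that cliff concrete, here's a toy calculation, not a benchmark: generation walks through every layer in sequence, so the handful of layers left in system RAM dominate the per-token time even when most of the model sits in VRAM. The per-layer timings below are invented purely for illustration, and the model ignores transfer overhead and context effects.

```python
# Illustrative arithmetic only: why spilling layers to system RAM tanks
# generation speed. Per-layer timings are invented for the example,
# not measured values.

def tokens_per_second(total_layers: int, gpu_layers: int,
                      gpu_ms_per_layer: float, cpu_ms_per_layer: float) -> float:
    # Generation is roughly serial across layers, so per-token latency is
    # the sum of the fast (GPU) and slow (CPU) layer times.
    cpu_layers = total_layers - gpu_layers
    ms_per_token = gpu_layers * gpu_ms_per_layer + cpu_layers * cpu_ms_per_layer
    return 1000.0 / ms_per_token

LAYERS = 60  # roughly a 34B-class model

print(tokens_per_second(LAYERS, 60, 0.5, 10.0))  # everything in VRAM:  ~33 t/s
print(tokens_per_second(LAYERS, 48, 0.5, 10.0))  # 12 layers offloaded: ~7 t/s
```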
Re: "SSD shopping - how to find drives with DRAM cache?" (Data Hoarder@selfhosted.forum)
Thanks, but I don't think PCPartPicker is terribly accurate when it comes to cache. Narrowing it down to 2.5" drives with at least 1TB and at least 8MB of cache only returns 13 results: just a bunch of Samsung drives and the Crucial MX500. I know there's more than that out there.
Kinda buried the lede here. This is far and away the biggest feature of this model. Here's hoping it's actually decent as well!