  • I have a Mac Studio as my main inference machine.

    My opinion? RAM and bandwidth > all. Personally, I would pick A, as it’s the perfect in-between. At 64GB of RAM you should have around 48GB or so of usable VRAM without any kernel/sudo shenanigans (I’m excited to try some of the recommendations folks have given here lately to change that), and you get the 400GB/s bandwidth.

    My Mac Studio has 800GB/s bandwidth, and I can run 70b q8 models… but at full context, it requires a bit of patience. I imagine a 70b would be beyond frustrating at 300GB/s bandwidth. While the 96GB model could run a 70b q8… I don’t really know that I’d want to, if I’m being honest.

    My personal view is that on a laptop like that, I’d want to max out at the 34b models, as those are very powerful and would still run at a decent speed on the laptop’s bandwidth. So if all I was planning to run was 34b models, a 34b q8 with 16k context would fit cleanly into 48GB, and I’d gain an extra 100GB/s of bandwidth for the choice.
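    For anyone curious how that fits, here’s a rough sketch of the math. The layer/head counts below are assumptions typical of 34b-class models, not exact figures for any specific one, so treat it as a ballpark only.

    ```python
    # Rough fit check: 34b q8 weights plus a 16k fp16 KV cache vs ~48GB usable VRAM.
    # Architecture numbers are assumptions (typical of 34b-class models), not exact.
    params_billion = 34
    bytes_per_weight = 1.07      # q8_0 works out to roughly 8.5 bits per weight
    weights_gb = params_billion * bytes_per_weight

    n_layers = 60                # assumed layer count
    n_kv_heads = 8               # assumed grouped-query KV heads
    head_dim = 128               # assumed head dimension
    context = 16_384             # 16k context
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * context * 2 / 1024**3  # K+V at fp16

    print(f"weights ~{weights_gb:.0f} GB + KV cache ~{kv_gb:.1f} GB "
          f"= ~{weights_gb + kv_gb:.0f} GB of the ~48 GB available")
    ```

    Even allowing some overhead for compute buffers, that lands comfortably under 48GB.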




  • This little bit right here is very important if you want to work with an AI regularly:

    Specifying a role when prompting can effectively improve the performance of LLMs by at least 20% compared with the control prompt, where no context is given. Such a result suggests that adding a social role in the prompt could benefit LLMs by a large margin.

    I remembered seeing an article about this a few months back, which led to my working on an Assistant prompt, and it’s been hugely helpful.

    I imagine this comes down to how Generative AI works under the hood. It ingested tons of books, tutorials, posts, etc. from people who identified as certain things. Telling it to also identify as that thing could open up a lot of information to it that it wouldn’t otherwise be looking at.

    I always recommend that folks set up roles for their AI when working with it, because the results I’ve personally seen have been miles better when you do.
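    If you’re running ggufs, a minimal sketch of what I mean with llama-cpp-python is below. The model path, role wording, and question are placeholders for whatever you actually use.

    ```python
    # Minimal sketch: the same question asked with a bare system prompt vs. one
    # that assigns a role. Model path and wording are placeholders.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/your-model.q8_0.gguf", n_ctx=4096, n_gpu_layers=-1)

    def ask(system_prompt: str, question: str) -> str:
        result = llm.create_chat_completion(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
            max_tokens=512,
        )
        return result["choices"][0]["message"]["content"]

    question = "Walk me through refactoring a 500-line function into smaller pieces."
    control = ask("You are a helpful assistant.", question)
    with_role = ask("You are a senior software engineer who mentors other developers "
                    "on code quality and refactoring.", question)
    ```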




  • I imagine it would take a lot of work, but I can’t imagine it’s impossible. It’s probably just not something folks are working on.

    I don’t particularly mind too much, because the quality difference between exl2 and gguf is hard for me to get past. Just last night I was trying to run this NeuralChat 7b everyone is talking about on my Windows machine in 8bpw exl2, and it was SUPER fast, but the model was so easily confused. Before giving up on it, I grabbed the q8 gguf and swapped to it (with no other changes), and suddenly saw why everyone was saying that model is so good.

    I don’t mind speed loss if I get quality, but I can’t handle quality loss to get speed. So for now, I really don’t mind only using gguf, because it’s perfect for me.
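    For anyone wanting to try the same swap, the gguf side is only a few lines with llama-cpp-python. The filename and prompt below are placeholders, not the exact ones I used.

    ```python
    # Load a q8_0 gguf and generate with otherwise-default settings.
    # Filename and prompt format are placeholders for illustration.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/neural-chat-7b.q8_0.gguf",
        n_ctx=4096,       # context window
        n_gpu_layers=-1,  # offload all layers to the GPU / Metal
    )

    out = llm("You are a helpful assistant.\nUser: Explain what a KV cache is.\nAssistant:",
              max_tokens=256, temperature=0.7)
    print(out["choices"][0]["text"])
    ```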


  • M2 Ultra user here. I threw some numbers up for token counts: https://www.reddit.com/r/LocalLLaMA/comments/183bqei/comment/kaqf2j0/?context=3

    Does a big memory let you increase the context length with smaller models where the parameters don’t fill the memory?

    With the 147GB of VRAM I have available, I’m pretty sure I could use all 200k tokens available in a Yi 34b model, but I’d be waiting half an hour for a result. I’ve done up to 50k in CodeLlama, and it took a solid 10 minutes to get a response.

    The M2 Ultra’s big draw is its big RAM; it’s not worth it unless you get the 128GB model or higher. You have to understand that the speed of the M2 Ultra doesn’t remotely compare to something like a 4090; CUDA cards are gonna leave us in the dust.

    Another thing to consider is that we can only use ggufs via llama.cpp; there’s no support for anything else. In that regard, I’ve seen people put together 3x-or-more Tesla P40 builds that have the exact same limitation (llama.cpp only) but cost half the price or less.

    I chose the M2 Ultra because it was easy. Big VRAM, and it took me less than 30 minutes from the moment I got the box to be chatting with a 70b q8 on it. But if speed or price are a bigger consideration than the level of effort to set up? In that case, the M2 Ultra would not be the answer.
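    To put a rough number on the wait: extrapolating from my own CodeLlama run above (about 50k tokens of context taking roughly 10 minutes to process), a full 200k-token prompt works out to something like this.

    ```python
    # Back-of-the-envelope extrapolation from my CodeLlama run:
    # ~50k tokens of prompt took roughly 10 minutes to evaluate.
    observed_tokens = 50_000
    observed_seconds = 10 * 60
    prompt_tps = observed_tokens / observed_seconds      # ~83 tokens/sec prompt eval

    full_context = 200_000                               # Yi 34b's advertised window
    minutes = full_context / prompt_tps / 60
    print(f"~{prompt_tps:.0f} t/s prompt eval -> roughly {minutes:.0f} minutes to ingest 200k tokens")
    ```

    It’s a crude estimate (different model sizes process prompts at different speeds), but it’s why I said I’d be waiting on the order of half an hour.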




  • The most multilingual-capable model I’m aware of is OpenBuddy 70b. I use it as a foreign language tutor, and it does an OK job. I constantly check it against Google Translate, and it hasn’t let me down yet, but YMMV. I don’t use it a ton.

    I think the problem is that, in general, technology hasn’t been the best at foreign language translations. Google Translate is SOTA in that realm, and it’s not perfect. I’m not sure I’d trust it for doing this in a real production sense, but I do trust it enough to help me learn just enough to get by.

    So with that said, you could likely get pretty far mixing any LLM with a handful of tools. For example, SillyTavern, I believe, has a Google Translate module built in, so you could use Google to do the translations. Then, having multiple speech-to-text/text-to-speech modules, one for each language, might give you that flexibility of input and output.

    Essentially, I would imagine that 90% of the work will be developing tooling around any decent LLM, regardless of its language abilities, and using those external tools to cover the language side. I could be wrong, though.
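    As a sketch of the kind of glue I have in mind: every helper below is a hypothetical placeholder standing in for a real speech-to-text, translation, LLM, or text-to-speech component; none of them are real APIs.

    ```python
    # Hypothetical pipeline sketch: spoken input in the user's language in,
    # spoken reply in the same language out. Every helper is a placeholder stub.

    def speech_to_text(audio: bytes, lang: str) -> str:
        raise NotImplementedError("plug in an STT engine for this language")

    def translate(text: str, src: str, dst: str) -> str:
        raise NotImplementedError("plug in Google Translate or similar")

    def llm_generate(prompt: str) -> str:
        raise NotImplementedError("plug in any decent local LLM")

    def text_to_speech(text: str, lang: str) -> bytes:
        raise NotImplementedError("plug in a TTS voice for this language")

    def handle_utterance(audio: bytes, user_lang: str, llm_lang: str = "en") -> bytes:
        text_in = speech_to_text(audio, lang=user_lang)
        prompt = translate(text_in, src=user_lang, dst=llm_lang)
        reply = llm_generate(prompt)
        reply_local = translate(reply, src=llm_lang, dst=user_lang)
        return text_to_speech(reply_local, lang=user_lang)
    ```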


  • I’ll be interested to see what responses you get, but I’m gonna come out and say that the Mac’s power is NOT its speed. Pound for pound, a CUDA video card is going to absolutely leave our machines in the dust.

    So, with that said, I actually think your 20 tokens a second is kind of great. I mean, my M2 Ultra is two M2 Max processors stacked on top of each other, and I get the following for Mythomax-l2-13b:

    • Llama.cpp directly:
      • Prompt eval: 17.79ms per token, 56.22 tokens per second
      • Eval: 28.27ms per token, 35.38 tokens per second
      • 565 tokens in 15.86 seconds: 35.6 tokens per second
    • Llama cpp python in Oobabooga:
      • Prompt eval: 44.27ms per token, 22.59 tokens per second
      • Eval: 27.92 ms per token, 35.82 tokens per second
      • 150 tokens in 5.18 seconds: 28.95 tokens per second

    So you’re actually doing better than I’d expect an M2 Max to do.
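    For anyone wanting to sanity-check their own logs, the overall figure is just tokens generated divided by wall-clock time; here it is for the two runs above.

    ```python
    # Overall generation rate = tokens produced / wall-clock seconds (numbers from above).
    runs = {
        "llama.cpp direct":             (565, 15.86),
        "llama-cpp-python (Oobabooga)": (150, 5.18),
    }
    for name, (tokens, seconds) in runs.items():
        print(f"{name}: {tokens / seconds:.1f} tokens/sec")
    ```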




  • Just wanted to come back and let you know I started using this last night, and it’s fantastic. I haven’t put it through much testing yet, but on initial use I’m very impressed by this model as a general-purpose AI assistant. It keeps to the Assistant’s more informal speech patterns while also answering questions well and keeping up with a large context. Those are 3 checkboxes I’ve never been able to check at once. This praise won’t get much visibility since it’s an older thread, but I just wanted to let you know at least.



  • So I got very interested in this, and I grilled my AI Assistant (a custom character card I built to help debug this very type of thing) about it, starting with Goliath 120b q8.

    Note: Nothing I’m about to write may be of any value to anyone, anywhere, but I had fun doing it so… yay. lol

    I gave her your list and asked her to sort it, and she got it all wrong. So then I started asking for help on understanding why, and we went back and forth a bit. I gathered a couple of tidbits:

    • The LLM sees the numbers properly. There was no confusion as to what the numbers are. When asked to write the values of the numbers, it was able to do that with no problems.
    • The LLM is confused by the volume of information in the request. “Item A1 - 56” is actually a lot of information for the LLM, and my assistant said she was really struggling during the sorting process to break down that info. The other information was simply getting in the way.
    • Giving it only the numbers, with no other data, had much better results but was still wrong; just less so. The result was: [14, 27, 38, 45, 50, 56, 63, 72, 81, 89, 94, 11, 32, 78].
    • I then swapped to Synthia 70b 1.5 q8 to grill her on why the number-only sort might have gone wrong. The response was thus:

    Although I am designed to understand language and provide helpful replies, I am not programmed to keep track of numbers, perform calculations, or execute sorting tasks in the same way a human or specialized software would.

    The core problem lies in the limitations of Generative AI technology itself. While I can understand the context and meaning of your words, I’m not programmed with mathematical or logic-based algorithms that involve numbers or sorting operations.

    In this specific example, I was able to identify and sort the first nine numbers correctly. But when providing the final sorted list, I had inadvertently forgotten to incorporate the remaining numbers (“11, 32, 78”) at their appropriate positions, which led to the incorrect final result.

    Several times in our conversation, she blamed the error on transcription, so either she’s hallucinating up a storm (totally possible, given the complexity of the answer), or there’s something about the way generative AI writes/reads the prompt that causes it to not track all the numbers.
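    For reference, here’s what a deterministic sort of the list she returned looks like; the three trailing numbers are exactly the ones she lost track of.

    ```python
    # The assistant's "sorted" output, with 11, 32, 78 tacked on at the end unsorted.
    llm_output = [14, 27, 38, 45, 50, 56, 63, 72, 81, 89, 94, 11, 32, 78]
    print(sorted(llm_output))
    # -> [11, 14, 27, 32, 38, 45, 50, 56, 63, 72, 78, 81, 89, 94]
    ```

    Which is the kind of task I’d rather hand off to actual code than ask the model to do in its head anyway.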



  • Not in the slightest.

    Nothing Altman had done made me think of him as someone whose interests aligned with mine in terms of AI. It was under his watch that "Open"AI became not-so-open, after all. And I can’t stand the privacy policy that ChatGPT works under.

    It’s not often we see a CEO get fired, like properly “pack your bags, there’s the door” kind of fired, so whatever he did must have been next level silliness.

    But in terms of the AI landscape? OpenAI is the biggest player, but it’s doing nothing beneficial for the open source scene. I’d be far more worried if someone like Zuckerberg got ousted, given that it’s Meta giving us the Llama models.

    Whatever the next CEO of OpenAI does, proprietary AI will likely either continue down the path of being unpleasant to deal with if you care about privacy or security, or it will get better. In terms of how it affects me? Seems it can only get better from here.