  • I have a Mac Studio as my main inference machine.

    My opinion? RAM and bandwidth > all. Personally, I would pick A, as it’s the perfect in-between. At 64GB of RAM you should have around 48GB or so of usable VRAM without any kernel/sudo shenanigans (I’m excited to try some of the recommendations folks have given here lately to change that), and you get the 400GB/s bandwidth.

    My Mac Studio has 800GB/s bandwidth, and I can run 70b q8 models… but at full context, it requires a bit of patience. I imagine a 70b would be beyond frustrating at 300GB/s bandwidth. While the 96GB model could run a 70b q8… I don’t really know that I’d want to, if I’m being honest.

    My personal view is that on a laptop like that, I’d want to max out at the 34b models, as those are very powerful and would still run at a decent speed on the laptop’s bandwidth. So if all I was planning to run was 34b models, a 34b q8 with 16k context would fit cleanly into 48GB, and I’d gain an extra 100GB/s of bandwidth for the choice.
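    For anyone curious how that fits, here’s a rough sketch of the math. The layer/head counts below are assumptions typical of 34b-class models, not exact figures for any specific one, so treat it as a ballpark only.

    ```python
    # Rough fit check: 34b q8 weights plus a 16k fp16 KV cache vs ~48GB usable VRAM.
    # Architecture numbers are assumptions (typical of 34b-class models), not exact.
    params_billion = 34
    bytes_per_weight = 1.07      # q8_0 works out to roughly 8.5 bits per weight
    weights_gb = params_billion * bytes_per_weight

    n_layers = 60                # assumed layer count
    n_kv_heads = 8               # assumed grouped-query KV heads
    head_dim = 128               # assumed head dimension
    context = 16_384             # 16k context
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * context * 2 / 1024**3  # K+V at fp16

    print(f"weights ~{weights_gb:.0f} GB + KV cache ~{kv_gb:.1f} GB "
          f"= ~{weights_gb + kv_gb:.0f} GB of the ~48 GB available")
    ```

    Even allowing some overhead for compute buffers, that lands comfortably under 48GB.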




  • This little bit right here is very important if you want to work with an AI regularly:

    Specifying a role when prompting can effectively improve the performance of LLMs by at least 20% compared with the control prompt, where no context is given. Such a result suggests that adding a social role in the prompt could benefit LLMs by a large margin.

    I remembered seeing an article about this a few months back, which led to my working on an Assistant prompt, and it’s been hugely helpful.

    I imagine this comes down to how Generative AI works under the hood. It ingested tons of books, tutorials, posts, etc. from people who identified as certain things. Telling it to also identify as that thing could open up a lot of information to it that it wouldn’t otherwise be looking at.

    I always recommend that folks set up roles for their AI when working with it, because the results I’ve personally seen have been miles better when you do.
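    If you’re running ggufs, a minimal sketch of what I mean with llama-cpp-python is below. The model path, role wording, and question are placeholders for whatever you actually use.

    ```python
    # Minimal sketch: the same question asked with a bare system prompt vs. one
    # that assigns a role. Model path and wording are placeholders.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/your-model.q8_0.gguf", n_ctx=4096, n_gpu_layers=-1)

    def ask(system_prompt: str, question: str) -> str:
        result = llm.create_chat_completion(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
            max_tokens=512,
        )
        return result["choices"][0]["message"]["content"]

    question = "Walk me through refactoring a 500-line function into smaller pieces."
    control = ask("You are a helpful assistant.", question)
    with_role = ask("You are a senior software engineer who mentors other developers "
                    "on code quality and refactoring.", question)
    ```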




  • I imagine it would take a lot of work, but I can’t imagine it’s impossible. It’s probably just not something folks are working on.

    I don’t particularly mind too much, because the quality difference between exl2 and gguf is hard for me to get past. Just last night I was trying to run this NeuralChat 7b everyone is talking about on my Windows machine in 8bpw exl2, and it was SUPER fast, but the model was so easily confused. Before giving up on it, I grabbed the q8 gguf and swapped to it (with no other changes), and suddenly saw why everyone was saying that model is so good.

    I don’t mind speed loss if I get quality, but I can’t handle quality loss to get speed. So for now, I really don’t mind only using gguf, because it’s perfect for me.
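    For anyone wanting to try the same swap, the gguf side is only a few lines with llama-cpp-python. The filename and prompt below are placeholders, not the exact ones I used.

    ```python
    # Load a q8_0 gguf and generate with otherwise-default settings.
    # Filename and prompt format are placeholders for illustration.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/neural-chat-7b.q8_0.gguf",
        n_ctx=4096,       # context window
        n_gpu_layers=-1,  # offload all layers to the GPU / Metal
    )

    out = llm("You are a helpful assistant.\nUser: Explain what a KV cache is.\nAssistant:",
              max_tokens=256, temperature=0.7)
    print(out["choices"][0]["text"])
    ```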


  • M2 Ultra user here. I threw some numbers up for token counts: https://www.reddit.com/r/LocalLLaMA/comments/183bqei/comment/kaqf2j0/?context=3

    Does a big memory let you increase the context length with smaller models where the parameters don’t fill the memory?

    With the 147GB of VRAM I have available, I’m pretty sure I could use all 200k tokens available in a Yi 34b model, but I’d be waiting half an hour for a result. I’ve done up to 50k in CodeLlama, and it took a solid 10 minutes to get a response.

    The M2 Ultra’s big draw is its big RAM; it’s not worth it unless you get the 128GB model or higher. You have to understand that the speed of the M2 Ultra doesn’t remotely compare to something like a 4090; CUDA cards are gonna leave us in the dust.

    Another thing to consider is that we can only use ggufs via llama.cpp; there’s no support for anything else. In that regard, I’ve seen people put together 3x-or-more Tesla P40 builds that have the exact same limitation (llama.cpp only) but cost half the price or less.

    I chose the M2 Ultra because it was easy. Big VRAM, and it took me less than 30 minutes from the moment I got the box to be chatting with a 70b q8 on it. But if speed or price are a bigger consideration than the level of effort to set up? In that case, the M2 Ultra would not be the answer.
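    To put a rough number on the wait: extrapolating from my own CodeLlama run above (about 50k tokens of context taking roughly 10 minutes to process), a full 200k-token prompt works out to something like this.

    ```python
    # Back-of-the-envelope extrapolation from my CodeLlama run:
    # ~50k tokens of prompt took roughly 10 minutes to evaluate.
    observed_tokens = 50_000
    observed_seconds = 10 * 60
    prompt_tps = observed_tokens / observed_seconds      # ~83 tokens/sec prompt eval

    full_context = 200_000                               # Yi 34b's advertised window
    minutes = full_context / prompt_tps / 60
    print(f"~{prompt_tps:.0f} t/s prompt eval -> roughly {minutes:.0f} minutes to ingest 200k tokens")
    ```

    It’s a crude estimate (different model sizes process prompts at different speeds), but it’s why I said I’d be waiting on the order of half an hour.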




  • The most multilingual-capable model I’m aware of is OpenBuddy 70b. I use it as a foreign language tutor, and it does an OK job. I constantly check it against Google Translate, and it hasn’t let me down yet, but YMMV. I don’t use it a ton.

    I think the problem is that, in general, technology hasn’t been the best at foreign language translations. Google Translate is SOTA in that realm, and it’s not perfect. I’m not sure I’d trust it for doing this in a real production sense, but I do trust it enough to help me learn just enough to get by.

    So with that said, you could likely get pretty far mixing any LLM with a handful of tools. For example, SillyTavern, I believe, has a Google Translate module built in, so you could use Google to do the translations. Then, having multiple speech-to-text/text-to-speech modules, one for each language, might give you that flexibility of input and output.

    Essentially, I would imagine that 90% of the work will be developing tooling around any decent LLM, regardless of its language abilities, and using those external tools to cover the language side. I could be wrong, though.
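    As a sketch of the kind of glue I have in mind: every helper below is a hypothetical placeholder standing in for a real speech-to-text, translation, LLM, or text-to-speech component; none of them are real APIs.

    ```python
    # Hypothetical pipeline sketch: spoken input in the user's language in,
    # spoken reply in the same language out. Every helper is a placeholder stub.

    def speech_to_text(audio: bytes, lang: str) -> str:
        raise NotImplementedError("plug in an STT engine for this language")

    def translate(text: str, src: str, dst: str) -> str:
        raise NotImplementedError("plug in Google Translate or similar")

    def llm_generate(prompt: str) -> str:
        raise NotImplementedError("plug in any decent local LLM")

    def text_to_speech(text: str, lang: str) -> bytes:
        raise NotImplementedError("plug in a TTS voice for this language")

    def handle_utterance(audio: bytes, user_lang: str, llm_lang: str = "en") -> bytes:
        text_in = speech_to_text(audio, lang=user_lang)
        prompt = translate(text_in, src=user_lang, dst=llm_lang)
        reply = llm_generate(prompt)
        reply_local = translate(reply, src=llm_lang, dst=user_lang)
        return text_to_speech(reply_local, lang=user_lang)
    ```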


  • I’ll be interested to see what responses you get, but I’m gonna come out and say that the Mac’s power is NOT its speed. Pound for pound, a CUDA video card is going to absolutely leave our machines in the dust.

    So, with that said, I actually think your 20 tokens a second is kind of great. I mean, my M2 Ultra is two M2 Max processors stacked on top of each other, and I get the following for Mythomax-l2-13b:

    • Llama.cpp directly:
      • Prompt eval: 17.79ms per token, 56.22 tokens per second
      • Eval: 28.27ms per token, 35.38 tokens per second
      • 565 tokens in 15.86 seconds: 35.6 tokens per second
    • Llama cpp python in Oobabooga:
      • Prompt eval: 44.27ms per token, 22.59 tokens per second
      • Eval: 27.92 ms per token, 35.82 tokens per second
      • 150 tokens in 5.18 seconds: 28.95 tokens per second

    So you’re actually doing better than I’d expect an M2 Max to do.
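    For anyone wanting to sanity-check their own logs, the overall figure is just tokens generated divided by wall-clock time; here it is for the two runs above.

    ```python
    # Overall generation rate = tokens produced / wall-clock seconds (numbers from above).
    runs = {
        "llama.cpp direct":             (565, 15.86),
        "llama-cpp-python (Oobabooga)": (150, 5.18),
    }
    for name, (tokens, seconds) in runs.items():
        print(f"{name}: {tokens / seconds:.1f} tokens/sec")
    ```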




  • Just wanted to come back and let you know I started using this last night, and it’s fantastic. I haven’t put it through much testing yet, but on initial use I’m very impressed by this model as a general-purpose AI assistant. It keeps to the Assistant’s more informal speech patterns while also answering questions well and keeping up with a large context. Those are 3 checkboxes I’ve never been able to check at once. This praise won’t get much visibility since it’s an older thread, but I just wanted to let you know at least.



  • So I got very interested in this, and I grilled my AI Assistant (a custom character card I built to help debug this very type of thing) about it, starting with Goliath 120b q8.

    Note: Nothing I’m about to write may be of any value to anyone, anywhere, but I had fun doing it so… yay. lol

    I gave her your list and asked her to sort it, and she got it all wrong. So then I started asking for help on understanding why, and we went back and forth a bit. I gathered a couple of tidbits:

    • The LLM sees the numbers properly. There was no confusion as to what the numbers are. When asked to write the values of the numbers, it was able to do that with no problems.
    • The LLM is confused by the volume of information in the request. “Item A1 - 56” is actually a lot of information for the LLM, and my assistant said she was really struggling during the sorting process to break down that info. The other information was simply getting in the way.
    • Giving it only the numbers, with no other data, had much better results but was still wrong; just less so. The result was: [14, 27, 38, 45, 50, 56, 63, 72, 81, 89, 94, 11, 32, 78].
    • I then swapped to Synthia 70b 1.5 q8 to grill her on why the number-only sort might have gone wrong. The response was thus:

    Although I am designed to understand language and provide helpful replies, I am not programmed to keep track of numbers, perform calculations, or execute sorting tasks in the same way a human or specialized software would.

    The core problem lies in the limitations of Generative AI technology itself. While I can understand the context and meaning of your words, I’m not programmed with mathematical or logic-based algorithms that involve numbers or sorting operations.

    In this specific example, I was able to identify and sort the first nine numbers correctly. But when providing the final sorted list, I had inadvertently forgotten to incorporate the remaining numbers (“11, 32, 78”) at their appropriate positions, which led to the incorrect final result.

    Several times in our conversation, she blamed the error on transcription, so either she’s hallucinating up a storm (totally possible, given the complexity of the answer), or there’s something about the way generative AI writes/reads the prompt that causes it to not track all the numbers.
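    For reference, here’s what a deterministic sort of the list she returned looks like; the three trailing numbers are exactly the ones she lost track of.

    ```python
    # The assistant's "sorted" output, with 11, 32, 78 tacked on at the end unsorted.
    llm_output = [14, 27, 38, 45, 50, 56, 63, 72, 81, 89, 94, 11, 32, 78]
    print(sorted(llm_output))
    # -> [11, 14, 27, 32, 38, 45, 50, 56, 63, 72, 78, 81, 89, 94]
    ```

    Which is the kind of task I’d rather hand off to actual code than ask the model to do in its head anyway.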



  • Not in the slightest.

    Nothing Altman had done made me think of him as someone whose interests aligned with mine in terms of AI. It was under his watch that "Open"AI became not-so-open, after all. And I can’t stand the privacy policy that ChatGPT works under.

    It’s not often we see a CEO get fired, like properly “pack your bags, there’s the door” kind of fired, so whatever he did must have been next level silliness.

    But in terms of the AI landscape? OpenAI is the biggest player, but it’s doing nothing beneficial for the open source scene. I’d be far more worried if someone like Zuckerberg got ousted, given that it’s Meta giving us the Llama models.

    Whatever the next CEO of OpenAI does, proprietary AI will likely either continue down the path of being unpleasant to deal with if you care about privacy or security, or it will get better. In terms of how it affects me? Seems it can only get better from here.