• 8 Posts
  • 62 Comments
Joined 11 months ago
Cake day: October 30th, 2023


  • Yes, that M1 Max should run LLMs really well, including 70B with decent context. An M2 won’t be much better. An M3, other than the 400GB/s model, won’t be as good, since every M3 variant but the 400GB/s one has had its memory bandwidth cut relative to the M1/M2 models.

    Are you seeing that $2400 at B&H? It was $200 cheaper there a couple of weeks ago. It might be worth it to see if the price goes back down.
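Why memory bandwidth matters so much here: LLM token generation is typically bandwidth-bound, since roughly the whole model must be streamed from RAM for each token. A rough, hedged back-of-envelope sketch (the 40 GB figure for a 4-bit 70B model and the estimator itself are illustrative assumptions, not measurements):

```python
def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound decode speed when generation is memory-bandwidth-bound:
    every token requires reading (approximately) all the weights once."""
    return bandwidth_gb_s / model_size_gb

# Hypothetical numbers: 400 GB/s (M1 Max) vs. a 4-bit 70B model (~40 GB).
ceiling = est_tokens_per_sec(400, 40)  # ~10 tokens/s theoretical ceiling
```

Real throughput lands below this ceiling, but it shows why cutting bandwidth on the lower M3 models cuts LLM speed almost proportionally.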


  • Yes. I’ve done that before on my other machines. Llama.cpp in fact defaults to that. My hope was that, since the models are sparse, the OS would cache the relevant parts of them in RAM. So the first run through would be slow, but subsequent runs would be fast since those pages are already cached in RAM. How well that works really depends on how much RAM the OS is willing to devote to caching mmap’d pages and how smartly it does it. My hope was that if it handled the sparse access pattern smartly, it would be pretty fast. So far, that hope hasn’t been realized.
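The mmap behavior described above can be sketched in Python. Nothing is read from disk until a page is actually touched, and touched pages stay in the OS page cache; `model.bin` here is a tiny hypothetical stand-in for a real multi-GB weights file, but the mechanics are the same:

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for a model file (real GGUF weights are many GB).
path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of fake "weights"

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The first access faults the page in from disk; the OS keeps it in
    # the page cache, so a second access to the same offset is RAM-speed.
    first = mm[0]
    second = mm[0]  # served from the page cache, no disk I/O
    assert first == second
    mm.close()
```

Whether subsequent full runs stay fast comes down to whether the kernel keeps those pages cached under memory pressure, which is exactly the uncertainty described above.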

  • I just don’t log in using the GUI. There indeed doesn’t seem to be a way to turn it off like in Linux, so it still uses tens of MB waiting for you to log in. But that’s a far cry from the hundreds of MB if you do log in. I have thought about killing those login processes, but the Mac is so GUI-centric that if something really goes wrong and I can’t ssh in, I want the GUI as a backup. I think a few tens of MB is worth it for that instead of trying to fix things in the terminal in recovery mode.
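One way to eyeball how much RAM the idle login screen is holding, a rough sketch assuming `loginwindow` and `WindowServer` are the processes you care about (other GUI daemons add a bit more):

```shell
# Sum the resident set size (RSS, in KB on macOS) of the login-screen
# GUI processes and print it in MB.
ps -axo rss,comm \
  | awk '/loginwindow|WindowServer/ { kb += $1 } END { printf "%.0f MB\n", kb/1024 }'
```

Running this while sitting at the login prompt versus after a GUI login makes the "tens of MB vs. hundreds of MB" difference easy to verify on your own machine.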