I understand that more memory means you can run a model with more parameters or less compression, but how does context size factor in? I believe it’s possible to increase the context size, and that this will increase the initial prompt processing before the model starts outputting tokens, but does anyone have numbers?
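My rough mental model is that time-to-first-token is about prompt length divided by prompt-processing speed. A toy sketch with made-up numbers (both figures below are assumptions that vary a lot by hardware and model):

    # Hypothetical figures, just to show how the estimate works.
    prompt_tokens = 32_000   # a long manuscript's worth of context (assumed)
    prefill_tps = 500        # assumed prompt-processing speed in tokens/s
    print(f"time to first token ~= {prompt_tokens / prefill_tps:.0f} s")  # ~64 s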

Is the memory needed for context independent of the model size, or does a bigger model mean that each bit of extra context ‘costs’ more memory?
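To make that concrete, here’s my rough understanding of the accounting (a back-of-the-envelope sketch, assuming a standard transformer with an fp16 KV cache; the layer/head numbers are illustrative, not exact specs for any model):

    # Rough KV-cache sizing: the cache stores K and V per layer for every
    # token in context, so the per-token cost scales with model depth/width.
    def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                       context_len: int, bytes_per_elem: int = 2) -> int:
        # 2x for the K and V tensors; bytes_per_elem=2 assumes fp16.
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

    # Illustrative ~70B-class model with grouped-query attention (8 KV heads):
    big = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, context_len=4096)
    # Illustrative ~7B-class model with full multi-head attention (32 KV heads):
    small = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096)
    print(f"~70B-ish: {big / 2**30:.2f} GiB, ~7B-ish: {small / 2**30:.2f} GiB")
    # -> 1.25 GiB vs 2.00 GiB at 4k context

If that’s right, deeper/wider models do pay more per token of context, but the attention layout (grouped-query vs. full multi-head) matters as much as raw parameter count, which is why I’m asking whether anyone has real measurements.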

I’m considering an M2 Ultra for the large memory and low energy per token, although its speed is behind RTX cards. Is this the best option for tasks like writing novels, where quality and comprehension of lots of text beat speed?

  • a_beautiful_rhindB · 10 months ago

    The P40 is getting long in the tooth, but nothing beats the price. I keep looking at what else I could buy that gives decent performance and realize it’s either that or a 3090. I really wish it were faster, or that there were an exllama build that could push it to 12-13 t/s.

    Intel + Nvidia won’t cooperate, so you’ll have to maintain different environments. While that’s great for running encapsulated things like TTS, it sorta sucks for training or trying bigger models, and it has kept me from buying the cheap MI25. Otherwise the extra cards mainly sit and eat idle watts for nothing, as I found out with my spare P40. 3x24 GB covers most LLMs; the 4th card gets used for SD/TTS/whatever and the 5th card stays fallow.
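    By “encapsulated” I mean each service only sees the card you hand it. A minimal sketch of the Nvidia side (the server scripts here are hypothetical stand-ins; AMD cards want HIP_VISIBLE_DEVICES and a whole separate ROCm stack, which is exactly the cooperation problem):

        import os
        import subprocess

        def launch_on_gpu(cmd: list[str], gpu: str) -> subprocess.Popen:
            # The child process can only see the listed device index.
            env = os.environ.copy()
            env["CUDA_VISIBLE_DEVICES"] = gpu
            return subprocess.Popen(cmd, env=env)

        # Cards 0-2 stay free for the LLM; sidecars get the spares.
        tts = launch_on_gpu(["python", "tts_server.py"], gpu="3")  # hypothetical
        sd = launch_on_gpu(["python", "sd_server.py"], gpu="4")    # hypothetical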

    With a workstation like a P920, you are really only gaining RAM capacity. The point of those big Supermicros is that you can fit more than two cards at full speed.

    If you are just going to spend on an A6000 or RTX 8000, then almost anything that can take at least 128 GB of RAM will be enough. With $6k I would be more inclined to cobble together an Epyc board in a mining case, since then I’d have a single CPU, all the RAM I want, and at least 4 x16 slots.