  • My understanding is that with LoRA you reduce the number of trainable parameters, and therefore the memory needed to track optimizer states (e.g. Adam, which tracks two state values for each trainable parameter). This means you need far less RAM to fine-tune the model. Imagine 70B parameters * 4 bytes for fp32 weights, plus 70B * 8 bytes for Adam's states. LoRA reduces that second part to, say, 1% of 70B * 8 bytes.
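
    A quick back-of-envelope version of that arithmetic as a sketch (the 1% trainable fraction is an illustrative assumption, not a measured value):

    ```python
    # Rough memory estimate for the numbers above; LORA_FRACTION is
    # an assumed share of trainable parameters, not a measured one.
    FP32_BYTES = 4          # bytes per fp32 weight
    ADAM_STATE_BYTES = 8    # two fp32 moments (m and v) per trainable param
    N_PARAMS = 70e9         # 70B-parameter model
    LORA_FRACTION = 0.01    # hypothetical LoRA trainable share (~1%)

    def gib(n_bytes):
        return n_bytes / 2**30

    weights = N_PARAMS * FP32_BYTES
    full_adam = N_PARAMS * ADAM_STATE_BYTES
    lora_adam = N_PARAMS * LORA_FRACTION * ADAM_STATE_BYTES

    print(f"fp32 weights:       {gib(weights):7.1f} GiB")
    print(f"Adam states, full:  {gib(full_adam):7.1f} GiB")
    print(f"Adam states, LoRA:  {gib(lora_adam):7.1f} GiB")
    ```

    The weights still have to be held either way; it's the optimizer-state term that LoRA shrinks by orders of magnitude.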

    You can also use gradient checkpointing, which isn’t specific to LoRA, to reduce memory consumption at the expense of training time. Here you store only a subset of intermediate activations (the checkpoints) during the forward pass and recompute the rest on the fly during back-prop. A minimal sketch follows below.
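
    In PyTorch it looks roughly like this; torch.utils.checkpoint is the real API, while the toy MLP and its sizes are made up for illustration:

    ```python
    # Minimal gradient-checkpointing sketch; model and shapes are hypothetical.
    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedMLP(nn.Module):
        def __init__(self, dim=1024, n_blocks=8):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.GELU())
                for _ in range(n_blocks)
            )

        def forward(self, x):
            for block in self.blocks:
                # Instead of keeping this block's activations around for
                # backprop, store only its input and recompute the
                # activations when the backward pass reaches this block.
                x = checkpoint(block, x, use_reentrant=False)
            return x

    model = CheckpointedMLP()
    x = torch.randn(4, 1024, requires_grad=True)
    model(x).sum().backward()  # activations recomputed block by block
    ```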

    Can you explain what you mean by “caching intermediate gradients during backprop”? I’m not familiar with that.