My understanding is that with LoRA you reduce the number of trainable parameters and therefore the memory needed to track optimizer states (e.g. Adam keeps 2 state values for every trainable parameter). This means you need far less memory to fine-tune the model. Imagine 70B parameters * 4 bytes for the fp32 weights plus 70B * 8 bytes for Adam state. LoRA shrinks that second part to, say, 1% of 70B * 8 bytes.
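Rough arithmetic, if it helps. The numbers here (70B params, fp32 weights, Adam's two fp32 states, LoRA adapters at ~1% of the parameter count) are just the illustrative assumptions from above, not measurements of any particular model:

```python
# Back-of-envelope optimizer-state memory: full fine-tune vs. LoRA.
PARAMS = 70e9          # model parameters (illustrative)
WEIGHT_BYTES = 4       # fp32 weights
ADAM_BYTES = 8         # Adam: exp_avg + exp_avg_sq, fp32 each
LORA_FRACTION = 0.01   # assume LoRA adapters are ~1% of the parameter count

weights_gb = PARAMS * WEIGHT_BYTES / 1e9
full_adam_gb = PARAMS * ADAM_BYTES / 1e9
lora_adam_gb = PARAMS * LORA_FRACTION * ADAM_BYTES / 1e9

print(f"weights:              {weights_gb:,.1f} GB")   # ~280 GB
print(f"Adam state (full FT): {full_adam_gb:,.1f} GB") # ~560 GB
print(f"Adam state (LoRA):    {lora_adam_gb:,.1f} GB") # ~5.6 GB
```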
You can also use gradient checkpointing, which isn’t specific to LoRA, to reduce memory at the expense of training time: you keep (checkpoint) only some intermediate activations during the forward pass and recompute the rest during back-prop.
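Here’s a minimal PyTorch sketch using torch.utils.checkpoint, in case it helps make that concrete. The toy model and sizes are made up; the point is that the checkpointed blocks don’t store their inner activations and instead recompute them during the backward pass:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """A toy residual MLP block standing in for a transformer layer."""
    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)

blocks = nn.ModuleList(Block() for _ in range(8))
x = torch.randn(4, 1024, requires_grad=True)

h = x
for block in blocks:
    # Only the block's inputs/outputs are kept; the activations inside the
    # block are recomputed during backward, trading compute for memory.
    h = checkpoint(block, h, use_reentrant=False)

h.sum().backward()
```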
Can you explain what you mean by “checkpoint only some intermediate activations”? I’m not familiar with what that is.
Oh, don’t get me wrong, the dominant sentiment on r/singularity is not for me and I am no fan of the reverence certain public figures get from members of that community. I was going for polite understatement with my comment, but perhaps failed 😅
What’s wrong with r/singularity? Folks over there are optimistic, perhaps a little too eager. In fact, most opinions that aren’t optimistic get downvoted pretty quickly.
I’m threatening to quit too. I don’t work at OpenAI, but I’ll quit my job and happily accept Microsoft’s offer in solidarity.
Largely unrelated, but this has a similar vibe. I wonder what happened to that high school kid who invented the transformer even before Vaswani et al., and then, a year later, the other guy who claimed to have invented a brand-new neural network architecture that was supposed to break the internet.
Ah, I hadn’t thought of that. I’ll look into it. Thank you for the suggestion!