  • My understanding is that with LoRA you reduce the number of trainable parameters, and therefore the memory needed to track optimizer states (e.g. Adam, which tracks two state values for each trainable parameter). This means you need far less RAM to fine-tune the model. Imagine 70B parameters * 4 bytes for fp32 weights, plus 70B * 8 bytes for Adam's states. LoRA reduces that second part to, say, 1% of 70B * 8 bytes.
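
    A quick back-of-envelope version of that arithmetic as a sketch (the 1% trainable fraction is an illustrative assumption, not a measured value):

    ```python
    # Rough memory estimate for the numbers above; LORA_FRACTION is
    # an assumed share of trainable parameters, not a measured one.
    FP32_BYTES = 4          # bytes per fp32 weight
    ADAM_STATE_BYTES = 8    # two fp32 moments (m and v) per trainable param
    N_PARAMS = 70e9         # 70B-parameter model
    LORA_FRACTION = 0.01    # hypothetical LoRA trainable share (~1%)

    def gib(n_bytes):
        return n_bytes / 2**30

    weights = N_PARAMS * FP32_BYTES
    full_adam = N_PARAMS * ADAM_STATE_BYTES
    lora_adam = N_PARAMS * LORA_FRACTION * ADAM_STATE_BYTES

    print(f"fp32 weights:       {gib(weights):7.1f} GiB")
    print(f"Adam states, full:  {gib(full_adam):7.1f} GiB")
    print(f"Adam states, LoRA:  {gib(lora_adam):7.1f} GiB")
    ```

    The weights still have to be held either way; it's the optimizer-state term that LoRA shrinks by orders of magnitude.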

    You can also use gradient checkpointing, which isn’t specific to LoRA, to reduce memory consumption at the expense of training time. Here you store only a subset of intermediate activations (the checkpoints) during the forward pass and recompute the rest on the fly during back-prop. A minimal sketch follows below.
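
    In PyTorch it looks roughly like this; torch.utils.checkpoint is the real API, while the toy MLP and its sizes are made up for illustration:

    ```python
    # Minimal gradient-checkpointing sketch; model and shapes are hypothetical.
    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedMLP(nn.Module):
        def __init__(self, dim=1024, n_blocks=8):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.GELU())
                for _ in range(n_blocks)
            )

        def forward(self, x):
            for block in self.blocks:
                # Instead of keeping this block's activations around for
                # backprop, store only its input and recompute the
                # activations when the backward pass reaches this block.
                x = checkpoint(block, x, use_reentrant=False)
            return x

    model = CheckpointedMLP()
    x = torch.randn(4, 1024, requires_grad=True)
    model(x).sum().backward()  # activations recomputed block by block
    ```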

    Can you explain what you mean by “caching intermediate gradients during backprop”? I’m not familiar with that.