[D] What is the motivation for parameter-efficient fine tuning if there's no significant reduction in runtime or GPU memory usage?

patricky168 · 1 year ago

[D] What is the motivation for parameter-efficient fine tuning if there's no significant reduction in runtime or GPU memory usage?

lightSpeedBrick · 1 year ago

My understanding is that with LoRA you reduce the number of trainable parameters and therefore the memory needed to track optimizer states (e.g for Adam that tracks 2 state parameters for each model parameter). This means that you need far less RAM to fine-tune the model. Imagine 70B parameters * 4 bytes for fp32 training plus 70B * 8bytes for Adam. Lora reduces that second part to say 1% of 70B * 8 bytes.

You can also use gradient checkpointing, which isn’t specific to LoRA, to reduce memory consumption at the expense of training time. Here you recompute activations during back-prop and cache some intermediate activations.

Can you explain what you mean by “caching intermediate gradients during backprop”? I’m not familiar with what that is.

patricky168 · 1 year ago

Yeah what I mean is that despite LoRA only updating gradients for the adapters on the attention weights, we still need to calculate gradients for downstream layers that aren’t being updated and that takes GPU memory. So the only memory saved is from the optimizer states if I am not mistaken.