Fine-tune as in gradient updates, or as in ICL?
Go back to your history: Cauchy is the earliest person I’m aware of to have used gradient descent, and he motivated it as follows:
“one ordinarily starts by reducing them to a single one by successive eliminations, to eventually solve for good the resulting equation, if possible. But it is important to observe that 1° in many cases, the elimination cannot be performed in any way; 2° the resulting equation is usually very complicated, even though the given equations are rather simple”
That is, gradient descent is useful when you have a rough idea of when you are close to the minimum, but you don’t want to go through the hassle of algebra. (Realistically, if you can solve it with gradient descent, you could probably solve it algebraically; we just don’t have the same stupidly-easy-to-implement computational routines for the algebraic route.)
https://www.math.uni-bielefeld.de/documenta/vol-ismp/40_lemarechal-claude.pdf
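To make the point concrete, here’s a minimal sketch on a least-squares toy problem (the problem, names, and step size are all mine, not Cauchy’s): the “algebraic” route solves the normal equations by elimination, while gradient descent just iterates on the gradient and lands at the same minimum.

```python
# Minimize f(x) = ||Ax - b||^2 two ways, per Cauchy's observation.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))  # toy overdetermined system (assumption)
b = rng.normal(size=5)

# Algebraic route: eliminate by solving the normal equations A^T A x = A^T b.
x_alg = np.linalg.solve(A.T @ A, A.T @ b)

# Gradient-descent route: x <- x - lr * grad f(x), with grad f(x) = 2 A^T (Ax - b).
# lr = 0.01 is an illustrative step size, small enough to converge here.
x = np.zeros(2)
lr = 0.01
for _ in range(5000):
    x -= lr * 2 * A.T @ (A @ x - b)

print(np.allclose(x, x_alg, atol=1e-6))  # both routes agree on the minimizer
```

The descent loop never performs an elimination; it only needs gradient evaluations, which is exactly the hassle-avoidance being described.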
Just to be clear, you aren’t doing fine-tuning here as in gradient updates — you are using the base model + ICL?