I was going through a paper called MILAN, a pre-training method for teaching models good visual representations, and one thing that struck me is the large number of epochs used to train these models (see image), even though the goal is for the model to generalize well. So I'm curious why base models, by contrast, are trained for only a small number of epochs.
TIA.