I’m using a Colab notebook with a T4 GPU to have Llama 2 summarize every table in a PDF. It works for the first 10 or so tables, and then I hit the dreaded CUDA out-of-memory error.
Each successive summarization call seems to leave memory allocated on the GPU. Is there a way to free the allocations from the previous call so usage doesn’t keep building up?
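Is the right fix something like dropping references to the per-call tensors and then calling `gc.collect()` and `torch.cuda.empty_cache()` between tables? Here’s a minimal sketch of what I have in mind (`summarize_table`, `model`, and `tokenizer` are placeholders, not my exact code):

```python
import gc
import torch

# Placeholder helper: summarize one table's text with an already-loaded
# Llama 2 model + tokenizer from Hugging Face transformers.
def summarize_table(model, tokenizer, table_text):
    prompt = f"Summarize the following table:\n{table_text}\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # inference_mode prevents autograd from retaining activations,
    # which is one common source of creeping GPU usage
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=256)

    summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Drop the GPU-resident tensors before the next call, then
    # release PyTorch's cached blocks back to the driver
    del inputs, output_ids
    gc.collect()
    torch.cuda.empty_cache()
    return summary
```

My understanding is that `empty_cache()` only returns blocks from PyTorch’s caching allocator, so the bigger win would be making sure no Python references to the previous call’s tensors survive into the next iteration — but I’m not sure if that fully explains the buildup I’m seeing.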
(On a related note, long context doesn’t help much here without flash-attention, since standard attention’s memory use grows quickly with sequence length.)