I’m using a Colab notebook with a T4 to have Llama 2 summarize all the tables in a PDF. It works for the first 10 or so tables, and then I hit the dreaded CUDA out-of-memory error.

It seems like each successive summarization call accumulates memory on the GPU. Is there some way to clear the memory allocated by the previous call so it doesn’t build up?
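To make it concrete, here’s a minimal sketch of the loop I have in mind, assuming a Hugging Face transformers Llama 2 model and tokenizer are already loaded (the function and variable names are placeholders I made up). The explicit del / gc.collect() / torch.cuda.empty_cache() at the end of each iteration is the cleanup I’m asking about:

```python
import gc
import torch

def summarize_tables(model, tokenizer, tables, max_new_tokens=256):
    """Summarize each table's text, freeing GPU memory between calls."""
    summaries = []
    for table_text in tables:
        inputs = tokenizer(table_text, return_tensors="pt").to(model.device)
        with torch.no_grad():  # make sure no autograd buffers are kept around
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        summaries.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
        # Drop the only references to the GPU tensors, then hand the
        # cached blocks back to the allocator so the next call starts clean.
        del inputs, output_ids
        gc.collect()
        torch.cuda.empty_cache()
    return summaries
```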

  • AaaaaaaaaeeeeeB · 10 months ago

    Long context is useless without flash-attention; without it, attention memory blows up roughly like this (enabling it is sketched after the list):

    • +4k = 2 GB
    • +8k = 4 GB
    • +16k = 16 GB
    • +32k = 256 GB
    • +64k = 65536 GB
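    For reference, turning it on in transformers looks roughly like this. A sketch, not gospel: it needs the flash-attn package and a recent transformers version, and FlashAttention-2 only runs on Ampere-or-newer GPUs, so on a T4 (Turing) you’d fall back to attn_implementation="sdpa"; the checkpoint name is just a placeholder:

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # fp16 to fit on a 16 GB card
        device_map="auto",
        # Requires the flash-attn package and an Ampere+ GPU;
        # on a T4 use "sdpa" (PyTorch's memory-efficient attention) instead.
        attn_implementation="flash_attention_2",
    )
    ```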