Hello all,
I have two NVIDIA P40s with 48 GB of VRAM in total and am trying to train a LLaMA-2-based instruction-following model with a base context size of 8,192 (8K). I successfully trained with the default SFTTrainer at 2,048 and 4,096 (2K and 4K) and got really strong results. However, when I switch to 8K I always hit the OOM wall. I load everything in 4-bit, and the initial memory use after loading is less than 2-3 GB per GPU, but the moment training starts it dies. Does anyone have an idea or suggestion here?
I tried double quantization, but it is not compatible with the P40; the same goes for FlashAttention.
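For reference, the quantization setup looks roughly like this (a simplified sketch, not my exact code; the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization; double quant stays off because of the P40 issue,
# and the compute dtype is fp16 since the P40 has no bf16 support
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",  # placeholder, not the actual model
    quantization_config=bnb_config,
    device_map="auto",
)
```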
For my use case I need an 8K context. All my previous tests with 2-4K went really well with strong results, so I am quite confident in my overall training setup.
With FastAPI I managed to run even 34B and 60B models for inference in 4-bit, using a GPU memory split option that limits usage to 18 GB on one GPU and 22 GB on the other (I don't know why, but only this split ran stably). I wonder if something similar could help here.
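The kind of split I mean can be expressed with the `max_memory` argument in transformers (a rough sketch of the idea, not necessarily the exact switch I used for inference):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Cap how much of each GPU the sharded model is allowed to occupy
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-34b-model",  # placeholder
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    max_memory={0: "18GiB", 1: "22GiB"},  # the 18 GB / 22 GB split
)
```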
For the batch size I tried everything from 1 to 64 (the values I used successfully with the smaller context sizes).
Thanks a lot!
Are you doing a full finetune?
Try a LoRA, or better yet LongLoRA, which is specifically optimized for long contexts: https://github.com/huggingface/peft/issues/958
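For a plain LoRA on a 4-bit base model, the setup is roughly this (a minimal sketch; the rank and target modules are just typical values for LLaMA-style models, and `model` is your quantized base model):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Make the 4-bit base model trainable (enables input grads, casts norms, etc.)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical LLaMA attention projections
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trained
```

Note that LongLoRA also adds shifted sparse attention on top of the LoRA layers, which I don't think plain peft gives you out of the box; the issue linked above has the details.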
Hi u/mcmoose1900, thanks a lot for the reply!
As far as I understand, I have already been making use of PEFT and LoRA since starting this endeavour.
See excerpts of the code here (there is a chance that it does not get used as intended, given the sometimes surprising ways Python behaves).
and here
and the parameters
The numbers above are very low because I tried lowering them to mitigate the OOM issue, without success. Normally they would not make sense.
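In essence the training side follows the standard trl/peft pattern; a simplified version, not the exact excerpt above (values and names are placeholders, with `model`, `lora_config` and `dataset` defined as in the earlier sketches):

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # lowered all the way to 1 while chasing the OOM
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,     # trades compute for activation memory
    learning_rate=2e-4,
    fp16=True,                       # P40: fp16 only, no bf16
    optim="paged_adamw_8bit",
    logging_steps=10,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,           # placeholder dataset with a "text" column
    dataset_text_field="text",
    peft_config=lora_config,
    max_seq_length=8192,             # the 8K context that triggers the OOM
)

trainer.train()
```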