Hello all,
I have two NVIDIA P40s with 48GB of VRAM in total and am trying to train a LLaMAv2-based instruction-following model that has a base context size of 8,192 (8K). I trained successfully and got really strong results with the default SFTTrainer at 2,048 and 4,096 tokens (2K and 4K). However, when I switch to 8K I always hit the OOM wall. I set everything to 4-bit, and the initial memory use after loading is less than 2-3GB per GPU, but the moment training starts it dies. Does anyone have an idea or suggestion?
I tried double quant, but it is not compatible with the P40; the same goes for flash attention.
For my use case I need an 8K context. All my previous tests with 2-4K went really well, with strong results, so I am quite confident in my overall training setup.
With fastapi I managed to run even 60B and 34B models for inference using 4-bit and a special GPU split switch that let me limit GPU memory usage to 18GB:22GB (I don't know why, but only this split ran stably). I wonder if something similar could help here.
For batch size I tried everything from 1 to 64 (64 is what I used successfully with smaller context sizes).
Thanks a lot!
Are you doing a full finetune?
Try a LoRA, or better yet LongLoRA, which is specifically optimized for long context: https://github.com/huggingface/peft/issues/958
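As a rough sketch of why 8K is so much harder than 4K when flash attention is unavailable: a plain attention implementation materializes a score matrix of shape batch x heads x seq x seq in every layer, so that buffer grows with the square of the context length. The numbers below assume 32 attention heads and fp16 scores (typical of a 7B LLaMA-style model; the model size is not stated in this thread, so treat it as an assumption), and `attention_scores_bytes` is just an illustrative helper, not a library function.

```python
# Back-of-the-envelope estimate of the attention score buffer for ONE layer.
# Assumptions (not from the thread): 32 heads, fp16 (2 bytes per value),
# and a plain attention implementation that stores the full seq x seq matrix.

def attention_scores_bytes(seq_len, num_heads=32, batch_size=1, bytes_per_value=2):
    """Bytes needed for a batch x heads x seq x seq attention score tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

GIB = 1024 ** 3

for seq_len in (2048, 4096, 8192):
    gib = attention_scores_bytes(seq_len) / GIB
    print(f"{seq_len:>5} tokens: {gib:.2f} GiB per layer")
# Prints:
#  2048 tokens: 0.25 GiB per layer
#  4096 tokens: 1.00 GiB per layer
#  8192 tokens: 4.00 GiB per layer
```

This ignores everything else held for the backward pass, so it is a lower bound on one piece of the picture rather than a full memory model. It does suggest that dropping the batch size to 1 cannot help once a single 8K sequence is already too large.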
Hi u/mcmoose1900, thanks a lot for the reply!
As far as I understand, I have already been using PEFT and LoRA since starting this endeavour.
See the code excerpts below (there is a chance that it is not being used as intended, given the often weird ways Python works).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        load_in_8bit=False,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
and here
    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.2,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
    )
    max_seq_length = MAX_SEQ_LENGTH
    trainer = SFTTrainer(
        model=base_model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=peft_config,
        formatting_func=formatting_func,
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        args=training_args,
    )
and the parameters
    MAX_SEQ_LENGTH = 8192
    LEARNING_RATE = 2e-5
    PER_DEVICE_BATCH_SIZE = 1
    GRADIENT_ACCUMULATION_STEPS = 1
    USE_EVAL = True
    QUANT_BIT_8 = False
    QUANT_BIT_4 = not QUANT_BIT_8
The batch size and gradient accumulation values above are very low because I tried lowering them to mitigate the OOM issue, without success. Normally they would not make sense.
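For what it's worth, a quick count of the LoRA weights for the `r=64` config above suggests the adapter itself is probably not what fills the memory. This is only a sketch under assumptions that are not confirmed anywhere in this thread: a hidden size of 4096, 32 layers, and two adapted projections per layer. `lora_trainable_params` is a hypothetical helper written for this estimate.

```python
# Rough count of trainable LoRA parameters.
# Assumptions (not confirmed in the thread): hidden size 4096, 32 layers,
# and 2 adapted square projection matrices per layer.

def lora_trainable_params(r, d_in, d_out, num_layers, modules_per_layer):
    """Each adapted weight gets two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return num_layers * modules_per_layer * r * (d_in + d_out)

params = lora_trainable_params(r=64, d_in=4096, d_out=4096,
                               num_layers=32, modules_per_layer=2)

print(f"{params:,} trainable parameters")            # 33,554,432 trainable parameters
print(f"{params * 4 / 1024 ** 2:.0f} MiB at 4 bytes each")  # 128 MiB at 4 bytes each
```

If those assumptions are roughly right, the adapter weights come to on the order of a hundred MiB, which points at activations for the 8K sequence as the more likely culprit. Gradient checkpointing (`gradient_checkpointing=True` in `TrainingArguments`) is a common way to trade compute for activation memory and may be worth trying.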