Hello all,

I have two NVIDIA P40s with 48 GB of VRAM in total and am trying to train a LLaMA v2-based instruction-following model whose base context size is 8,192 (8K). I trained successfully, with really strong results, using the default SFTTrainer at 2,048 and 4,096 tokens (2K and 4K). However, when I switch to 8K I always hit the OOM wall. I set everything to 4-bit, and the initial memory use after loading is less than 2-3 GB per GPU, but the moment training starts it dies. Does anyone have an idea or suggestion?

I tried double quantization, but it is not compatible with the P40; the same goes for flash attention.
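To illustrate why 8K can fail where 4K works: without flash attention, a plain attention implementation materializes a score matrix of shape (batch, heads, seq, seq), so that part of the memory grows with the square of the context length. The sketch below is a back-of-the-envelope estimate only. The head count of 32 and the 2-byte (fp16) element size are assumptions, since the post does not state the model size, and the estimate ignores all other activations, gradients, and optimizer state.

```python
GIB = 1024 ** 3


def attention_scores_bytes(seq_len, num_heads=32, batch_size=1, bytes_per_element=2):
    """Bytes needed for one layer's attention score matrix.

    num_heads=32 and bytes_per_element=2 (fp16) are assumed values for
    illustration, not taken from the post.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


for seq_len in (2048, 4096, 8192):
    per_layer_gib = attention_scores_bytes(seq_len) / GIB
    print(f"{seq_len:>5} tokens: {per_layer_gib:.2f} GiB per layer")
```

Under these assumptions, doubling the context from 4K to 8K quadruples this term, and it is paid once per layer that keeps its activations for the backward pass.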

For my use case I need an 8K context. All my previous tests with 2K-4K went really well, with strong results, so I am quite confident in my overall training setup.

With fastapi I even managed to run 60B and 34B models for inference using 4-bit and a special GPU split switch that let me limit GPU memory usage to 18GB:22GB (I don't know why, but only this split ran stably). I wonder if something similar can help here.
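One possible analogue on the training side is the `max_memory` argument that `from_pretrained` accepts together with `device_map="auto"`. A minimal sketch follows; the helper `build_max_memory` is a hypothetical name introduced here, and the 18/22 GiB figures simply mirror the inference split mentioned above. Note that this only caps where the model weights are placed; it does not limit activation memory during training, so it may or may not help with the OOM.

```python
def build_max_memory(gib_per_gpu, cpu_gib=None):
    """Build a per-device memory cap mapping (hypothetical helper).

    Keys are GPU indices (plus an optional "cpu" entry), values are
    strings such as "18GiB".
    """
    max_memory = {index: f"{gib}GiB" for index, gib in enumerate(gib_per_gpu)}
    if cpu_gib is not None:
        max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


max_memory = build_max_memory([18, 22])
print(max_memory)  # {0: '18GiB', 1: '22GiB'}

# Assumed usage, reusing the names from the excerpts in this thread:
# base_model = AutoModelForCausalLM.from_pretrained(
#     base_model_name,
#     quantization_config=bnb_config,
#     device_map="auto",
#     max_memory=max_memory,
# )
```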

For batch size I tried everything from 1 to 64 (64 is what I used successfully with the smaller context sizes).

Thanks a lot!

    • Bright-Question-6485
      1 year ago

      Hi u/mcmoose1900, thanks a lot for the reply!

      As far as I understand, I have been using PEFT and LoRA since starting this endeavour.

      See the code excerpts below (there is a chance that it is not being used as intended, given the often weird ways Python works).

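      import torch
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig

      # 4-bit NF4 quantization for the frozen base weights
      bnb_config = BitsAndBytesConfig(
          load_in_4bit=True,
          load_in_8bit=False,
          bnb_4bit_quant_type="nf4",
          bnb_4bit_use_double_quant=True,
          bnb_4bit_compute_dtype=torch.float16,
      )

      # device_map="auto" spreads the layers across the available GPUs
      base_model = AutoModelForCausalLM.from_pretrained(
          base_model_name,
          quantization_config=bnb_config,
          device_map="auto",
          trust_remote_code=True,
      )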
      

      and here

      from peft import LoraConfig
      from trl import SFTTrainer

      # LoRA adapter settings; only the adapter weights are trained
      peft_config = LoraConfig(
          lora_alpha=16,
          lora_dropout=0.2,
          r=64,
          bias="none",
          task_type="CAUSAL_LM",
      )

      # MAX_SEQ_LENGTH comes from the parameters listed further down
      max_seq_length = MAX_SEQ_LENGTH
      trainer = SFTTrainer(
          model=base_model,
          train_dataset=train_dataset,
          eval_dataset=eval_dataset,
          peft_config=peft_config,
          formatting_func=formatting_func,
          max_seq_length=max_seq_length,
          tokenizer=tokenizer,
          args=training_args,
      )
      

      and the parameters

      MAX_SEQ_LENGTH = 8192
      LEARNING_RATE = 2e-5
      PER_DEVICE_BATCH_SIZE = 1
      GRADIENT_ACCUMULATION_STEPS = 1
      USE_EVAL = True
      QUANT_BIT_8 = False
      QUANT_BIT_4 = not QUANT_BIT_8
      

      The numbers above are very low because I tried lowering them to mitigate the OOM issue, without success. Normally they would not make sense.
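The `training_args` object passed to `SFTTrainer` is not shown in the excerpts. As a rough sketch only, the constants above could map onto `transformers.TrainingArguments` fields as follows. The output directory, the accumulation value of 16, and the use of `gradient_checkpointing` are assumptions added for illustration, not the poster's settings. Gradient checkpointing trades extra compute for lower activation memory, and gradient accumulation keeps a larger effective batch while only one sequence is resident per step.

```python
PER_DEVICE_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 16  # assumed value; the post uses 1
LEARNING_RATE = 2e-5

# Keyword arguments for transformers.TrainingArguments (all are existing fields).
training_kwargs = {
    "output_dir": "./sft-output",  # hypothetical path
    "per_device_train_batch_size": PER_DEVICE_BATCH_SIZE,
    "per_device_eval_batch_size": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
    "learning_rate": LEARNING_RATE,
    "gradient_checkpointing": True,  # assumption: recompute activations in backward
}

# Sequences contributing to each optimizer step in a single training process.
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
print(effective_batch_size)  # 16

# Assumed usage:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**training_kwargs)
```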