Sorry to post hardware problems but I think this is the community most likely to have dealt with similar issues.

Basically, the machine hard-freezes during inference.

  • EVGA X299 FTW-K
  • Intel® Core™ i9-9900X CPU @ 3.50GHz
  • 128GB DDR4 RAM
  • 1x NVIDIA RTX 3090
  • 1x NVIDIA RTX 4090
  • Corsair HX1500i Power Supply
  • two Samsung NVMe flash drives (one for root, one for swap)
  • two 5.5TB HDs (for backups)
  • a Fractal Design case with fans in every available slot
  • water cooling for CPU (Kraken?)
  • various USB peripherals (Topping E30, UMC202HD, Nitrokey, phone, kb, flash etc)

This is a personal workstation used for occasional LLM work with Python CUDA 11 libraries. I run Void Linux with a recent kernel (6.5.9_1) and the nvidia-535 driver.

Often, the machine will freeze under GPU load when both GPUs are being used (i.e. an LLM with layers split across both cards). This has persisted across kernel updates and several CUDA versions. It’s stable under gaming loads (e.g. Far Cry 5/6 on either card). There are no error or kernel messages and nothing is logged to dmesg, just a hard freeze.
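Since nothing reaches dmesg, one thing that might help is recording GPU state right up to the freeze. Below is a minimal sketch, not a tested diagnostic: it polls `nvidia-smi` once a second and fsyncs each sample to disk so the last line written should survive a hard lockup. The log file name, the one-second interval and the choice of fields are my own assumptions; the field names are ones `nvidia-smi --help-query-gpu` lists.

```python
import os
import subprocess
import time

# Query fields understood by `nvidia-smi --query-gpu` (selection is an assumption).
FIELDS = ["timestamp", "index", "temperature.gpu", "power.draw", "utilization.gpu"]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def read_samples():
    """Query every GPU once and return one dict per card."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]


def log_forever(path="gpu_freeze.log", interval=1.0):
    """Append samples and fsync after every round so they survive a hard freeze."""
    with open(path, "a") as f:
        while True:
            for sample in read_samples():
                f.write(",".join(sample[k] for k in FIELDS) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the data out of the page cache
            time.sleep(interval)


if __name__ == "__main__":
    log_forever()
```

Run it in a separate terminal before starting inference; after the reboot, the tail of the log shows the last recorded temperature, power draw and utilization for each card.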

I’ve tried swapping the order of the GPUs and the PCIe slots they use, and removing other cards.

The CPU temperature is generally below 90C under full CPU load, and fully loading the CPU alone doesn’t cause a freeze. I have also stopped overclocking.

The GPU temperatures stay reasonable (60C under load), although the 3090 feels quite hot (they are reported to run a bit warm). Under load the GPUs sit at about 75% utilization, I presume because the bottleneck becomes the PCIe 3.0 bus, which I vaguely suspect may be the root of the instability.
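On the PCIe suspicion, a quick check is to compare the link each card has actually negotiated with the maximum it reports. This is only a sketch under assumptions: the helper names `parse_links` and `below_max` are made up for illustration, and the fields are the `pcie.link.*` ones from `nvidia-smi --help-query-gpu`. Cards commonly drop the link generation at idle to save power, so it is probably only meaningful while the model is running.

```python
import subprocess

# PCIe link fields from `nvidia-smi --query-gpu` (selection is an assumption).
LINK_FIELDS = ["index", "pcie.link.gen.current", "pcie.link.gen.max",
               "pcie.link.width.current", "pcie.link.width.max"]


def parse_links(text):
    """Parse CSV output (one line per GPU) into a list of dicts."""
    links = []
    for line in text.splitlines():
        if not line.strip():
            continue
        values = [v.strip() for v in line.split(",")]
        links.append(dict(zip(LINK_FIELDS, values)))
    return links


def below_max(link):
    """True if the link is currently slower or narrower than its reported maximum."""
    gen_low = int(link["pcie.link.gen.current"]) < int(link["pcie.link.gen.max"])
    width_low = int(link["pcie.link.width.current"]) < int(link["pcie.link.width.max"])
    return gen_low or width_low


if __name__ == "__main__":
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(LINK_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for link in parse_links(out):
        status = "BELOW MAX" if below_max(link) else "ok"
        print(f"GPU {link['index']}: gen {link['pcie.link.gen.current']} "
              f"x{link['pcie.link.width.current']} ({status})")
```

A card that stays below its maximum width or generation under load would at least point toward the slot, riser or lane allocation rather than the software stack.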

I’m out of ideas. If anyone could suggest things to try, I’d be grateful.