I come from computer vision, where convnets are relatively small in size and parameter count yet perform quite well (e.g. the ResNet family, YOLO, etc.).

Now I am moving into NLP, and transformer-based architectures tend to be huge, so I have trouble fitting them in memory.
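
To give a rough idea of the size gap I mean, here is a minimal sketch (assuming the torchvision and Hugging Face transformers packages are installed) that counts parameters and the fp32 weight memory of a ResNet-50 versus bert-base:

```python
from torchvision.models import resnet50
from transformers import AutoModel

def param_count_and_size(model):
    # Count parameters and estimate memory for fp32 weights (4 bytes each).
    n_params = sum(p.numel() for p in model.parameters())
    return n_params, n_params * 4 / 1024**2  # size in MB

for name, model in [
    ("resnet50", resnet50()),
    ("bert-base-uncased", AutoModel.from_pretrained("bert-base-uncased")),
]:
    n, mb = param_count_and_size(model)
    print(f"{name}: {n / 1e6:.1f}M params, ~{mb:.0f} MB of fp32 weights")
```

And that is only the weights; during training, optimizer state and activations take several times more memory on top of that.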

What infrastructure do you use to train these models (GPT-2, BERT, or even the bigger ones)? Cloud computing, HPC, etc.?