What are some good cloud GPU options these days for LLM work, primarily fine-tuning and related experiments? This is for personal work and proof-of-concept projects, and it will be out of pocket, so I’d definitely prefer cheaper options. I’d mostly be using 7-13B models, but later I’d want to test larger models as well.

Most of the providers have on-demand and spot options, with spot obviously being cheaper. I understand that spot instances can go down at any time, but assuming checkpoints are saved regularly and training can resume later, that shouldn’t be a big problem. Are there any gotchas here?
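
For reference, the checkpoint-and-resume pattern I’m assuming could be sketched roughly like this. It is only a minimal illustration using the Python standard library: `save_checkpoint`, `load_checkpoint`, `train`, and the JSON state are hypothetical names, and a real fine-tuning run would save model and optimizer state rather than a step counter. It also assumes the checkpoint path lives on storage that outlives the instance.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so an interruption mid-write cannot corrupt it."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in a single step on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, save_every, stop_at=None):
    """Run a placeholder training loop that resumes from the last checkpoint.

    stop_at is a hypothetical knob that simulates the instance being reclaimed.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulated preemption: exit without saving
        # Placeholder: one real optimizer step would go here.
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

With this layout, an interruption costs at most `save_every` steps of work, so the save interval is a trade-off between lost compute and time spent writing checkpoints.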

The other criterion is a managed/secure environment vs. some kind of open/community environment. Again, the latter options are cheaper, and assuming security is not a major requirement, that seems like the better choice. Any thoughts or advice on this one?

I’m mostly looking at RunPod, Vast.ai, and Replicate based on info from other threads. Are there any other providers folks have had good experiences with?

How do AWS, GCP, and Azure compare to these options? From what I can tell they seem more expensive, but I haven’t looked at them too closely.

Any recommendations with some details on your own experience, use cases, and costs would be greatly appreciated.