Hey r/MachineLearning!
Last year, u/rajatarya showcased how we scaled Git to handle large datasets. One piece of feedback we kept getting is that people didn’t want to move their source code over to XetHub.
So we built a GitHub app & integration that lets you continue storing code in GitHub while XetHub handles the large datasets & models.
https://about.xethub.com/blog/xetdata-scale-github-repos-100-tb
We’ve enjoyed using it to host open source LLMs like Llama 2 and Mistral side-by-side with our finetuning code.
The whole thing is in beta so we’re eager for any feedback you have to offer :)
A few good questions came up about how this compares to other tools:
- DVC: no new commands to learn (we extend Git) and you don’t need S3.
- Git LFS: unlike Git LFS, we inject useful views of your large files into GitHub itself, in commits and PRs (e.g. check this model diff: https://youtu.be/lAyymscJUvI?t=87). We also scale to much larger sizes (100 terabytes), and we deduplicate better: Git LFS treats a one-line change to a large CSV file as an entirely new file, while our technique captures the differences.
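
To make the dedup point a bit more concrete, here is a toy sketch of content-defined chunking, one common way to get sub-file deduplication. To be clear, this is a generic illustration and not XetHub's actual implementation; the function names, the gear-style rolling hash, and the chunk-size settings are all assumptions made up for the example. It compares a whole-file hash (roughly how a pointer-per-file scheme identifies content) against per-chunk hashes after a one-line edit to a CSV.

```python
import hashlib

# Hypothetical per-byte table for a gear-style rolling hash (illustration only).
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
        for b in range(256)]


def whole_file_id(data: bytes) -> str:
    """Identify a file by one hash of all its bytes; any edit changes the ID."""
    return hashlib.sha256(data).hexdigest()


def chunk(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen by the content itself.

    The 32-bit hash shifts left one bit per byte, so it only depends on the
    last 32 bytes. It is deliberately not reset at a cut, which means an edit
    can only move boundaries close to the edit.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h >> 16) & 0x3F == 0:  # cut roughly every 64 bytes on average
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_ids(data: bytes) -> set[str]:
    """Hash each chunk; identical chunks share an ID and are stored once."""
    return {hashlib.sha256(c).hexdigest() for c in chunk(data)}


def make_csv(n_rows: int) -> bytes:
    """Build a small synthetic CSV (made-up data for the demo)."""
    return b"".join(f"{i},{i * 37 % 1000},row-{i}\n".encode()
                    for i in range(n_rows))


if __name__ == "__main__":
    original = make_csv(2000)
    rows = original.splitlines(keepends=True)
    rows[1000] = b"1000,0,edited\n"  # the one-line change
    edited = b"".join(rows)

    print("whole-file IDs match:",
          whole_file_id(original) == whole_file_id(edited))
    new_chunks = chunk_ids(edited) - chunk_ids(original)
    print("chunks in edited file:", len(chunk_ids(edited)))
    print("chunks that are actually new:", len(new_chunks))
```

With a whole-file ID, the edited CSV looks like a brand new object. With chunk IDs, only the chunks touching the edited row are new, so the rest of the file does not need to be stored or transferred again.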