Dataset: https://huggingface.co/datasets/allenai/MADLAD-400

Note that the English subset in this version is missing 18% of the documents that were included in the published analysis of the dataset. These documents will be incorporated in an update coming soon.

arXiv paper: https://arxiv.org/abs/2309.04662

Models: https://github.com/google-research/google-research/tree/master/madlad_400

u/jbochi’s work on getting the models to run: https://www.reddit.com/r/LocalLLaMA/comments/17qt6m4/translate_to_and_from_400_languages_locally_with/

  • APaperADayOPB
    10 months ago

    Credit to u/jbochi for getting the models to run + telling Google to fix their model checkpoints.