I was going to try knowledge distillation, but they modified their tokenizer.
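For context on why the tokenizer change kills the easy route: plain logit distillation assumes the teacher and student share the same vocabulary, so each logit index means the same token in both models. A minimal PyTorch sketch of the idea (shapes and the loss here are the textbook soft-target KL, not anything from the actual checkpoints):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target KL loss; requires identical vocab size and ordering."""
    # This assert is exactly where a modified tokenizer breaks things:
    # mismatched vocabularies -> mismatched logit dims -> no direct KD.
    assert student_logits.shape == teacher_logits.shape, \
        "vocab mismatch: plain logit distillation doesn't apply"
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy example: batch=2, seq=8, shared vocab of 32000 (placeholder sizes).
student = torch.randn(2, 8, 32000, requires_grad=True)
teacher = torch.randn(2, 8, 32000)
loss = distillation_loss(student, teacher)
loss.backward()
```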
Either way, GPT-Neo has a 125M model, so a 248M model is about 2× that. I imagine this could be useful for shorter-context tasks, or as a base to continue training for very narrow use cases.
I came across it while looking for tiny Mistral config JSONs to replicate.
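For anyone else hunting for these, this is roughly what a tiny Mistral config looks like built through transformers' `MistralConfig`. The hyperparameters below are illustrative guesses scaled down from the 7B defaults, not the actual 248M model's settings:

```python
from transformers import MistralConfig, MistralForCausalLM

# Hypothetical "tiny" values, just shrunk from the Mistral-7B defaults.
config = MistralConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_key_value_heads=4,          # grouped-query attention
    max_position_embeddings=2048,   # short context, per the use case above
)
model = MistralForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M params")
```

Instantiating from a config like this gives randomly initialized weights, so it only makes sense as a from-scratch or continued-pretraining starting point.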