I was going to try knowledge distillation, but they modified their tokenizer.
Either way, Neo has a 125M model, so a 248M model is 2× that. I imagine this could be useful for shorter-context tasks. Idk, or to continue training for very tight use cases.
I came across it while looking for tiny Mistral config JSONs to replicate.
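For anyone hunting the same thing, here's a rough sketch of what a "tiny" Mistral-style `config.json` might look like. Every hyperparameter below is an illustrative guess (not the actual 248M model's values); the field names follow the Hugging Face `config.json` schema for `MistralForCausalLM`.

```python
import json

# Hypothetical tiny Mistral-style config — sizes are made-up placeholders,
# not the real model's hyperparameters.
tiny_mistral_config = {
    "architectures": ["MistralForCausalLM"],
    "model_type": "mistral",
    "hidden_size": 512,
    "intermediate_size": 1536,
    "num_hidden_layers": 8,
    "num_attention_heads": 8,
    "num_key_value_heads": 4,   # grouped-query attention (fewer KV heads)
    "max_position_embeddings": 2048,
    "vocab_size": 32000,
    "rms_norm_eps": 1e-5,
    "rope_theta": 10000.0,
    "tie_word_embeddings": False,
}

# Dump it in the same shape you'd find in a model repo's config.json.
print(json.dumps(tiny_mistral_config, indent=2))
```

With a file like this you can instantiate a randomly initialized tiny model via `MistralConfig.from_json_file(...)` and train from scratch, which is the usual route when the tokenizer of the model you wanted to distill from doesn't match.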