Hey, I'm new to this world since ChatGPT came out. I've been thinking about the training process for large language models and have a question for the community about how datasets are structured during the pre-training and fine-tuning phases.

My understanding is that these models are exposed to a diverse, randomized stream of data: the documents are shuffled so the model learns from the content itself rather than the order in which it's presented, and each sample is packed or truncated to a fixed context length. However, I'm curious whether there has been any research or experimentation with structuring these datasets to follow a curriculum learning approach instead.
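For concreteness, here is a minimal sketch of what I mean by the "shuffle and pack to a fixed context length" setup (the `context_len` value and the dropping of the final partial chunk are just simplifying assumptions, not how any particular lab does it):

```python
import random

def pack_sequences(token_seqs, context_len=2048, seed=0):
    """Shuffle tokenized documents and pack them into fixed-length training chunks.

    token_seqs: list of lists of token ids (already-tokenized documents).
    Returns chunks of exactly context_len tokens; the trailing partial
    chunk is dropped here just to keep the example short.
    """
    rng = random.Random(seed)
    seqs = list(token_seqs)
    rng.shuffle(seqs)                      # randomize document order

    stream = []
    for seq in seqs:
        stream.extend(seq)                 # concatenate into one long token stream

    # slice the stream into fixed-size training examples
    return [stream[i:i + context_len]
            for i in range(0, len(stream) - context_len + 1, context_len)]
```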

In traditional education, students progress from simple to complex concepts, building upon what they’ve learned as they advance. Could a similar approach benefit AI training? For instance, starting with simpler language constructs and concepts before gradually introducing more complex and abstract ones?

The idea would be to score the training data by complexity, then batch it so that the model first learns from 'easier' data, with the complexity scaling up as training progresses. Randomization could still happen within each complexity level, so the model isn't just memorizing a fixed sequence. Something like the sketch below.
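A rough sketch of one way this could work, purely as an illustration: `difficulty_fn` is a hypothetical heuristic (e.g. sequence length, average word rarity, or a readability score), and the bucket count and batch size are arbitrary choices, not a claim about how curriculum learning is actually done in practice.

```python
import random

def curriculum_batches(samples, difficulty_fn, batch_size=32, num_buckets=4, seed=0):
    """Order training data from 'easy' to 'hard', shuffling within each bucket.

    samples: list of training examples (e.g. raw texts or token sequences).
    difficulty_fn: heuristic mapping a sample to a difficulty score.
    Yields batches whose average difficulty increases over training.
    """
    rng = random.Random(seed)

    # sort by the difficulty heuristic, then split into roughly equal buckets
    ordered = sorted(samples, key=difficulty_fn)
    bucket_size = max(1, len(ordered) // num_buckets)
    buckets = [ordered[i:i + bucket_size] for i in range(0, len(ordered), bucket_size)]

    for bucket in buckets:                 # easy buckets first, hard buckets last
        rng.shuffle(bucket)                # keep randomness inside each difficulty level
        for i in range(0, len(bucket), batch_size):
            yield bucket[i:i + batch_size]
```

Using sequence length as the stand-in difficulty measure, training would then just iterate `for batch in curriculum_batches(texts, difficulty_fn=len): ...`, so the only change from the usual pipeline is the ordering of the data, not the model or the loss.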

I’m interested in any insights or references to research that has explored this idea. Does curriculum learning improve the efficacy of language models? Could it lead to more nuanced understanding and better performance in complex tasks?