It’s no secret that many language models and fine-tunes are trained on datasets, many of which are generated with GPT models. The problem arises when many “GPT-isms” end up in the dataset. And I am not only referring to the typical expressions like “however, it’s important to…” or “I understand your desire to…”, but also to the structure of the model’s responses. ChatGPT (and GPT models in general) tends to have a very predictable structure when in “soulless assistant” mode, which makes it very easy to say “this is very GPT-like”.
What do you think about this? Oh, and by the way, forgive my English.
I think the GPT-isms may be why my AI storywriting attempts tend to be overly positive and cliched. Not exactly a world-shattering problem, but it is annoying *shakes fist*.
If I had to name a possibly serious problem, it’s that the biases OpenAI initially built into ChatGPT and its GPT models are now spreading to local models as well.
It’s annoying because it feels like all models respond to questions in a similar way. Some are just a bit smarter than others or tuned to respond a bit differently.
If the GPT-like data spreads around the internet as well, then it might be difficult to keep it out of training data unless you only include old data in your training.
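A rough sketch of what that kind of filtering could look like, assuming each sample is a dict with a `text` string and a `created` date (both field names are made up for illustration). The cutoff is ChatGPT's public release date, and the phrase list only contains the two examples quoted earlier in the thread, so a real list would need to be much longer. This only catches exact phrases, not the structural patterns.

```python
from datetime import date

# Hypothetical phrase list, taken from the two examples quoted in this thread.
GPT_ISMS = (
    "however, it's important to",
    "i understand your desire to",
)

# Assumed cutoff: the public release date of ChatGPT.
CUTOFF = date(2022, 11, 30)


def normalize(text):
    # Lowercase and straighten curly apostrophes so the phrases can match.
    return text.lower().replace("\u2019", "'")


def keep_sample(sample, cutoff=CUTOFF, phrases=GPT_ISMS):
    """Keep a sample only if it predates the cutoff and has no listed phrase."""
    if sample["created"] >= cutoff:
        return False
    text = normalize(sample["text"])
    return not any(phrase in text for phrase in phrases)


def filter_dataset(samples):
    # Return the samples that pass both the date check and the phrase check.
    return [s for s in samples if keep_sample(s)]
```

Something like `filter_dataset(my_samples)` would then drop anything dated on or after the cutoff, plus older samples that happen to contain one of the phrases.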