I’ve wondered this, and hope you get better answers.
One thing you could do, if it fits your use-case: align GDELT entries with news stories in the RealNews dataset on Hugging Face, then train a model to output the extracted info from the article. A rough sketch of the alignment step is below.
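Something like this, roughly. GDELT event rows carry a SOURCEURL field, so you can join on URL against RealNews-style records. The file names here are placeholders, and the column indices follow the GDELT event codebook, so double-check them against your actual export:

```python
# Sketch: pair GDELT event rows with RealNews articles by source URL,
# producing (article_text, extracted_fields) training examples.
# Assumes a local GDELT events export (tab-separated) and a
# RealNews-style JSONL file with "url" and "text" fields.
import csv
import json

GDELT_TSV = "gdelt_events.tsv"      # placeholder path
REALNEWS_JSONL = "realnews.jsonl"   # placeholder path

# In GDELT event exports, SOURCEURL is the last column.
events_by_url = {}
with open(GDELT_TSV, newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        url = row[-1]
        # Keep a few structured fields as the extraction target;
        # indices per the GDELT event codebook (verify on your export).
        events_by_url.setdefault(url, []).append({
            "actor1": row[6],       # Actor1Name
            "actor2": row[16],      # Actor2Name
            "event_code": row[26],  # EventCode
        })

pairs = []
with open(REALNEWS_JSONL, encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        events = events_by_url.get(article.get("url"))
        if events:
            pairs.append({"text": article["text"], "events": events})

print(f"aligned {len(pairs)} article/event pairs")
```

From there the pairs can go straight into whatever fine-tuning setup you prefer, with the article text as input and the structured events as the target.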
Another is to have GPT-4 do some examples on lightly faked / anonymized data and then distill that into a model that does well on information-extraction evals (which are a thing, iirc). The labeling step might look like the sketch below.
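A minimal sketch of that labeling step, assuming the newer OpenAI Python client; the prompt, field schema, and file names are just illustrative, not a fixed recipe:

```python
# Sketch of the distillation data step: have GPT-4 label lightly
# anonymized articles with extracted fields, then save the pairs as
# fine-tuning data for a smaller model.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract the people, organizations, and locations mentioned in the "
    "article below. Respond with a JSON object with keys 'people', "
    "'organizations', 'locations'.\n\nArticle:\n{article}"
)

def label(article_text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(article=article_text)}],
    )
    # In practice you'd want to catch malformed JSON here and retry.
    return json.loads(resp.choices[0].message.content)

with open("anonymized_articles.jsonl") as src, open("distill_train.jsonl", "w") as out:
    for line in src:
        text = json.loads(line)["text"]
        out.write(json.dumps({"text": text, "labels": label(text)}) + "\n")
```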
The reason this was necessary is a bit funny: as Google reported in their MADLAD paper, over half of Mandarin CommonCrawl text is porn. Perhaps a robots.txt issue? The Great Firewall? I have no idea.