Some of the bigger/better models make me think local is doing pretty well, at least at chat, but exploring data cleaning has taken some wind out of my sails.
Not having much luck with the ones I’ve tried (think 34B Q5 of various flavours - all the usual suspects).
Say I’ve got a paragraph about something and the text block contains some other unrelated comment, e.g. “subscribe to our newsletter” or some other web-scraping artifact. I’d like to give the LLM an instruction to filter out content not related to the paragraph topic.
Local LLMs…mostly failing. GPT-3.5…failing maybe 40% of the time. GPT-4…usually works; call it 90%.
That’s not entirely surprising, but the degree to which local models are failing at this task relative to the closed ones is frustrating me a bit.
Hell, for some 34Bs I can’t even get them to suppress the opening
Here’s the cleaned article:
…when the prompt literally says, word for word, not to include that. Are there specific LLMs for this? Or is my prompting just bad?
You are an expert at data cleaning. Given a piece of text you clean it up by removing artifacts left over from webscraping. Remove anything that doesn’t seem related to the topic of the article. For example you must remove links to external sites, image descriptions, suggestions to read other articles etc. Clean it up. Remove sentences that are not primarily in English. Keep the majority of the article. The article is between the [START] and [END] marker. Don’t include [START] or [END] in your response. It is important that there is no additional explanation or narrative added - just respond with the cleaned article. Do not start your response with “Here’s the cleaned article:”
Unrelated: OpenAI guidance says to use """ as delimiters rather than the [START]/[END] markers I’ve got. Anybody know if that holds for local models too?
I don’t think you should be surprised that a 34B model is mostly failing, considering that a much larger model (GPT-3.5) is still failing 40% of the time. What you’re asking the LLM to do is very hard for it without further training/tuning.
Most of what you wrote can be done with Python out of the box.
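For instance, a minimal sketch of that out-of-the-box approach using only stdlib `re`; the pattern list here is illustrative, not exhaustive, and a real scrape would need patterns tuned to the sites involved:

```python
import re

# Illustrative boilerplate patterns; real scrapes need a longer, site-specific list.
BOILERPLATE = [
    re.compile(r"subscribe to our newsletter", re.I),
    re.compile(r"read (more|next)", re.I),
    re.compile(r"share (this|on)", re.I),
    re.compile(r"^\[?image[:\]]", re.I),
]

def strip_boilerplate(text: str) -> str:
    """Drop any line that matches a known boilerplate pattern."""
    kept = [line for line in text.splitlines()
            if not any(p.search(line) for p in BOILERPLATE)]
    return "\n".join(kept)
```

This obviously won’t catch boilerplate it hasn’t seen before, but for the common scraping artifacts it’s fast, deterministic, and free.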
You need an AI like GPT-4, not an LLM.
Ideally we’d be in a timeline where LLMs could do this better than classical methods, but we’re not there yet. You can code a handler that cleans up HTML retrieval quite trivially, since you’re just looking for the text in specific tags like article, headers, paragraphs, etc. There are a ton of frameworks and examples out there on how to do this, and a proper handler will execute the cleanup in a fraction of the time even the most powerful LLM could.
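As a sketch of such a handler using only Python’s stdlib `html.parser` (the tag lists are illustrative; a real scraper would tune them per site, and most people would reach for a framework instead):

```python
from html.parser import HTMLParser

class ArticleTextExtractor(HTMLParser):
    """Collect text only from content-bearing tags, skipping
    chrome like nav/footer/script/style entirely."""
    CONTENT = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "article", "li"}
    SKIP = {"script", "style", "nav", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth_content = 0  # nesting depth inside content tags
        self.depth_skip = 0     # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth_skip += 1
        elif tag in self.CONTENT:
            self.depth_content += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth_skip:
            self.depth_skip -= 1
        elif tag in self.CONTENT and self.depth_content:
            self.depth_content -= 1

    def handle_data(self, data):
        # Keep text only when inside a content tag and not inside a skip tag.
        if self.depth_content and not self.depth_skip:
            text = data.strip()
            if text:
                self.chunks.append(text)

def extract_text(html: str) -> str:
    parser = ArticleTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Deterministic, runs in milliseconds, and never prepends “Here’s the cleaned article:”.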
Sort-of.
Refuel.ai fine-tuned a 13B Llama 2 for data labeling; it’s not hard to imagine applications for that here if the data volume were reasonable. Simplest thing that might work: take a paragraph at a time and have a data-labeling model answer “Is this boilerplate or content?”
Another possibility is using the TART classifier head from Hazy Research: collect up to 256 pairs of boilerplate vs. content examples, and use only as large a model as you need to get good classification results. If your data volume is large, you would run this for a while, build up a larger corpus of content vs. boilerplate, and then train a more efficient classifier with fastText or something similar (probably bigram-based).
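To make the “more efficient bigram-based classifier” step concrete, here’s a pure-Python sketch where a tiny naive Bayes over word bigrams stands in for fastText; the training texts and labels are made up for illustration:

```python
import math
from collections import Counter

def bigrams(text):
    """Word-bigram features; fall back to unigrams for one-word texts."""
    words = text.lower().split()
    return list(zip(words, words[1:])) or [(w,) for w in words]

class BigramNaiveBayes:
    """Minimal naive Bayes text classifier with Laplace smoothing.
    A stand-in for a fastText supervised model, not a replacement."""
    def __init__(self):
        self.counts = {"content": Counter(), "boiler": Counter()}
        self.totals = {"content": 0, "boiler": 0}
        self.docs = {"content": 0, "boiler": 0}

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = bigrams(text)
            self.counts[label].update(feats)
            self.totals[label] += len(feats)
            self.docs[label] += 1

    def predict(self, text):
        feats = bigrams(text)
        vocab = len(set(self.counts["content"]) | set(self.counts["boiler"])) or 1
        total_docs = sum(self.docs.values())
        best, best_lp = None, float("-inf")
        for label in self.counts:
            lp = math.log(self.docs[label] / total_docs)  # class prior
            for f in feats:
                # Laplace-smoothed log-likelihood per feature
                lp += math.log((self.counts[label][f] + 1) /
                               (self.totals[label] + vocab))
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Once trained on the corpus you bootstrapped with the bigger model, something this cheap can screen millions of paragraphs without touching a GPU.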