Anyone dumb enough to take a timeline that comes out of Musk’s mouth seriously for anything in this day and age…
I agree with your sentiment here. But you can’t deny the influx of papers that take something extremely simple or inconsequential and deliberately dress it up to look as complex as possible just to get published. Regardless of your sentiment (which, again, I mostly agree with), those kinds of papers are not good and we’d all be better off without them. I think there is a place for shame for certain types of papers, and I’d disagree with the idea that shame is always bad or shouldn’t be used as a tool.
- If the fact was at the beginning of the document, it was recalled regardless of context length
Lol at OpenAI adding a cheap trick like this, since they know the first thing people will test at high context lengths is recall from the beginning.
Yea, doing this is part of what spurred the question, because I began to notice that some datasets were very clean and ordered into data pairs, others seemed formatted differently, and others still looked like they were fed a massive chunk of unstructured text. It left me confused about whether there were standards I wasn’t aware of.
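For what it’s worth, the two layouts you’re describing roughly correspond to instruction-style pairs (commonly stored as JSONL, one example per line) versus raw text used for continued pre-training; as far as I know there’s no single universal standard, just per-trainer conventions. A minimal sketch of both, with made-up field names like "prompt"/"completion" purely for illustration:

```python
import json

# Layout 1: structured pairs, one JSON object per line (JSONL).
# Field names vary by trainer ("prompt"/"completion", "instruction"/"output",
# chat-style "messages", etc.) -- these are just placeholders.
pairs = [
    {"prompt": "What is the capital of France?", "completion": "Paris."},
    {"prompt": "Translate 'hello' to Spanish.", "completion": "Hola."},
]
with open("pairs.jsonl", "w") as f:
    for example in pairs:
        f.write(json.dumps(example) + "\n")

# Layout 2: one big chunk of unstructured text; the trainer typically slices
# it into fixed-length token sequences with no prompt/response boundaries.
raw_text = (
    "Long-form documents concatenated together with no structure at all. "
    "This style is typical for continued pre-training rather than instruction tuning.\n"
)
with open("raw_corpus.txt", "w") as f:
    f.write(raw_text)
```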
Awesome, thank you! At a glance this looks like it will be very helpful.
Thanks for the information and explanation
Replicate
What service do you use for GPU rental and inference for it?
It is still relatively censored, but a great base to work with.
I’ve definitely seen a few of those.