Point me towards some basic dataset preparation tips for LLM's?

ArtifartX · 2 years ago

Point me towards some basic dataset preparation tips for LLM's?

FPham · 2 years ago

Trained and finetuned - 2 things.

The trained on wikipedia - yes, they feed the wikipedia articles to it - hook and sinker. No Q/A. But that doesn’t mean it will be able to give you answer, unless you fine tune it with Q/A “I want you to behave like this” template - but the kick is - what we all are using to our huge advantage - it can be fine-tuned on a totally different Q/A, it will still be able to answer from wikipedia. It’s a hat trick.

psdwizzard · 2 years ago

I am new to LLMs (I normally train Image Models) so if this is a stupid question let me know.

I have been converting the shadowrun lore wiki into Q and A so i can use that model for a sillytavern character as a contact in my current tabletop game. Do I really need to convert it all to Q and A? If I get a better “Contact” I dont mind.

ArtifartX · 2 years ago

Thanks for the information and explanation

__SlimeQ__ · 2 years ago

if you’re making a lora, training on wikipedia directly will pretty much make it output text that looks like wikipedia. which is to say it will (probably) be worse at chatting.

a strategy i’ve been using lately is to get gpt4 to make a conversation in my chosen format *about* each chapter of my “textbook”, i can automate this with pretty good results and it’s done in about 10 minutes. It does kind of work, it’ll at least get the bot to talk about the topics I chose, but as far as actually comprehending the information it’s referencing… it’s bad. It gets better as I increase rank, but it takes a lot of VRAM. I can only get to around 256 before it’ll die

Tiny_Arugula_5648 · 2 years ago

Go to huggingface and look at the multitude of datsets that have already been prepped and read whatever documentation and papers that have been published. Go through the data and get a sense of what the data looks like and how it’s structured.

ArtifartX · 2 years ago

Yea, doing this is part of what spurred the question, because I began to notice some datasets that were very clean and ordered into data pairs, and others that seemed formatted differently, and others still that seemed like they were fed a massive chunk of unstructured text. It made me confused on if there were some sort of standards or not that I was not aware of.