Hey everyone,

I have a dataset of around 8 million prompt-response pairs, collected and curated from a bunch of open-source datasets on Hugging Face. I wanted to know what’s the best method to dedup this dataset. I’m planning on doing this locally (RTX 4090 with 64 GB of RAM), and I’ve looked into a few methods, but I wasn’t able to use them in my case because of my compute constraints.

Please let me know if y’all know an efficient method I can use!

TIA.