I’m currently working on a document selection problem and would like some input on how to proceed. The input I have is user data, and I have to return the set of documents that match what the user is talking about in the description.
The list of documents is very large, around 2-3 million records. The documents are very unstructured, and the user’s input will not necessarily be present in them.
So far I’ve tried the following:
- Create a summary of every document using an LLM.
- Create an embedding of each summary and store it in a vector DB.
- Create a summary of the user input on the fly.
- Create an embedding of the summarised user input, search the vector DB, and return the top x documents with a similarity score >= y.
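The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `summarize` and `embed` are hypothetical placeholders for the LLM and the embedding model (here a toy bag-of-words vector), an in-memory list stands in for the vector DB, and `top_x` / `min_score` correspond to x and y.

```python
import math
import re
from collections import Counter


def summarize(text):
    # Placeholder (assumption): a real pipeline would call an LLM here.
    return text


def embed(text):
    # Placeholder (assumption): toy bag-of-words vector instead of a real
    # embedding model, so the sketch runs without external dependencies.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    # Offline step: summarise and embed every document once.
    return [(doc_id, embed(summarize(text))) for doc_id, text in documents.items()]


def search(index, user_input, top_x, min_score):
    # Online step: summarise and embed the query, score every document,
    # keep those at or above the threshold, return the best top_x.
    query = embed(summarize(user_input))
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index]
    hits = [(doc_id, s) for doc_id, s in scored if s >= min_score]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_x]


documents = {
    "a": "invoice payment overdue reminder",
    "b": "holiday schedule for the office",
    "c": "payment failed for invoice",
}
index = build_index(documents)
results = search(index, "invoice payment", top_x=2, min_score=0.1)
```

In this toy run `results` contains documents `a` and `c` and drops `b`. With a real embedding model, every document gets some non-zero similarity, so the threshold y decides how many loosely related documents get through.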
This does get me documents very quickly, but there are a lot of false positives and I’m not sure how to reduce them.
One thing I found is that the user’s query might not be present in the documents directly, and in those cases there are a lot of false positives.
Is there any other way to solve this selection problem, or to reduce the number of false positives that come up in vector search?
I also tried re-ranking with the BM25 algorithm, but it did not help much.
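
For context, a BM25 re-ranking step over the vector-search candidates might look like the sketch below. It is a minimal, self-contained version of the Okapi BM25 formula, not the exact code I ran: the function name `bm25_rerank`, the default `k1` and `b` values, and computing document frequencies over the candidate set only are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rerank(query, candidates, k1=1.5, b=0.75):
    # candidates: dict of doc_id -> text, e.g. the top hits from vector search.
    # Returns (doc_id, score) pairs sorted by BM25 score, highest first.
    docs = {doc_id: tokenize(text) for doc_id, text in candidates.items()}
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(tokens) for tokens in docs.values()) / n

    # Document frequency of each term, computed over the candidates only
    # (assumption: no corpus-wide statistics are available at this stage).
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / (avg_len or 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


candidates = {
    "a": "car engine repair manual",
    "b": "automobile motor maintenance guide",
    "c": "cooking recipes",
}
ranked = bm25_rerank("car repair", candidates)
```

Here document `b` is about the same topic as the query but shares no words with it, so it scores 0 just like the unrelated document `c`. BM25 only rewards exact term overlap, which is consistent with it not helping when the user’s wording does not appear in the documents.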