I’m currently working on a document selection problem and would like some input on how to proceed. The input I have is user data, and I have to return the set of documents that match what the user is talking about in the description.

The list of documents is very large, around 2-3 million records. The documents are highly unstructured, and the user input will not necessarily be present in the documents.

Currently I’ve tried the following:

  1. Create a summary of each document using an LLM.
  2. Create an embedding of each summary and store it in a vector DB.
  3. Create a summary of the user input on the fly.
  4. Create an embedding of the summarised user input, search the vector DB, and return the top x documents with a similarity score >= y.
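
A minimal sketch of steps 2 and 4, for illustration only. Everything named here is an assumption: `embed` is a toy hashed bag-of-words stand-in for a real embedding model, the in-memory dict stands in for the vector DB, and `top_x` / `min_score` correspond to x and y above. The LLM summarisation steps are left out, so the texts are treated as already summarised.

```python
import math
import re
import zlib
from collections import Counter


def embed(text, dim=256):
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token, count in Counter(re.findall(r"\w+", text.lower())).items():
        # crc32 gives a hash that is stable across runs (unlike built-in hash()).
        vec[zlib.crc32(token.encode("utf-8")) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def search(query, index, top_x=3, min_score=0.3):
    """Return up to top_x (doc_id, score) pairs with cosine similarity >= min_score."""
    q = embed(query)
    scored = [
        (doc_id, sum(a * b for a, b in zip(q, vec)))  # dot product of unit vectors
        for doc_id, vec in index.items()
    ]
    kept = [pair for pair in scored if pair[1] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_x]


# Hypothetical document summaries; a real index would hold millions of vectors.
summaries = {
    "doc1": "refund for a cracked laptop screen",
    "doc2": "office holiday schedule for december",
    "doc3": "laptop battery replacement policy",
}
index = {doc_id: embed(text) for doc_id, text in summaries.items()}

print(search("cracked laptop screen refund request", index))
```

Raising `min_score` or lowering `top_x` trades recall for precision; with a real embedding model the scores are distributed differently, so any threshold would need to be tuned on labelled examples.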

This does get me documents very quickly, but there are a lot of false positives and I’m not sure how to reduce them.

One thing I found is that the user query might not be present in the documents directly, and in those cases there are a lot of false positives.

Is there any other way to solve this selection problem, or to reduce the number of false positives that come up in vector search?

I also tried re-ranking with the BM25 algorithm, but it did not help much.
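
For reference, a minimal sketch of what BM25 re-ranking of the vector-search candidates can look like. This is a plain-Python Okapi BM25 with assumed defaults (`k1=1.5`, `b=0.75`) and IDF computed over the candidate set only, not the full corpus; the function name and sample texts are hypothetical. Because BM25 is purely lexical, a candidate that shares no terms with the query scores 0, which is consistent with it not helping when the query wording is absent from the documents.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def bm25_rerank(query, candidates, k1=1.5, b=0.75):
    """Re-order candidates ({doc_id: text}) by BM25 score against the query."""
    docs = {doc_id: tokenize(text) for doc_id, text in candidates.items()}
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: in how many candidates each term appears.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scored = []
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates returned by the vector search.
candidates = {
    "d1": "laptop screen cracked refund request",
    "d2": "holiday schedule for the office",
}
print(bm25_rerank("cracked laptop refund", candidates))
```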