TL;DR: is there an example someone can point me to of RAG over highly structured documents, where the assistant returns its answers along with cross-references to document paragraphs or sections? Input: a long text document (~500-1000 pages); output: Q&A with references to the paragraph, page, or another simple cross-reference.

I’ve been looking into RAG in my (extremely limited) spare time for a few months now, but I’m getting hung up on vector databases. That’s probably because my use case revolves around highly structured specification documents, where I want to be able to recover section and paragraph references during a Q&A session with a RAG assistant.

Most off-the-shelf solutions seem not to care what your data looks like and just provide a black-box pipeline for chunking and embedding: hand over a single HTML link to a website as the source and it magically works. This confuses me, because LangChain has a great learning path with quite a bit of focus on proper data chunking and vector database structuring, yet practically every example treats the chunking and vector store step as an afterthought. I don’t like doing things I don’t understand, so I’ve been focused on creating a database for my data that makes sense in my brain.
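
For context, the kind of structure-aware chunking I have in mind looks roughly like this. It’s only a sketch; the heading regex and field names are illustrative rather than taken from any library. The point is just that the section/paragraph identifiers get stored alongside the text instead of being thrown away at chunking time.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    section: str    # e.g. "3.2.1"
    paragraph: int  # running paragraph number within the section

def chunk_spec(raw_text: str) -> list[Chunk]:
    """Split a spec on blank lines, tracking numbered section headings so the
    section/paragraph identifiers travel with each chunk."""
    chunks, current_section, para_no = [], "0", 0
    for block in raw_text.split("\n\n"):                      # naive paragraph split
        block = block.strip()
        if not block:
            continue
        heading = re.match(r"^(\d+(?:\.\d+)*)\s+\S", block)   # e.g. "3.2.1 Scope"
        if heading:
            current_section, para_no = heading.group(1), 0
        para_no += 1
        chunks.append(Chunk(text=block, section=current_section, paragraph=para_no))
    return chunks
```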

I have successfully created a local vector database (SQLite) with SBERT that returns paragraph numbers from a similarity search, but I haven’t bridged the gap to feeding those results into an LLM.
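
As I understand it, the missing bridge is mostly prompt assembly: stuff the top-k hits, paragraph numbers included, into the prompt and ask the model to cite them. A rough sketch of what I think that step looks like; the DB file, table/column names, and model choices here are assumptions for illustration, not my actual setup:

```python
import sqlite3
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

model = SentenceTransformer("all-MiniLM-L6-v2")

def top_k(query: str, k: int = 5):
    """Brute-force cosine similarity over embeddings stored in SQLite.
    Assumes a table chunks(section, paragraph, text, embedding BLOB of float32)."""
    q = model.encode(query, normalize_embeddings=True)
    rows = sqlite3.connect("spec.db").execute(
        "SELECT section, paragraph, text, embedding FROM chunks").fetchall()
    scored = []
    for section, paragraph, text, blob in rows:
        emb = np.frombuffer(blob, dtype=np.float32)
        scored.append((float(np.dot(q, emb)), section, paragraph, text))
    return sorted(scored, reverse=True)[:k]

def answer(query: str) -> str:
    # Prefix each excerpt with its section/paragraph so the model can cite it.
    context = "\n\n".join(
        f"[{sec} para {para}] {text}" for _, sec, para, text in top_k(query))
    prompt = (
        "Answer using only the excerpts below. Cite the section and paragraph "
        f"in brackets for every claim.\n\n{context}\n\nQuestion: {query}")
    client = OpenAI()  # any chat-completion endpoint would do here
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```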

Am I overthinking this? Can the off-the-shelf RAG solutions handle the paragraph numbers without me explicitly cramming them into a database structure? Or am I on the right path, and should I continue with the database that makes sense to me and keep working out how to implement the LLM step after the vector search?

I started looking at LlamaIndex, then LangChain, now AutoGen. But my spare time is limited enough that I haven’t implemented anything with any of them, only the (successful) SBERT similarity search, which didn’t use any of these. If someone has an example for structured documents where the Q&A provides cross-references, I’d really appreciate it.

  • SatoshiNotMeB

    Langroid has a DocChatAgent; you can see an example script here:

    https://github.com/langroid/langroid-examples/blob/main/examples/docqa/chat.py

    Every generated answer is accompanied by a Source (doc link or local path) and an Extract (the first few and last few words of the reference; I avoid quoting the whole sentence to save on token costs).

    There are other variants of the RAG scripts in that same folder, like multi-agent RAG (doc-chat-2.py), where a master agent delegates smaller questions to a retrieval agent and rephrases them if it can’t get an answer. There’s also doc-chat-multi-llm.py, where the master agent is powered by GPT-4 and the RAG agent by a local LLM (after all, it only needs to do extraction and summarization).
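
    Roughly, the linked chat.py boils down to something like the sketch below. This is from memory, so treat the import paths and config field names (e.g. doc_paths) as approximate; the linked script is the authoritative, current version.

```python
import langroid as lr
from langroid.agent.special.doc_chat_agent import DocChatAgent, DocChatAgentConfig

# Field names may differ across Langroid versions; check the linked chat.py.
config = DocChatAgentConfig(
    doc_paths=["my-spec.pdf"],   # local path or URL of the document to ingest
)
agent = DocChatAgent(config)
task = lr.Task(agent)            # wraps the agent in an interactive chat loop
task.run()                       # each answer comes back with Source + Extract
```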