I have some old engineering textbooks and wanted to try taking pictures of the pages, extracting the text with a vision model, and using this data to fine-tune an LLM. I may need to fine-tune the vision model first in order to parse the text into markdown format. But my question is which base vision model to use, especially given the dense nature of the text. These models are not well documented in terms of what input resolutions they support. Nougat? BakLLaVA? Tesseract? Would appreciate advice on a good starting point to avoid burning too much time down the wrong path.
In summary:
- Goal is to extract text from pictures of textbook pages into markdown format.
- Photos will be normal ~12MP images captured with my phone camera, one page per photo.
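
For reference, this is roughly what I was planning to try first: a minimal sketch using Nougat through the Hugging Face transformers library. The model name (`facebook/nougat-base`), the generation settings, and the filename are just placeholder assumptions on my part, not something I've tested on these photos yet.

```python
# Rough sketch (untested on my data): run Nougat on one page photo and get markdown-ish text.
import torch
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

# Assumed model checkpoint; check the model card for actual input resolution handling.
processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# One page per photo; the processor resizes/normalises the ~12MP image
# down to whatever the model actually expects.
image = Image.open("page_001.jpg").convert("RGB")  # placeholder filename
pixel_values = processor(image, return_tensors="pt").pixel_values

outputs = model.generate(
    pixel_values.to(device),
    min_length=1,
    max_new_tokens=4096,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
)

sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
# Clean the raw decoder output into markdown-style text.
markdown = processor.post_process_generation(sequence, fix_markdown=True)
print(markdown)
```

My worry is whether something like this holds up on dense two-column pages shot with a phone rather than clean PDF renders, hence the question about which base model to start from.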