Hi everyone. I have a volunteer project whose goal is to extract citations from PDFs (not the in-text citations, but the reference entries that usually appear at the end). We converted the PDFs into text strings, but the OCR module (pytesseract) did not produce clean text or formatting, so there are many misspelled words and odd symbols.

I tried using regex to extract the citations, but I failed because I can't come up with a fixed set of custom rules that covers every case. Does anyone have experience with this?
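To make the regex problem concrete, here is a minimal sketch of that kind of rule-based approach. It is not our actual project code. It assumes the reference list sits under a heading such as "References" on its own line and that each entry starts with a marker like `[1]` or `1.`. The names `HEADING_RE`, `ENTRY_START_RE`, and `extract_citations` are made up for illustration.

```python
import re
from typing import List

# Assumed heading pattern: a line containing only a common heading word.
HEADING_RE = re.compile(
    r"^\s*(references|bibliography|works cited|literature cited)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)
# Assumed entry marker: "[12]" or "12." at the start of a line.
ENTRY_START_RE = re.compile(r"^\s*(?:\[\d{1,3}\]|\d{1,3}\.)\s+", re.MULTILINE)


def extract_citations(text: str) -> List[str]:
    """Return the numbered entries found after the last reference heading."""
    headings = list(HEADING_RE.finditer(text))
    if not headings:
        return []  # no recognizable heading (e.g. OCR mangled it)
    section = text[headings[-1].end():]

    starts = [m.start() for m in ENTRY_START_RE.finditer(section)]
    if not starts:
        return []  # unnumbered styles (e.g. author-year) are not handled
    bounds = starts + [len(section)]

    entries = []
    for begin, end in zip(bounds, bounds[1:]):
        # Collapse line breaks and runs of spaces inside one entry.
        entry = " ".join(section[begin:end].split())
        # Drop the leading "[n]" or "n." marker.
        entry = ENTRY_START_RE.sub("", entry, count=1)
        entries.append(entry)
    return entries


sample = (
    "Intro text [1].\n\nReferences\n"
    "[1] Smith, J. Deep parsing.\n    Journal of X, 2020.\n"
    "[2] Doe, A. OCR n0ise. 2019.\n"
)
print(extract_citations(sample))
```

This works on tidy input, but it breaks as soon as the OCR misreads the heading (say "Referenccs"), drops a bracket, or the document uses an unnumbered author-year style, which is exactly the situation I keep hitting.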

I tried ChatGPT by pasting in a sample text, and it returned a clean, accurate list of citations. But I want to build a model myself, since our task is to automate citation extraction across many PDF documents.

I have looked a bit into GPT-style LLMs, because I think only generative text models can help us; text classification models and the like seem less flexible for our case. Should I attempt to pretrain a model? I've heard LLaMA-2 7B is good, but I'm not sure whether it's free. The OpenAI APIs are out of scope because of our financial constraints. I am relatively new to NLP, but I'm confident in my Python programming and statistical techniques.

What are your suggestions? Any thoughts are appreciated. Thank you for your time!

  • Ma1kaNB
    10 months ago

    Hi, you should try https://wizano.io, an AI-powered content generator. You can generate images (DALL-E 3, Stable Diffusion XL), voices (Microsoft Azure, Google, and OpenAI), speech-to-text (Google), any kind of text (GPT-4 Turbo), and much more!