choyakishuB to Machine Learning@academy.garden · English · 1 year ago

[D] Help with extracting citations from text

Hi everyone. I have a volunteer project whose goal is to extract citations from PDFs (not in-text citations, but the reference lists that usually appear at the end). We converted the PDFs into text strings, but the OCR module (pytesseract) did not produce clean text or formatting, so there are many misspelled words and stray symbols.

I have attempted to use regex to extract the citations, but I failed because I could not come up with a fixed set of custom rules that covers all cases. Does anyone have experience with this?
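For context, here is a minimal sketch of the kind of rule-based approach I mean, not our actual code. It assumes the reference list follows a heading such as "References" and that each entry starts with a marker like `[1]` or `1.`; both patterns and the function name `extract_reference_entries` are made up for illustration.

```python
import re

# Assumed heading pattern: a line that contains only a common section title.
HEADING = re.compile(
    r"^[ \t]*(references|bibliography|works cited)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Assumed entry markers: "[12]" or "12." at the start of a line.
ENTRY_START = re.compile(r"^[ \t]*(?:\[\d{1,3}\]|\d{1,3}\.)[ \t]+", re.MULTILINE)


def extract_reference_entries(text):
    """Return the entries found after the last reference-style heading."""
    headings = list(HEADING.finditer(text))
    if not headings:
        return []
    # Use the last heading, since reference lists usually sit at the end.
    section = text[headings[-1].end():]
    starts = [m.start() for m in ENTRY_START.finditer(section)]
    entries = []
    for begin, end in zip(starts, starts[1:] + [len(section)]):
        # Drop the leading marker, then collapse wrapped lines into one.
        raw = ENTRY_START.sub("", section[begin:end], count=1)
        entries.append(" ".join(raw.split()))
    return entries


sample = (
    "Intro text [1] cites something.\n\n"
    "References\n"
    "[1] A. Smith. Deep parsing.\n"
    "    Journal of X, 2019.\n"
    "[2] B. Jones. OCR no1se.\n"
)
entries = extract_reference_entries(sample)
```

Rules like these break as soon as OCR garbles the heading or the entry markers, or when a document uses an unnumbered author-year style, which is exactly where I got stuck.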

I tried ChatGPT by pasting in a sample text, and it returned a clean and accurate list of citations. But I want to build a model myself, since our task is to automate citation extraction across many PDF documents.

I have looked a bit into GPT-style LLMs because I think only generative text models can help us; text classification models and the like seem less flexible for our case. Should I attempt to pretrain a model? I have heard that LLaMA-2 7B is good, but I am not sure whether it is free. OpenAI APIs are out of scope due to our financial constraints. I am relatively new to NLP, but I'm confident in my Python programming and statistical techniques.

What are your suggestions? Any thoughts are appreciated. Thank you for your time!

Ma1kaNB · English · 1 year ago
Hi, you should try https://wizano.io, an AI-powered content generator. You can generate images (DALL-E 3, Stable Diffusion XL), voices (Microsoft Azure, Google, and OpenAI), speech-to-text (Google), any kind of text (GPT-4 Turbo), and many more!
