I'm struggling to extract words from a set of PDF files. These files are academic papers that I downloaded from the web. The data is stored on my local device, sorted by name, under this relative path inside the project folder: './papers/data'.
PDFs usually do not have a clear concept of lines and words. They just place characters/text boxes at certain positions in the document. An extractor can't read the file "char by char" like e.g. a txt file; instead it parses from top left to bottom right and uses the distances between characters to make assumptions about what is a line, what is a word, and so on. Since the document in your first picture seems to use not only the space character but also per-character margins to the left and right to create nicer spacing for the text, the parser has difficulty understanding it.
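To illustrate the idea (this is a toy sketch, not how any particular library implements it): characters arrive with x-coordinates rather than explicit spaces, and a horizontal gap wider than some tolerance is treated as a word boundary. The data and the threshold here are made up for illustration.

```python
def group_into_words(chars, gap_tolerance=1.0):
    """Group characters into words by horizontal gaps.

    chars: list of (char, x0, x1) tuples, already sorted left to right,
    where x0/x1 are the character's left and right edges.
    """
    words = []
    current = ""
    prev_x1 = None
    for ch, x0, x1 in chars:
        # A gap wider than the tolerance starts a new word. Note that a
        # negative margin (x0 < prev_x1) never triggers a break, which is
        # why tightly-kerned text tends to merge into one "word".
        if prev_x1 is not None and x0 - prev_x1 > gap_tolerance:
            words.append(current)
            current = ""
        current += ch
        prev_x1 = x1
    if current:
        words.append(current)
    return words


chars = [("P", 0, 5), ("D", 5.2, 10), ("F", 10.1, 15),
         ("f", 18, 22), ("i", 22.1, 24), ("l", 24.2, 26), ("e", 26.1, 30)]
print(group_into_words(chars))  # ['PDF', 'file']
```

If a PDF uses negative margins for kerning, the gaps shrink or go negative, and a heuristic like this can no longer tell word boundaries apart, which is exactly the failure you're seeing.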
Every parser does this slightly differently, so it may make sense to try out a few different ones; perhaps another was trained/designed on documents with similar patterns and is able to parse yours correctly. Also, since the PDF in your example does contain all the valid spaces, but then confuses the parser by pulling the characters closer together with negative margins, normal copy and paste into a txt file won't have that issue, since copy and paste ignores the margins.
If we are talking about a giant amount of data and you are willing to put some more time into this, you can check out some sources on Optical Character Recognition Post-Correction (OCR post-correction): models that try to fix text that was parsed with errors (although that work usually focuses more on characters being misidentified due to different fonts etc. than on spacing issues).
CodePudding user response:
PyPDF2 has been unmaintained since 2018.
The problem is that a lot of pages on the web recommend PyPDF2, but hardly anyone actually uses it nowadays.
I recently went down the same path until I realized PyPDF2 is dead. I ended up using https://github.com/jsvine/pdfplumber. It is actively maintained, easy to use, and performs very well.