Text extraction libraries don't return text for non-empty page


I wrote a program that extracts text from PDF documents, but one PDF file gives me empty text. I can open the file in Acrobat Reader and it displays fine, and my code works with other PDF files, so I want to know what is causing this issue. I tried both PyPDF2 and pdfplumber with the same result, so there must be something about this particular file: https://drive.google.com/file/d/1kNqWmf0zb_Q53WnKKZ817B7h9n5bRJ50/view?usp=sharing

My Code

 from PyPDF2 import PdfReader

 reader = PdfReader("example.pdf")
 # Print the extracted text of every page
 for page in reader.pages:
     text = page.extract_text()
     print(text)

I do a lot more than just this, but this is just a glimpse of the relevant part.

CodePudding user response:

The PDF is made of images, and doesn't contain any text :)

Cheers
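
A quick way to confirm that the pages contain images but no extractable text is to inspect them with pdfplumber. This is a minimal sketch; the file name example.pdf is assumed:

 import pdfplumber

 with pdfplumber.open("example.pdf") as pdf:
     for number, page in enumerate(pdf.pages, start=1):
         text = page.extract_text() or ""
         # A page with images but zero characters of text has no text layer to extract.
         print(f"page {number}: {len(page.images)} image(s), {len(text)} characters of text")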

CodePudding user response:

You need to distinguish three types of PDF files:

  • Digitally-created PDFs (aka "pure" or digitally-born PDFs): created with software such as Microsoft Word or LaTeX. Text from these files can be read with PyPDF2 / PyMuPDF / Tika / pdfium. Mistakes here are mostly around whitespace, ligatures, font encodings, and text linearization.
  • Scanned PDFs: essentially just images. You need Optical Character Recognition (OCR) software such as Tesseract to read text from them (see the sketch at the end of this answer). OCR is prone to mistakes like confusing similar-looking characters such as o / O / 0.
  • OCRed PDFs (layered PDFs): the image is in the foreground, and a text layer sits behind it. You can select and copy the text, and PyPDF2 / PyMuPDF / Tika / pdfium can read that background text layer.

Tesseract is open source and is used, for example, by OCRmyPDF.
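
If a page turns out to have no text layer, one option is to render it to an image and run Tesseract on it. This is only a sketch, assuming pdf2image (which needs the Poppler tools) and pytesseract (which needs the Tesseract binary) are installed; the helper name and file name are illustrative:

 from PyPDF2 import PdfReader
 from pdf2image import convert_from_path  # requires the Poppler tools
 import pytesseract                        # requires the Tesseract binary

 def extract_text_with_ocr_fallback(path):
     """Use the embedded text layer where it exists, OCR the page image otherwise."""
     reader = PdfReader(path)
     rendered = None  # page images are rendered lazily, only if OCR is actually needed
     texts = []
     for index, page in enumerate(reader.pages):
         text = (page.extract_text() or "").strip()
         if not text:
             if rendered is None:
                 rendered = convert_from_path(path)  # one PIL image per page
             text = pytesseract.image_to_string(rendered[index])
         texts.append(text)
     return texts

 print("\n".join(extract_text_with_ocr_fallback("example.pdf")))

Rendering at a higher resolution (convert_from_path accepts a dpi argument) usually improves OCR accuracy at the cost of speed.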
