I wrote a program that extracts text from PDF documents, but one PDF file gives me empty text. I can open the file in Acrobat Reader and it displays fine. My code works with other PDF files, so I want to know what is causing this. I tried both PyPDF2 and pdfplumber and got the same result, so there must be something unusual about the file: https://drive.google.com/file/d/1kNqWmf0zb_Q53WnKKZ817B7h9n5bRJ50/view?usp=sharing
My Code
from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    # For the problem file, this comes back empty on every page
    text = page.extract_text()
    print(text)
I do a lot more than just this, but it's just a glimpse.
CodePudding user response:
The PDF is made of images and doesn't contain any text, so there is nothing for PyPDF2 or pdfplumber to extract :)
Cheers
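A quick way to check this for any file is to see whether every page comes back blank. Below is a minimal sketch, assuming PyPDF2 is installed; the helper names (`pages_without_text`, `looks_scanned`, `extract_page_texts`) are made up for illustration, and "no text on any page" is only a heuristic, not proof that the file is a scan.

```python
def pages_without_text(page_texts):
    """Return 0-based indices of pages whose extracted text is empty or whitespace."""
    return [i for i, text in enumerate(page_texts) if not (text or "").strip()]


def looks_scanned(page_texts):
    """Heuristic: the PDF has pages, but none of them yields any text."""
    texts = list(page_texts)
    return bool(texts) and len(pages_without_text(texts)) == len(texts)


def extract_page_texts(path):
    """Extract the text of every page with PyPDF2 (hypothetical helper)."""
    # Imported here so the two helpers above work without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


# Example use (assumes "example.pdf" exists):
#   texts = extract_page_texts("example.pdf")
#   if looks_scanned(texts):
#       print("No text layer found; this file probably needs OCR.")
```

If `looks_scanned` returns `True` for a file that displays fine in a viewer, the pages are most likely images.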
CodePudding user response:
You need to distinguish 3 types of PDF files:
- Digitally-created PDFs (aka "pure" PDFs / digitally-born PDFs): created via software like Microsoft Word, LaTeX, ... Text from those files can be read with PyPDF2 / PyMuPDF / Tika / PDFium. The mistakes here are mostly around whitespace / ligatures / font encodings / text linearization.
- Scanned PDFs: essentially those are just images. You need Optical Character Recognition (OCR) software like Tesseract to read text from images. This is prone to mistakes like confusing similar-looking characters such as o / O / 0.
- OCRed PDFs (layered PDFs): the image is in the foreground, but a text layer is in the background. You can select and copy the text. PyPDF2 / PyMuPDF / Tika / PDFium can read the text in the background.
Tesseract is open source and is used, for example, by OCRmyPDF, which adds a text layer to scanned PDFs.
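For the scanned case, one possible approach is to fall back to OCR only for pages where normal extraction returns nothing. The sketch below assumes the `pdf2image` and `pytesseract` packages plus the Poppler and Tesseract binaries are installed; the function names are hypothetical, and `extract` stands for whatever text extraction you already use.

```python
def is_blank(text):
    """True if the extracted text is None, empty, or only whitespace."""
    return not (text or "").strip()


def ocr_pages(path, dpi=300):
    """Render each page to an image and run Tesseract on it (assumed setup)."""
    # Lazy imports so the fallback logic below can be used without these packages.
    from pdf2image import convert_from_path  # requires Poppler
    import pytesseract  # requires the tesseract binary

    images = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(image) for image in images]


def extract_with_ocr_fallback(path, extract, ocr=ocr_pages):
    """Return per-page text, using OCR output for pages with no text layer.

    `extract` and `ocr` are callables that take a path and return a list
    with one string per page.
    """
    texts = [t or "" for t in extract(path)]
    if not any(is_blank(t) for t in texts):
        return texts  # every page already has text, so skip OCR entirely
    ocr_texts = ocr(path)
    return [o if is_blank(t) else t for t, o in zip(texts, ocr_texts)]
```

Here `extract` could be a small function wrapping the PyPDF2 loop from the question. OCR is slow compared with reading a text layer, which is why the sketch skips it when every page already has text.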