I have written code that extracts text from PDFs and then pulls data out of that text. One PDF, however, gives me empty text. I can open it in Acrobat Reader and it displays fine, and my code works well with other PDFs, so I want to know what is causing the issue. I tried PyPDF2 and pdfplumber, with the same result, so I suspect something is wrong with the file itself. Link to the file: https://drive.google.com/file/d/1kNqWmf0zb_Q53WnKKZ817B7h9n5bRJ50/view?usp=sharing
Here's my code:
import PyPDF2

reader = PyPDF2.PdfReader(file)
for page in reader.pages:
    text = page.extract_text()
    print(text)
My real code does a lot more than this; it's just a glimpse.
CodePudding user response:
The PDF is made of images, and doesn't contain any text :)
Cheers
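If you want to check this yourself, here is a rough sketch that labels each page by whether it yields any text and whether it references any image objects. The helper names (`page_kind`, `count_images`, `survey`) are made up for illustration, and the image count is only approximate: it looks at image XObjects listed directly in the page's own resources, so inherited resources and images nested inside form XObjects are missed.

```python
def page_kind(text, image_count):
    """Label a page from its extracted text and its number of embedded images."""
    has_text = bool(text and text.strip())
    if has_text and image_count:
        return "text+images"   # digital PDF with pictures, or an OCRed scan
    if has_text:
        return "text"
    if image_count:
        return "images-only"   # likely a scan: extraction returns nothing
    return "empty"


def count_images(page):
    """Count image XObjects listed directly in a page's resources (approximate)."""
    resources = page.get("/Resources")
    if resources is None:
        return 0
    xobjects = resources.get_object().get("/XObject")
    if xobjects is None:
        return 0
    xobjects = xobjects.get_object()
    return sum(
        1
        for name in xobjects
        if xobjects[name].get_object().get("/Subtype") == "/Image"
    )


def survey(path):
    """Return one label per page of the PDF at `path`."""
    import PyPDF2  # third-party; imported here so the helpers above work without it

    reader = PyPDF2.PdfReader(path)
    return [
        page_kind(page.extract_text(), count_images(page))
        for page in reader.pages
    ]
```

Calling `survey("your_file.pdf")` (the path is a placeholder) on a scanned document should give a list made up mostly of "images-only" labels.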
CodePudding user response:
You need to distinguish three types of PDF files:
- Digitally created PDF ("pure" PDF): created by software such as Microsoft Word or LaTeX. Text in these files can be read with PyPDF2 / PyMuPDF / Tika / PDFium. Mistakes here mostly involve whitespace, ligatures, font encodings, and text linearization.
- Scanned PDF: essentially just images. You need OCR software such as Tesseract to read text from the images. This is prone to mistakes such as confusing similar-looking characters, e.g. o / O / 0.
- OCRed PDF (layered PDF): the image is in the foreground, with a text layer in the background. You can select and copy the text. PyPDF2 / PyMuPDF / Tika / PDFium can read the text in the background layer.
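For the scanned case, one possible approach is to try normal extraction first and fall back to OCR only on pages that come back (nearly) empty. The sketch below assumes the third-party packages pdf2image (which needs poppler installed) and pytesseract (which needs the Tesseract binary). The function names and the `min_chars=20` threshold are arbitrary choices for illustration, not anything prescribed by those libraries.

```python
def needs_ocr(text, min_chars=20):
    """Treat a page as scanned when extraction yields almost no characters."""
    return len((text or "").strip()) < min_chars


def extract_with_fallback(path, dpi=300):
    """Extract text page by page, using OCR for pages without a text layer."""
    # Third-party imports are kept local so needs_ocr() is usable on its own.
    import PyPDF2
    import pytesseract
    from pdf2image import convert_from_path

    reader = PyPDF2.PdfReader(path)
    texts = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if needs_ocr(text):
            # Render only this page to an image, then run Tesseract on it.
            images = convert_from_path(
                path, dpi=dpi, first_page=number, last_page=number
            )
            text = pytesseract.image_to_string(images[0])
        texts.append(text)
    return texts
```

OCR is slow and error-prone compared with reading a real text layer, which is why this only uses it as a fallback. Expect to post-process the OCR output for the character confusions mentioned above.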