extract_text from PDF using pypdf2 giving blank page

Time:06-29

I have written code that extracts text from PDFs and pulls data out of it, but one PDF yields only empty text. I can open that PDF in Acrobat Reader and it displays fine. My code works well with other PDFs, so I want to know what is causing this. I tried both PyPDF2 and pdfplumber and got the same result, so there must be something unusual about the file. Link to the file: 'https://drive.google.com/file/d/1kNqWmf0zb_Q53WnKKZ817B7h9n5bRJ50/view?usp=sharing'. Here is my code:

 import PyPDF2
 file = "document.pdf"  # placeholder path to the problematic PDF
 reader = PyPDF2.PdfReader(file)
 for page in reader.pages:
     text = page.extract_text()  # comes back empty for this file
     print(text)

My real code does a lot more than this; it's just a glimpse.

CodePudding user response:

The PDF is made of images, and doesn't contain any text :)

Cheers
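
One way to check this for a given file is to count, per page, how many characters `extract_text()` returns and how many images the page embeds. The sketch below assumes a PyPDF2 release that exposes `page.images`; the helper names `looks_image_only` and `report` are made up for illustration and are not part of PyPDF2.

```python
def looks_image_only(text, image_count):
    """Heuristic: no extractable characters, but at least one embedded image."""
    return not (text or "").strip() and image_count > 0


def report(path):
    """Return one (page_number, char_count, image_count, image_only) tuple per page."""
    import PyPDF2  # imported lazily so looks_image_only works without PyPDF2

    reader = PyPDF2.PdfReader(path)
    rows = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        image_count = len(page.images)  # assumption: page.images is available
        rows.append(
            (number, len(text.strip()), image_count,
             looks_image_only(text, image_count))
        )
    return rows
```

Calling `report(file)` on the problematic PDF and seeing zero characters alongside one or more images on every page would be consistent with a scanned document. It is only a heuristic: a page of vector drawings, for example, would show neither text nor images.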

CodePudding user response:

You need to distinguish three types of PDF files:

  • Digitally-created PDFs ("pure" PDFs): created via software like Microsoft Word, LaTeX, ... Text from those files can be read with PyPDF2 / PyMuPDF / Tika / PDFium. The mistakes here are mostly around whitespace / ligatures / font encodings / text linearization.
  • Scanned PDFs: essentially those are just images. You need OCR software like Tesseract to read text from images. This is prone to mistakes like confusing similar-looking characters such as o / O / 0.
  • OCRed PDFs (layered PDFs): the image is in the foreground, but a text layer is in the background. You can select and copy the text. PyPDF2 / PyMuPDF / Tika / PDFium can read the text in the background.
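
The three cases suggest a simple strategy: try the text layer first (that covers digitally-created and OCRed PDFs) and run OCR only on pages that come back empty. Below is a minimal sketch of that idea, assuming the third-party packages `pdf2image` and `pytesseract` are installed together with the Poppler and Tesseract binaries they wrap. The names `text_or_ocr` and `extract_pdf_text` and the `min_chars` threshold are illustrative choices, not part of any library.

```python
def text_or_ocr(text, run_ocr, min_chars=1):
    """Return the text layer if it has enough characters, otherwise call run_ocr()."""
    text = (text or "").strip()
    if len(text) >= min_chars:
        return text
    return run_ocr().strip()


def extract_pdf_text(path, dpi=300):
    """Extract text page by page, using OCR only where the text layer is empty."""
    # Lazy imports: these are assumptions about the environment, not requirements
    # of text_or_ocr itself.
    import PyPDF2
    import pytesseract
    from pdf2image import convert_from_path

    reader = PyPDF2.PdfReader(path)
    pages = []
    for number, page in enumerate(reader.pages, start=1):
        def run_ocr(number=number):
            # Render just this page to an image, then OCR it.
            images = convert_from_path(
                path, dpi=dpi, first_page=number, last_page=number
            )
            return pytesseract.image_to_string(images[0])

        pages.append(text_or_ocr(page.extract_text(), run_ocr))
    return pages
```

Passing the OCR step as a callable means the comparatively slow rendering and recognition only happen for pages that actually need it. OCR output can still contain the character confusions mentioned above, so it may need checking before further processing.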