PyPdf does not read the pdf text line by line

Time:06-21

I was using PyPDF2 to read text from a PDF file. However, PyPDF2 does not return the text line by line. It reads it in a haphazard order and inserts newlines where there are none in the PDF.

import PyPDF2

pdf_path = r'C:\Users\PDFExample\Desktop\Temp\sample.pdf'
# Open the file in binary mode; the with-block closes it when done
with open(pdf_path, 'rb') as pdfFileObj:
    # Note: these are the pre-3.0 PyPDF2 names; PyPDF2 3.0 replaced them
    # with PdfReader, reader.pages and extract_text()
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    page_nos = pdfReader.numPages
    for i in range(page_nos):
        # Creating a page object
        pageObj = pdfReader.getPage(i)
        # Printing the page number
        print("Page No: ", i)
        # Extracting text from the page and splitting it on double spaces
        text = pageObj.extractText().split("  ")
        # Printing each chunk, followed by a blank line
        # (a separate loop variable avoids shadowing the page index i)
        for chunk in text:
            print(chunk, end="\n\n")
        print()

This gives me the following output:

Our Ref :
21
1
8
88
1
11
5 
 
Name: 
S
ky Blue
 
 
Ref 1 :
1
2
-
34
-
56789
-
2021/2 
 
Ref 2:
F2021004
444
 

Amount: 
$
1
00
.
11
... 

Whereas the expected output was:

Our Ref :2118881115 Name: Sky Blue Ref 1 :12-34-56789-2021/2 Ref 2:F2021004444
Amount: $100.11 Total Paid:$0.00 Balance: $100.11 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CLOSED Collector : Sunny Jane

Here is the link to the PDF file: https://drive.internxt.com/s/file/a6ce09dd3b967bfc131a/a1f64430147399ab527527436e686b0ee67011e7248ec3cc834e233596e162cf

Answer:

I tried a different package called pdfplumber. It was able to read the PDF line by line, exactly the way I wanted.

1. Install the package pdfplumber

pip install pdfplumber

2. Get the text and store it in some container

import pdfplumber

# pdf_path is the same path used in the question
with pdfplumber.open(pdf_path) as pdf:
    # Only the first page is read here
    first_page = pdf.pages[0]
    pdf_text = first_page.extract_text()
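
The snippet above only reads the first page. If you need every page, split into individual lines, something along these lines might work. This is a rough sketch, not a tested solution for your file: split_into_lines and read_pdf_lines are names I made up for illustration, and I am assuming that a page without a text layer gives None or an empty string from extract_text().

```python
def split_into_lines(page_texts):
    """Turn an iterable of per-page text into (page_number, line) pairs."""
    results = []
    for page_number, text in enumerate(page_texts, start=1):
        # Assumption: a page with no extractable text yields None or ""
        for line in (text or "").splitlines():
            line = line.strip()
            if line:  # skip blank lines
                results.append((page_number, line))
    return results


def read_pdf_lines(pdf_path):
    """Extract the text of every page with pdfplumber and split it into lines."""
    import pdfplumber  # imported here so split_into_lines works without it

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is fully consumed before the file is closed
        return split_into_lines(page.extract_text() for page in pdf.pages)
```

You would then call read_pdf_lines(pdf_path) and loop over the returned (page_number, line) pairs. Keeping the line splitting separate from the PDF reading makes it easy to check the splitting logic on plain strings first.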