I was using PyPDF2 to read text from a PDF file. However, PyPDF2 does not read the text line by line; it reads it in a haphazard order and inserts newlines where none are present in the PDF.
import PyPDF2

pdf_path = r'C:\Users\PDFExample\Desktop\Temp\sample.pdf'
pdfFileObj = open(pdf_path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
page_nos = pdfReader.numPages
for i in range(page_nos):
    # Creating a page object
    pageObj = pdfReader.getPage(i)
    # Printing the page number
    print("Page No: ", i)
    # Extracting text from the page
    # and splitting it on spaces
    text = pageObj.extractText().split(" ")
    # The chunks are stored in a list;
    # a loop is used to iterate over it
    for chunk in text:
        # Printing each chunk,
        # followed by a blank line ("\n\n")
        print(chunk, end="\n\n")
    print()
This gives me output like:
Our Ref :
21
1
8
88
1
11
5
Name:
S
ky Blue
Ref 1 :
1
2
-
34
-
56789
-
2021/2
Ref 2:
F2021004
444
Amount:
$
1
00
.
11
...
whereas the expected output was:
Our Ref :2118881115 Name: Sky Blue Ref 1 :12-34-56789-2021/2 Ref 2:F2021004444
Amount: $100.11 Total Paid:$0.00 Balance: $100.11 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CLOSED Collector : Sunny Jane
Here is the link to the PDF file: https://drive.internxt.com/s/file/a6ce09dd3b967bfc131a/a1f64430147399ab527527436e686b0ee67011e7248ec3cc834e233596e162cf
CodePudding user response:
I tried a different package called pdfplumber. It was able to read the PDF line by line, exactly the way I wanted.
1. Install the package pdfplumber
pip install pdfplumber
2. Extract the text and store it in a variable
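import pdfplumber

# Same path as in the question
pdf_path = r'C:\Users\PDFExample\Desktop\Temp\sample.pdf'
pdf_text = None
with pdfplumber.open(pdf_path) as pdf:
    # Read only the first page
    first_page = pdf.pages[0]
    pdf_text = first_page.extract_text()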
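The snippet above only reads the first page. If you need every page as a list of lines, something like the following might work. It is a rough sketch, not tested against the linked file: the helper names pages_to_lines and read_pdf_lines are made up for illustration, and it assumes extract_text() can give back nothing for a page without a text layer, so such pages are skipped.

```python
from typing import Iterable, List, Optional


def pages_to_lines(page_texts: Iterable[Optional[str]]) -> List[str]:
    """Flatten per-page text into one list of non-empty, stripped lines."""
    lines: List[str] = []
    for text in page_texts:
        # Assumption: a page with no extractable text yields None or "".
        if not text:
            continue
        for line in text.splitlines():
            stripped = line.strip()
            if stripped:
                lines.append(stripped)
    return lines


def read_pdf_lines(pdf_path: str) -> List[str]:
    """Return every non-empty text line of the PDF, in page order."""
    # Imported here so pages_to_lines can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is fully consumed before the file is closed.
        return pages_to_lines(page.extract_text() for page in pdf.pages)
```

Calling read_pdf_lines(pdf_path) and printing each item should then give one PDF line per printed line, assuming pdfplumber groups the characters into lines the same way it did for the first page.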