Good day. Thank you in advance for any response to my query. Although my current reputation here is too low to upvote any response, I promise to return to do so when I am finally able.
I am trying to make the code below loop through all the PDFs in my directory, extract the text from them, and print it in one block. I am currently getting stuck in an infinite while loop. Additionally, how can my code be modified to do the same thing using a for loop?
I am not a very advanced Python user, but previous responses to my questions have helped immensely.
'''
import glob
import PyPDF2
pdfs=glob.glob("/private/babik/*.pdf")
file_name = "Announcement"
index = 0
while index<=len(pdfs):
    pdfFileObj = open(str(pdfs[index]), 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj,strict=False)
print(pdfReader.numPages)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
pdfFileObj.close()
index +=1
'''
CodePudding user response:
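You are not incrementing "index" inside the while loop. The condition should also be index < len(pdfs) rather than index <= len(pdfs), because pdfs[len(pdfs)] is past the end of the list and raises an IndexError. You should write
index = 0
while index < len(pdfs):
    pdfFileObj = open(str(pdfs[index]), 'rb')
    index = index + 1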
or, alternatively, you can use a for loop in this way, iterating directly over the pdfs list
for pdf in pdfs:
    pdfFileObj = open(str(pdf), 'rb')
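To make the comparison concrete, here is a minimal sketch of both loop styles side by side. It only uses made-up file names in place of the glob result, and the function names visit_with_while and visit_with_for are hypothetical, so it runs without any PDFs on disk.

```python
# Hypothetical file names standing in for the result of glob.glob(...)
pdfs = ["a.pdf", "b.pdf", "c.pdf"]


def visit_with_while(paths):
    """Visit every path using an explicit index."""
    visited = []
    index = 0
    while index < len(paths):  # strict <: valid indices are 0 .. len(paths) - 1
        visited.append(paths[index])
        index += 1  # without this line the loop never ends
    return visited


def visit_with_for(paths):
    """Visit every path by iterating over the list directly."""
    visited = []
    for path in paths:
        visited.append(path)
    return visited


# Both loops visit the same items in the same order.
print(visit_with_while(pdfs) == visit_with_for(pdfs))
```

The for version has no index to forget, which is why it is usually the safer choice when you only need the items themselves.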
CodePudding user response:
Your while loop contains a single line, pdfFileObj = open(str(pdfs[index]), 'rb'), which does not increment index. Since index never changes, the while never terminates.
Python's for loop is a better way to process the items of a list. You could rewrite your code to get rid of index completely.
import glob
import PyPDF2
pdfs=glob.glob("/private/babik/*.pdf")
for pdf in pdfs:
    with open(pdf, 'rb') as pdfFileObj:
        # creating a pdf reader object
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj,strict=False)
        print(pdfReader.numPages)
        pageObj = pdfReader.getPage(0)
        print(pageObj.extractText())
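Since the question asks for the text to be printed in one block, one possible variation is to collect the text of each file and join it before printing. This is only a sketch: the helper names first_page_text and collect_text are hypothetical, and the PyPDF2 calls are the same ones used in the code above, so it assumes a PyPDF2 version that still provides those names.

```python
import glob


def first_page_text(path):
    """Return the text of the first page, using the same PyPDF2 calls as above."""
    import PyPDF2  # imported lazily so the rest of the sketch runs without it

    with open(path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file, strict=False)
        return reader.getPage(0).extractText()


def collect_text(paths, extract=first_page_text):
    """Join the text extracted from each path into one newline-separated block."""
    return "\n".join(extract(path) for path in paths)


if __name__ == "__main__":
    # sorted() only makes the output order predictable; glob gives no ordering guarantee.
    print(collect_text(sorted(glob.glob("/private/babik/*.pdf"))))
```

Passing the extraction function as a parameter is just a design choice for this sketch: it lets you swap in a function that reads every page, or a stub while testing, without touching the loop.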
CodePudding user response:
In Python, spaces at the beginning of a line affect how the code is executed.
To get out of the infinite loop, you have to fix the indentation of your code by indenting all lines after while index<=len(pdfs): by four spaces (four spaces is the standard Python indentation).
The lines after the : of for, while, if, ... must be indented to indicate which lines are part of the for, while, if, ... block.
And if you don't need the index to access some other list besides the one you are looping over, always use a for loop instead of a while loop, as suggested in the answer by tdelaney.
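As a small illustration of how indentation decides what belongs to a loop, here is a sketch with made-up file names (the function name log_of_loop is hypothetical). The indented line runs once per item; the line at the outer level runs once, after the loop has finished.

```python
def log_of_loop(names):
    """Record which statements run inside the loop and which run after it."""
    log = []
    for name in names:
        log.append("inside: " + name)  # indented under for: part of the loop body
    log.append("after loop")  # same level as for: runs once, after the loop
    return log


for entry in log_of_loop(["a.pdf", "b.pdf"]):
    print(entry)
```

If the second append were indented one more level, it would become part of the loop body and run for every item instead.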