I want to search through a PDF file and find the page that contains a specific phrase. What I have so far is:
import PyPDF2

object = PyPDF2.PdfFileReader("20220625.pdf")
numpages = object.getNumPages()
string = "SYSTEM WIDE OUTLET SUMMARY"
for i in range(0, numpages):
    page = object.getPage(i)
    text = page.extractText()
    if text.find(string):
        print(i)
The output of this code is: 0 1 2 3 ...
I also used "if string in text" instead of "text.find(string)", but it did not work either.
This is surprising since this phrase only exists on page 77!
CodePudding user response:
Take a look at the documentation for the str.find method, which you are using in text.find(string). The find method returns the index of the first occurrence of the specified value, or -1 if it cannot find that value. So text.find(string) returns -1 when it cannot find the string.
The condition in an if statement is evaluated for truthiness, as if the bool function were applied to it. If you run bool(-1) in a console, you will see that it evaluates to True.
So the condition in your if statement will always evaluate to True, unless string is at the very beginning of the page, in which case text.find(string) evaluates to 0, which is falsy.
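To make the three cases concrete, here is a small sketch that needs no PDF at all. The page texts are made-up strings standing in for what extractText() might return, and find_is_truthy is a hypothetical helper that only mimics the `if text.find(string):` test.

```python
# Made-up page texts (assumptions for illustration, not real PDF output).
phrase = "SYSTEM WIDE OUTLET SUMMARY"
page_without = "Some unrelated page text"
page_with_inside = "Header\nSYSTEM WIDE OUTLET SUMMARY\nFooter"
page_with_at_start = "SYSTEM WIDE OUTLET SUMMARY\nFooter"


def find_is_truthy(text, phrase):
    """Mimic `if text.find(phrase):` by returning the truthiness of find()."""
    return bool(text.find(phrase))


# Each entry pairs the raw find() result with how an if statement treats it.
results = {
    "missing": (page_without.find(phrase), find_is_truthy(page_without, phrase)),
    "inside": (page_with_inside.find(phrase), find_is_truthy(page_with_inside, phrase)),
    "at_start": (page_with_at_start.find(phrase), find_is_truthy(page_with_at_start, phrase)),
}

for case, (index, truthy) in results.items():
    print(case, index, truthy)
# missing -1 True
# inside 7 True
# at_start 0 False
```

The "missing" case is the one that makes every page number print: -1 counts as true, so the if branch runs even though the phrase is absent.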
Solution: Rather than using str.find to check for containment of a string, you should use the in operator, i.e.:
...
    if string in text:
        print(i)
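As a rough sketch of the same check outside of PyPDF2, the snippet below runs the containment test over a plain list of strings. The function names pages_containing and pages_containing_with_find, and the sample page texts, are hypothetical. In the real script the texts would come from page.extractText().

```python
def pages_containing(page_texts, phrase):
    """Return the 0-based indices of pages whose text contains phrase."""
    return [i for i, text in enumerate(page_texts) if phrase in text]


def pages_containing_with_find(page_texts, phrase):
    """Same result using str.find, comparing explicitly against -1."""
    return [i for i, text in enumerate(page_texts) if text.find(phrase) != -1]


# Made-up page texts (assumption for illustration only).
sample_pages = [
    "Cover page",
    "Table of contents",
    "SYSTEM WIDE OUTLET SUMMARY\nTotals follow",
    "Appendix",
]

matches = pages_containing(sample_pages, "SYSTEM WIDE OUTLET SUMMARY")
print(matches)  # [2]
```

If you would rather keep str.find, comparing its result against -1 explicitly, as in pages_containing_with_find, gives the same answer as the in operator.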
CodePudding user response:
Change the test in your condition. The following code should work:
import PyPDF2

t = PyPDF2.PdfFileReader("20220625.pdf")
numpages = t.getNumPages()
string = "SYSTEM WIDE OUTLET SUMMARY"
for i in range(0, numpages):
    page = t.getPage(i)
    text = page.extractText()
    if string in text:  # changed here
        print(i)