Find a string in a pdf file using python

Time:11-14

I want to search through a pdf file and find the page that contains a specific phrase. What I have so far is:

object = PyPDF2.PdfFileReader("20220625.pdf")
numpages = object.getNumPages()
string = "SYSTEM WIDE OUTLET SUMMARY"
for i in range(0, numpages):
    page = object.getPage(i)
    text = page.extractText()
    if text.find(string):
        print(i)

The output of this code is: 0 1 2 3 ...

I also used "if string in text" instead of "text.find(string)", but it did not work either.

This is surprising since this phrase only exists on page 77!

CodePudding user response:

Take a look at the documentation for the str.find method, which you are calling as text.find(string). The find method returns the index of the first occurrence of the specified value, or -1 if it cannot find that value. So text.find(string) returns -1 whenever the phrase is not on the page.

The condition of an if statement is evaluated for truthiness, as if the bool function were applied to it. If you run bool(-1) in a console, you will see that it evaluates to True, because every non-zero integer is truthy.

So the condition in your if statement evaluates to True on every page, unless string is at the very beginning of the page, in which case text.find(string) returns 0, which is falsy.
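To make the three cases concrete, here is a small sketch that needs no PDF at all. The text value is a made-up stand-in for one page of extracted text, not content from the asker's file.

```python
# Hypothetical page text, used only for illustration.
text = "SYSTEM WIDE OUTLET SUMMARY for June"

missing = text.find("not on this page")  # -1: phrase is absent
at_start = text.find("SYSTEM")           # 0: phrase starts the text
later = text.find("OUTLET")              # 12: phrase appears further in

for label, index in [("missing", missing), ("at_start", at_start), ("later", later)]:
    # bool(index) is what `if text.find(...)` effectively tests;
    # index != -1 is the test that was actually intended.
    print(label, index, bool(index), index != -1)
```

Only the at_start case is falsy, which is the opposite of what the original condition was meant to detect.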

Solution: Rather than using str.find to check for containment of a string, you should use the in operator i.e.:

...
if string in text:
    print(i)

CodePudding user response:

Change the condition so that it tests for containment with the in operator. The following code should work:

import PyPDF2

t = PyPDF2.PdfFileReader("20220625.pdf")
numpages = t.getNumPages()
string = "SYSTEM WIDE OUTLET SUMMARY"
for i in range(0, numpages):
    page = t.getPage(i)
    text = page.extractText()
    if string in text:  # changed here: containment test instead of str.find
        print(i)  # zero-based page index
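The asker mentions that "if string in text" did not work either. One possible cause, which is only an assumption here since the PDF is not available, is that the extracted text has line breaks, extra spaces, or different capitalization inside the phrase. A minimal sketch of a more tolerant search follows; normalize, pages_containing, and sample_pages are hypothetical names introduced for illustration.

```python
import re


def normalize(s):
    """Collapse runs of whitespace to single spaces and ignore case."""
    return re.sub(r"\s+", " ", s).strip().casefold()


def pages_containing(page_texts, phrase):
    """Return the zero-based indexes of pages whose text contains phrase."""
    target = normalize(phrase)
    # `text or ""` guards against pages where extraction returned None.
    return [i for i, text in enumerate(page_texts) if target in normalize(text or "")]


# Hypothetical extracted text standing in for real PDF pages.
sample_pages = [
    "Cover page",
    "SYSTEM WIDE\nOUTLET   SUMMARY\nTotals ...",
    "Appendix",
]
print(pages_containing(sample_pages, "SYSTEM WIDE OUTLET SUMMARY"))  # [1]
```

With the PyPDF2 calls used in this thread, page_texts could be built as [t.getPage(i).extractText() for i in range(t.getNumPages())]. The indexes are zero-based, so a phrase on the 77th page would be reported as 76.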