How to create a list of PDF page numbers if a PDF page contains specific text strings using Python

Time:05-14

I am trying to extract the page numbers of PDF pages that contain certain strings and append those page numbers to a single list. For example, if pages 2, 254, 439 and 458 meet the criteria, I expect the output to be the list [2, 254, 439, 458]. My code is:


object=PyPDF2.PdfFileReader(file_path)
NumPages = object.getNumPages()
String = 'specific string'
for i in range(0,NumPages):
  PageObj=object.getPage(i)
  Text = PageObj.extractText()
  ReSearch = re.search(String,Text)
  Pagelist=[]
  if ReSearch != None:
     Pagelist.append(i)
     print(Pagelist)

I received output as:

[2]
[254]
[439]
[458]

Could someone please take a look and see how I can fix it? Thank you

CodePudding user response:

Right now you are creating a new list in every iteration, so each match is appended to a fresh, empty list. Define the list only once, before the loop, and print it after the loop has finished:

Pagelist=[]
for i in range(0,NumPages):
    # rest of the loop
print(Pagelist)
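
For completeness, here is a minimal sketch of the whole thing with the fix applied. The function names `pages_containing` and `pdf_page_texts` are my own and purely illustrative; the PyPDF2 calls are the same legacy ones used in the question (`PdfFileReader`, `getNumPages`, `getPage`, `extractText`), so this assumes an older PyPDF2 release where those names are still available. The returned numbers are 0-based indices, just like `i` in the original loop.

```python
import re


def pages_containing(page_texts, pattern):
    """Return the 0-based indices of the pages whose text matches pattern."""
    regex = re.compile(pattern)
    page_list = []  # created once, before the loop
    for i, text in enumerate(page_texts):
        # `text or ""` guards against a page yielding None instead of a string
        if regex.search(text or ""):
            page_list.append(i)
    return page_list


def pdf_page_texts(file_path):
    """Yield the extracted text of each page (legacy PyPDF2 API, as in the question)."""
    import PyPDF2  # imported here so pages_containing() works without PyPDF2

    reader = PyPDF2.PdfFileReader(file_path)
    for i in range(reader.getNumPages()):
        yield reader.getPage(i).extractText()
```

Usage would then be something like `print(pages_containing(pdf_page_texts(file_path), 'specific string'))`, which prints one list after all pages have been checked. Note that the search string is treated as a regular expression, as in the question; wrap it in `re.escape()` if it should be matched literally.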