I'm trying to work out how to mark a PDF as a fail in PyPDF2 if there is no match on any page of the document. I have been using the code below, which I partly recycled and partly wrote myself. The problem is that it prints "Fail" for every single line that doesn't match, which I don't need. I want to change it so that it prints "Fail" only once, when there are no matches on any page.
import PyPDF2
import re
import os
# Create a file reader object to read the PDF using PyPDF2
object = PyPDF2.PdfFileReader("shopping.pdf")
NumPages = object.getNumPages()
print(f"This document has {NumPages} pages")
for i in range(0, NumPages):
    page = object.getPage(i)
    text = page.extractText()
    for line in text.splitlines():
        if re.match('milk', line):
            print("Pass the keyword is matched on page " + str(i), ": " + line)
        else:
            print("Fail")
CodePudding user response:
re.match only returns a match if the pattern occurs at the beginning of the string. What you're probably looking for is re.search, which scans the whole string.
Documentation: https://docs.python.org/3/library/re.html#search-vs-match
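To make the difference concrete, here is a small sketch. The sample line "whole milk 2L" is made up for illustration and is not taken from your PDF.

```python
import re

# Hypothetical line of extracted text, with the keyword in the middle.
line = "whole milk 2L"

# re.match only looks at the start of the string, so it misses "milk" here.
found_with_match = re.match("milk", line) is not None

# re.search scans the whole string, so it finds "milk".
found_with_search = re.search("milk", line) is not None

print("re.match found it: ", found_with_match)
print("re.search found it:", found_with_search)
```

If the keyword can appear anywhere in a line, which is likely for a shopping list, re.search is probably the one you want.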
CodePudding user response:
The solution is to store the matches in a list instead of printing a result immediately. Print only once, after the whole file has been read.
# [...]
loi = []  # Lines of Interest
for i in range(0, NumPages):
    page = object.getPage(i)
    text = page.extractText()
    for line in text.splitlines():
        if re.match('milk', line):
            loi.append(f'{i}:{line}')
# Result
if len(loi) > 0:  # or greater than a threshold
    print('Pass. The keyword is matched on the following pages:')
    print('\n'.join(loi))
else:
    print('Fail.')
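The same idea can be wrapped in two small functions so the search is separate from the reporting. This is only a sketch under a few assumptions: find_keyword, report and sample_pages are hypothetical names, the sample text stands in for what the PDF reader would return, it uses re.search rather than re.match so the keyword may appear anywhere in a line, and pages are numbered from 1 instead of 0.

```python
import re


def find_keyword(page_texts, keyword):
    """Return (page_number, line) pairs for every line containing keyword."""
    hits = []
    # Assumption: pages are numbered from 1 for readability.
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            # re.escape treats the keyword as literal text, not as a regex.
            if re.search(re.escape(keyword), line):
                hits.append((page_number, line))
    return hits


def report(hits):
    """Build one Pass/Fail message for the whole document."""
    if not hits:
        return "Fail"
    lines = [f"page {page}: {line}" for page, line in hits]
    return "Pass. The keyword is matched on:\n" + "\n".join(lines)


# Made-up stand-in for the text extracted from each page of the PDF.
sample_pages = ["bread\nwhole milk 2L", "eggs\nbutter"]

print(report(find_keyword(sample_pages, "milk")))
print(report(find_keyword(sample_pages, "cheese")))
```

With the reader from the question, page_texts could be built as (object.getPage(i).extractText() for i in range(NumPages)), so "Fail" is printed at most once per document.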