Pass or fail test for PDF keyword match

Time:10-10

I'm trying to work out how to mark a PDF as a fail in PyPDF2 if there is no match on any page of the document. I have been using the code below, which I partly recycled and partly wrote myself. The problem is that it prints "Fail" for every single line, which I don't need. I want it to print "Fail" only once, if there are no matches on any page.


import PyPDF2
import re
import os

#create filereader object to read the PDF using PyPDF2 
object = PyPDF2.PdfFileReader("shopping.pdf")

NumPages = object.getNumPages()

print(f"This document has {NumPages} pages")

for i in range(0, NumPages):
    page = object.getPage(i)
    text = page.extractText()
    for line in text.splitlines():
        if re.match('milk', line):
            print("Pass the keyword is matched on page " + str(i), ": " + line)
        else:
            print("Fail")

CodePudding user response:

re.match only returns a match if the pattern occurs at the beginning of the string. What you're probably looking for is re.search, which scans the whole string.

Documentation: https://docs.python.org/3/library/re.html#search-vs-match
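To illustrate the difference, here is a minimal sketch using a made-up shopping-list line (the sample string is an assumption, not taken from the asker's PDF):

```python
import re

# Hypothetical line of extracted text; the keyword is not at the start.
line = "2 litres of milk"

# re.match anchors at the start of the string, so "milk" mid-line is missed.
match_result = re.match("milk", line)

# re.search scans the whole string and finds the keyword anywhere.
search_result = re.search("milk", line)

print(match_result)           # None
print(search_result.group())  # milk
```

If the keyword is only ever expected at the start of a line, re.match is still fine; otherwise re.search is the safer choice.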

CodePudding user response:

The solution is to store the matches in a list instead of printing a result immediately. The result should be printed only once, after the whole file has been read.

# [...]
loi = []  # Lines of Interest
for i in range(0, NumPages):
    page = object.getPage(i)
    text = page.extractText()
    for line in text.splitlines():
        if re.match('milk', line):
            loi.append(f'{i}:{line}')
# Result: runs once, after every page has been read
if len(loi) > 0: # or greater than a threshold
    print('Pass. The keyword is matched on the following pages:')
    print('\n'.join(loi))
else:
    print('Fail.')
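The same idea can be sketched without a PDF at hand by separating the matching from the extraction. In the sketch below, `find_keyword_lines`, `report`, and `sample_pages` are hypothetical names, the page texts are assumed to be the strings you would get from `page.extractText()`, and the use of re.search with case-insensitive matching is an assumption combining both answers rather than part of either one:

```python
import re

def find_keyword_lines(page_texts, keyword):
    """Return (page_index, line) pairs for every line containing keyword."""
    # re.escape treats the keyword literally; IGNORECASE is an assumption.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for page_index, text in enumerate(page_texts):
        for line in text.splitlines():
            if pattern.search(line):
                hits.append((page_index, line))
    return hits

def report(page_texts, keyword):
    """Build a single Pass/Fail message after all pages have been scanned."""
    hits = find_keyword_lines(page_texts, keyword)
    if hits:
        lines = [f"{page}: {line}" for page, line in hits]
        header = "Pass. The keyword is matched on the following pages:"
        return header + "\n" + "\n".join(lines)
    return "Fail."

# Stand-in for the text extracted from a two-page PDF.
sample_pages = ["bread\n2 litres of Milk", "eggs\nbutter"]
print(report(sample_pages, "milk"))
print(report(sample_pages, "cheese"))
```

With a real document, `sample_pages` would be replaced by a list built from the extracted text of each page.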
