I am parsing the content of a PDF with PDFMiner, and there is a line that is sometimes present and sometimes not. I am trying to express this optional line in a regex, without any success. Here is a piece of code that shows the problem:
#!/usr/bin/python3
# coding=UTF8
import re
# Simulate reading text of a PDF file with PDFMiner.
pdfContent = """
Blah blah.
Date: 2022-01-31
Optional line here which sometimes does not show
Amount: 123.45
2: Blah blah.
"""
RE = re.compile(
r".*?"
r"Date:\s+(\S+).*?"
r"(Optional line here which sometimes does not show){0,1}.*?"
r"Amount:\s+(?P<amount>\S+)\n.*?"
, re.MULTILINE | re.DOTALL)
matches = RE.match(pdfContent)
date = matches.group(1)
optional = matches.group(2)
amount = matches.group("amount")
print(f"date = {date}")
print(f"optional = {optional}")
print(f"amount = {amount}")
The output is:
date = 2022-01-31
optional = None
amount = 123.45
Why is optional None? Notice that if I replace the {0,1} with {1}, it works! But then the line is not optional anymore. I do use the non-greedy .*? everywhere...
This is probably a duplicate, but I searched and searched and did not find my answer, thus this question.
CodePudding user response:
You can use re.search (instead of re.match) with the following pattern:
Date:\s+(\S+)(?:.*?(Optional line here which sometimes does not show))?.*?Amount:\s+(?P<amount>\S+)
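In your pattern, the lazy .*? in front of the optional group first matches nothing, so (Optional line here which sometimes does not show){0,1} is tried right after the date, fails there, and is satisfied by matching zero times. The next .*? then consumes everything up to Amount, and the group stays unset. In this pattern, .*? and (Optional line here which sometimes does not show) are wrapped together in an optional non-capturing group, (?:...)? ({0,1} is the same as ?). Since ? is a greedy quantifier, the group must be tried at least once, so .*? expands until the optional line is found. The group is skipped only when that attempt fails.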
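To make the difference concrete, here is a minimal sketch that runs both placements of the lazy gap on a shortened input. The names `text`, `outside` and `inside` and the shortened "Optional line" text are only for illustration; they are not from the question.

```python
import re

# Shortened stand-in for the PDF text (illustrative only).
text = "Date: 2022-01-31\nOptional line\nAmount: 123.45\n"

# Lazy gap in front of the optional group: the gap matches nothing, the
# group fails at that position and is satisfied by matching zero times.
outside = re.compile(
    r"Date:\s+(\S+).*?(Optional line)?.*?Amount:\s+(\S+)", re.DOTALL)

# Lazy gap inside the optional group: the greedy ? forces one attempt,
# during which the gap expands until "Optional line" is found.
inside = re.compile(
    r"Date:\s+(\S+)(?:.*?(Optional line))?.*?Amount:\s+(\S+)", re.DOTALL)

print(outside.search(text).group(2))  # None
print(inside.search(text).group(2))   # Optional line
```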
In your code, you can use it as
RE = re.compile(
r"Date:\s+(\S+)(?:.*?"
r"(Optional line here which sometimes does not show))?.*?"
r"Amount:\s+(?P<amount>\S+)",
re.DOTALL)
matches = RE.search(pdfContent)
See the Python demo:
import re
pdfContent = "\n\nBlah blah.\n\nDate: 2022-01-31\n\nOptional line here which sometimes does not show\n\nAmount: 123.45\n\n2: Blah blah.\n"
RE = re.compile(
r"Date:\s+(\S+)(?:.*?"
r"(Optional line here which sometimes does not show))?.*?"
r"Amount:\s+(?P<amount>\S+)",
re.DOTALL)
matches = RE.search(pdfContent)
date = matches.group(1)
optional = matches.group(2)
amount = matches.group("amount")
print(f"date = {date}")
print(f"optional = {optional}")
print(f"amount = {amount}")
Output:
date = 2022-01-31
optional = Optional line here which sometimes does not show
amount = 123.45
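When the optional line is absent, the same pattern should still match and leave the optional group as None. The sketch below checks both cases. The helper `parse`, the named groups `date` and `optional`, and the two sample strings are assumptions added for illustration; they are not part of the original code.

```python
import re

RE = re.compile(
    r"Date:\s+(?P<date>\S+)(?:.*?"
    r"(?P<optional>Optional line here which sometimes does not show))?.*?"
    r"Amount:\s+(?P<amount>\S+)",
    re.DOTALL)

def parse(text):
    """Hypothetical helper: return the captured fields, or None if no match."""
    m = RE.search(text)
    if m is None:
        return None
    # groupdict() maps unmatched named groups to None.
    return m.groupdict()

with_line = ("Date: 2022-01-31\n"
             "Optional line here which sometimes does not show\n"
             "Amount: 123.45\n")
without_line = "Date: 2022-01-31\nAmount: 123.45\n"

print(parse(with_line)["optional"])
print(parse(without_line)["optional"])  # None
```

Checking the result of re.search for None before calling group() avoids an AttributeError on text that contains neither field.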