I am trying to extract all instances of the following blocks of text from a PDF:
Aug 12, 2022 Name of Place €11.22 €123.12
To: Long Name of Place
Card: 123456******1234
So far I have been able to write regular expressions to extract each piece of text I need from the above, i.e. date, name, price, and card number:
import re
# dates
reg_date = re.compile(r'((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}\, \d{4})')
# currency
reg_cur = re.compile(r'[\$\u20AC\u00A3]{1}\d+\.?\d{0,2}')
# vendor
reg_vdr = re.compile(r'To\:(.*)')
# card
reg_crd = re.compile(r'Card\:(.*)')
However, I am struggling to extract the blocks of text as a whole. I tried the regex below, but it throws an error (multiple repeat at position 77), which leads me to believe that the formatting is incorrect.
# block
reg_block = re.compile(r'(?<=(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}\, \d{4}(.*?)?=******\d{4})')
matches = reg_date.findall(raw['content'])
print(matches)
Am I overlooking something obvious in my final regex in terms of implementation? Could it be the line breaks causing this?
CodePudding user response:
The error you are seeing comes from the ******. In a regex, * is a quantifier that repeats the preceding token, so several in a row stack one quantifier on top of another, which gives "multiple repeat at position 77". To match literal asterisks, escape them, e.g. \*{6}.
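As a minimal sketch of that point, the snippet below compiles a shortened fragment of the card pattern (not the full regex from the question) to reproduce the error, then matches the same text with the asterisks escaped. The names `bad_fragment_error` and `card_pattern` are made up for illustration.

```python
import re

# Unescaped asterisks: the second * tries to repeat the first one.
bad_fragment_error = None
try:
    re.compile(r'\d{6}******\d{4}')
except re.error as exc:
    bad_fragment_error = str(exc)

# Escaped asterisks: \*{6} matches six literal '*' characters.
card_pattern = re.compile(r'\d{6}\*{6}\d{4}')
card_match = card_pattern.search('Card: 123456******1234')

print(bad_fragment_error)
print(card_match.group())
```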
Also, Python's re module requires the pattern inside a lookbehind to have a fixed width, so variable repetition such as \d{1,2} or .*? is not allowed there.
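A small sketch of the lookbehind restriction, using simplified patterns rather than the one from the question: the first compile fails because `\d{1,2}` can match one or two characters, while the second works because `Card: ` always has the same length. The variable names are illustrative only.

```python
import re

# Variable-width lookbehind: rejected by Python's re module at compile time.
lookbehind_error = None
try:
    re.compile(r'(?<=\d{1,2}, )\d{4}')
except re.error as exc:
    lookbehind_error = str(exc)

# Fixed-width lookbehind: allowed, because 'Card: ' is always six characters.
card_number = re.search(r'(?<=Card: )\S+', 'Card: 123456******1234').group()

print(lookbehind_error)
print(card_number)
```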
If you would like to capture the date, "name of place", currency 1, currency 2, long name of place, and card all in one go, this regex might work for you, or get you pointed in the right direction:
((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}\, \d{4} )(.*)([\$\u20AC\u00A3]{1}\d+\.?\d{0,2}) ([\$\u20AC\u00A3]{1}\d+\.?\d{0,2})[\r\n]*To:(.*)[\r\n]Card: (.*)
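Here is one way this could look in a script, as a sketch rather than a drop-in solution. It is a variant of the regex above with named groups, a lazy match for the short name, and `[^\r\n]*` so that Windows line endings do not leak into the captures. The sample text is copied from the question and stands in for `raw['content']`; the group names (`date`, `name`, `price_1`, and so on) are my own choice.

```python
import re

# Sample block from the question; in practice this would be raw['content'].
sample_text = (
    "Aug 12, 2022 Name of Place \u20AC11.22 \u20AC123.12\n"
    "To: Long Name of Place\n"
    "Card: 123456******1234\n"
)

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# Currency symbol ($, euro, pound), digits, optional decimal part.
CURRENCY = r"[$\u20AC\u00A3]\d+\.?\d{0,2}"

block_pattern = re.compile(
    rf"(?P<date>(?:{MONTHS}) \d{{1,2}}, \d{{4}}) "
    rf"(?P<name>.*?) "
    rf"(?P<price_1>{CURRENCY}) "
    rf"(?P<price_2>{CURRENCY})[\r\n]+"
    rf"To: ?(?P<vendor>[^\r\n]*)[\r\n]+"
    rf"Card: ?(?P<card>[^\r\n]*)"
)

# One dict per block found in the text.
records = [m.groupdict() for m in block_pattern.finditer(sample_text)]
for record in records:
    print(record)
```

Named groups keep the result readable when there are this many captures, and `finditer` returns every block in the text, not just the first.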
You could work around the lookbehind restriction by not using a lookbehind at all: either read capture group 3 (in my example above), or use non-capturing groups for everything except the one group you are looking for. The pattern below grabs the text between the date and the first currency amount into its only capture group, without lookahead/lookbehind. Note the lazy (.*?): a greedy (.*) would run on to the last currency amount on the line and include the first price in the capture.
(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}\, \d{4} )(.*?)(?:[\$\u20AC\u00A3]{1}\d+\.?\d{0,2})
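To illustrate the single-group approach, here is a short sketch run against the first line of the sample from the question. It assumes a lazy `(.*?)` and moves the space before the currency symbol into the non-capturing group so the captured name has no trailing space; `name_pattern` and `names` are illustrative names.

```python
import re

# First line of the sample block from the question.
sample_line = "Aug 12, 2022 Name of Place \u20AC11.22 \u20AC123.12"

name_pattern = re.compile(
    r"(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, \d{4} )"
    r"(.*?)"                               # the only capturing group
    r"(?: [$\u20AC\u00A3]\d+\.?\d{0,2})"   # first currency amount, not captured
)

# With exactly one capturing group, findall returns a list of strings.
names = name_pattern.findall(sample_line)
print(names)
```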