Python multi-line regex-CodePudding

I am using pdfplumber.page.extract_text() to extract text from bank statements. The text seems to get extracted correctly but I am having trouble with the regex expression to extract the date, type, description and amount. but I cant figure out a clean way to capture multi-line descriptions. I would like to have the description text in the gold boxes grouped with the description text in the line before the gold box.

Regex Pattern

re.findall(r'(\d{2}\/\d{2})\s*([\w ]*)([$\d.,]*)(\s{2})([$\d.,]*).*\s(?=\w*)', text)

Regex Description

(\d{2}\/\d{2}) - Capture date
([\w ]*) - Capture description
([$\d.,]*) - Capture expense amount
([$\d.,]*) - Capture deposit amount
(?=\w*) - Positive Lookahead for any text below

Input

  0  0  $12,345.67 
08/27  DEBIT CARD PURCHASE XXXXXX 5541XXXXXX  $1.23  0  $123,456.78
RACETRAC467 00004671 PLEASANTVILLEPA
08/27  BANK FUNDS TRANSFER DB  $45.67  0  $124,816.32
TO SMITH,JOHN
SAVINGS #0001, CONF# 8675309
continued on next page>>>
 987654-3210
Page 1 of 11

Current Output

['08/27', 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX  ', '$1.23', '  ', '0', '  $123,456.78 ']
['08/27', 'BANK FUNDS TRANSFER DB  ', '$45.67', '  ', '0', '  $124,816.32 ']

Desired Output

['08/27', 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX RACETRAC467 00004671 PLEASANTVILLEPA  ', '$1.23', '  ', '0', '  $123,456.78 ']
['08/27', 'BANK FUNDS TRANSFER DB TO SMITH,JOHN SAVINGS #0001, CONF# 8675309 ', '$45.67', '  ', '0', ' $124,816.32 ']

CodePudding user response：

You can add the description of the following lines (that for example do not start with a date or 'continued" or Page and a digit) to the description that you already have.

In your pattern you use [\w ]* but that can also only match spaces. If there should be at least a word character you can use \w[\w ]*

You can also omit the capture group in this part (\s{2}) as it will return an entry with spaces only.

(?P<date>\d{2}/\d{2})\s (?P<desc>\w[\w ]*)(?P<expense>\$[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s.*(?P<desc_more>(?:\n(?!\d \/\d|continued\b|Page\s \d).*)*)

The pattern matches:

(?P<date>\d{2}/\d{2}) Group date
\s Match 1 whitespace chars
(?P<desc>\w[\w ]*) Group desc match word chars and spaces
(?P<expense>\$[\d.,]*) Group expense Match $ and optional digits . or ,
\s{2} Match 2 whitespace chars
(?P<deposit>\d[\d.,]*) Group deposit Match a digits and optional digits . or ,
\s.* Match a single whitespace char and the rest of the line
(?P<desc_more> Group desc_more
- (?: Non capture group to match as a whole
  - \n(?!\d \/\d|continued\b|Page\s \d).* Match a newline, and the rest of the line if it does not start with a date like pattern or any of the other alternatives
- )* Close the non capture group and optionally repeat
) Close group desc_more

See a regex demo and a Python demo.

An example using named capture groups and the match.groupdict():

import re

pattern = r"(?P<date>\d{2}/\d{2})\s (?P<desc>\w[\w ]*)(?P<expense>\$[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s.*(?P<desc_more>(?:\n(?!\d \/\d|continued\b|Page\s \d).*)*)"

s = ("  0  0  $12,345.67 \n"
     "08/27  DEBIT CARD PURCHASE XXXXXX 5541XXXXXX  $1.23  0  $123,456.78\n"
     "RACETRAC467 00004671 PLEASANTVILLEPA\n"
     "08/27  BANK FUNDS TRANSFER DB  $45.67  0  $124,816.32\n"
     "TO SMITH,JOHN\n"
     "SAVINGS #0001, CONF# 8675309\n"
     "continued on next page>>>\n"
     " 987654-3210\n"
     "Page 1 of 11\n"
     "07/27  DEBIT CARD PURCHASE XXXXXX 6541XXXXXX  $2.23  0  $223,456.78")

matches = re.finditer(pattern, s)

for _, match in enumerate(matches):
    d = match.groupdict()
    d.update({'desc': re.sub(r"[^\S\n]*\n", " " , match.groupdict().get('desc')   match.groupdict().get('desc_more'))})
    del d["desc_more"]
    print(d)

Output

{'date': '08/27', 'desc': 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX RACETRAC467 00004671 PLEASANTVILLEPA', 'expense': '$1.23', 'deposit': '0'}
{'date': '08/27', 'desc': 'BANK FUNDS TRANSFER DB TO SMITH,JOHN SAVINGS #0001, CONF# 8675309', 'expense': '$45.67', 'deposit': '0'}
{'date': '07/27', 'desc': 'DEBIT CARD PURCHASE XXXXXX 6541XXXXXX  ', 'expense': '$2.23', 'deposit': '0'}