Home > Enterprise >  Regex for bank transaction parsing
Regex for bank transaction parsing

Time:02-27

How do I parse and extract the 4 important columns from a text table of the following format? These are bank transaction line items extracted from a PDF using Ruby's pdf-reader package - as you can see the spacing between columns is very irregular between various columns.

11/4                      Stripe Transfer St-XYZ Agnostic Computers                      582.30
11/4                      Recurring Payment authorized on 11/01 Digitalocean.Com                                           12.00
11/4                      Purchase authorized on 11/01 Google *Gsuite_Get                                                  24.00
11/4                      Purchase authorized on 11/02 Amazon Web Service                                                 460.15
11/4                      Purchase authorized on 11/02 Amazon Web Service                                                   8.07           2,903.09
11/5                      Recurring Payment authorized on 11/03 Atlassian                                                  15.00           2,888.09
11/6                      Recurring Payment authorized on 11/04 Pipedrive Inc NY NY                                        24.00           2,864.09
11/12                     Foobar Retail Dis 211011 ABCDEFGH                            8,031.44
11/12                     Wire Trans Svc Charge - Sequence: 999999999999 Srf#                                              45.00
11/12                     WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC                                             5,000.00           5,850.53
11/14                      Purchase authorized on 11/13 Microconf Microconf.Com MN                                            100.00           5,702.53

The above transactions are extracted from a bank PDF with the following visual layout Table column names

Need to parse the bold colums via a regexp:

  1. Date - dd/mm format - always present
  2. Check number - always empty and may be ignored (alphanumeric single word?)
  3. Description - Text with dates, numbers, special characters - always present
  4. Credits - currency amount (only for deposits)
  5. Debits - currency amount (only for payments)
  6. Balance - currency amount (appears sporadically, not important)

I have only been able to get as far as /^(\d{1,2}\/\d{1,2})\s /mg for extracting the mm/dd. Should I start chomping the amounts from the right, but then there are no clear seperator patterns!

CodePudding user response:

Presuming your target is as csv/spreadsheet entries enter image description here

It is best to tackle the task in stages, and my preferred target format is CSV for a spreadsheet

TL;DR see last comment

11/4                      Stripe Transfer St-XYZ Agnostic Computers                      582.30
11/4                      Recurring Payment authorized on 11/01 Digitalocean.Com                                                 12.00
11/4                      Purchase authorized on 11/01 Google *Gsuite_Get                                                        24.00
11/4                      Purchase authorized on 11/02 Amazon Web Service                                                       460.15
11/4                      Purchase authorized on 11/02 Amazon Web Service                                                         8.07           2,903.09
11/5                      Recurring Payment authorized on 11/03 Atlassian                                                        15.00           2,888.09
11/6                      Recurring Payment authorized on 11/04 Pipedrive Inc NY NY                                              24.00           2,864.09
11/12                     Foobar Retail Dis 211011 ABCDEFGH                            8,031.44
11/12                     Wire Trans Svc Charge - Sequence: 999999999999 Srf#                                                    45.00
11/12                     WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC                                                   5,000.00           5,850.53
11/14                      Purchase authorized on 11/13 Microconf Microconf.Com MN                                                  100.00           5,702.53

1st we can target the bigger gaps so pick a suitable width, don't worry about the staggers they will be resolved later.
 `            ??.??                           ` to become `            ??.??                           `
We either need to protect existing commas so replace those with another unused symbol say `~` or for currency best to delete them from between numbers.
Replace all line ends with dummy extensions, it will not matter if there are too many columns if not digits so use
`    ??.??    ??.??` (yes in this case we assume under 1000 and cannot use , # or *)

thus `11/4                      Stripe Transfer St-XYZ Agnostic Computers                      582.30`
becomes  `11/4                      Stripe Transfer St-XYZ Agnostic Computers                      582.30    ??.??    ??.??`

11/4                      Stripe Transfer St-XYZ Agnostic Computers                      582.30    ??.??    ??.??
11/4                      Recurring Payment authorized on 11/01 Digitalocean.Com            ??.??                                12.00    ??.??    ??.??
11/4                      Purchase authorized on 11/01 Google *Gsuite_Get            ??.??                                       24.00    ??.??    ??.??
11/4                      Purchase authorized on 11/02 Amazon Web Service            ??.??                                      460.15    ??.??    ??.??
11/4                      Purchase authorized on 11/02 Amazon Web Service            ??.??                                        8.07           2903.09    ??.??    ??.??
11/5                      Recurring Payment authorized on 11/03 Atlassian            ??.??                                       15.00           2888.09    ??.??    ??.??
11/6                      Recurring Payment authorized on 11/04 Pipedrive Inc NY NY            ??.??                             24.00           2864.09    ??.??    ??.??
11/12                     Foobar Retail Dis 211011 ABCDEFGH                            8031.44    ??.??    ??.??
11/12                     Wire Trans Svc Charge - Sequence: 999999999999 Srf#            ??.??                                   45.00    ??.??    ??.??
11/12                     WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC            ??.??                                  5000.00           5850.53    ??.??    ??.??
11/14                      Purchase authorized on 11/13 Microconf Microconf.Com MN            ??.??                           100.00           5702.53    ??.??    ??.??

Now we can target the remaining irregular white space so replace all larger spaces with 2 or 3 spaces as appropriate (usually 2 will do, but beware any description with double spaces.)

11/4   Stripe Transfer St-XYZ Agnostic Computers   582.30   ??.??  ??.??
11/4   Recurring Payment authorized on 11/01 Digitalocean.Com  ??.??   12.00   ??.??  ??.??
11/4   Purchase authorized on 11/01 Google *Gsuite_Get  ??.??   24.00   ??.??  ??.??
11/4   Purchase authorized on 11/02 Amazon Web Service  ??.??   460.15   ??.??  ??.??
11/4   Purchase authorized on 11/02 Amazon Web Service  ??.??   8.07  2903.09   ??.??  ??.??
11/5   Recurring Payment authorized on 11/03 Atlassian  ??.??   15.00  2888.09   ??.??  ??.??
11/6   Recurring Payment authorized on 11/04 Pipedrive Inc NY NY  ??.??   24.00  2864.09   ??.??  ??.??
11/12   Foobar Retail Dis 211011 ABCDEFGH   8031.44   ??.??  ??.??
11/12   Wire Trans Svc Charge - Sequence: 999999999999 Srf#  ??.??   45.00   ??.??  ??.??
11/12   WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC  ??.??   5000.00  5850.53   ??.??  ??.??
11/14   Purchase authorized on 11/13 Microconf Microconf.Com MN  ??.??   100.00  5702.53   ??.??  ??.??

Finally add headers, replace `   ` with comma separators and remove the ??.??

Date,Description,Credits,Debits,Balance,,,

11/4,Stripe Transfer St-XYZ Agnostic Computers,582.30,,
11/4,Recurring Payment authorized on 11/01 Digitalocean.Com,,12.00,,
11/4,Purchase authorized on 11/01 Google *Gsuite_Get,,24.00,,
11/4,Purchase authorized on 11/02 Amazon Web Service,,460.15,,
11/4,Purchase authorized on 11/02 Amazon Web Service,,8.07,2903.09,,
11/5,Recurring Payment authorized on 11/03 Atlassian,,15.00,2888.09,,
11/6,Recurring Payment authorized on 11/04 Pipedrive Inc NY NY,,24.00,2864.09,,
11/12,Foobar Retail Dis 211011 ABCDEFGH,8031.44,,
11/12,Wire Trans Svc Charge - Sequence: 999999999999 Srf#,,45.00,,
11/12,WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC,,5000.00,5850.53,,
11/14,Purchase authorized on 11/13 Microconf Microconf.Com MN,,100.00,5702.53,,

On import to the spreadsheet the headers and possibly currency need a style.

In hindsight I realised all you need to do is

  1. Remove commas
  2. Inject a dummy column 3 in big white spaces (even a single ~)
  3. Pare down spaces to 2x space then replace those 2 spaces with comma
  4. Remove dummy entry e.g. the ~
  5. Add header Date,Description,Credits,Debits,Balance

The rest will take care of itself.

CodePudding user response:

The spacing between the columns is irregular but always seems to be more than 2. In that case you can use 3 capture group and an optional 4th part with also a capture group for the Debits part.

^(\d{1,2}\/\d{1,2})\s{2,}(\S.*?)\s{2,}(\d{1,3}(?:,\d{3})*\.\d{2})(?:\s{2,}(\d{1,3}(?:,\d{3})*\.\d{2}))?

In parts the pattern matches:

  • ^ Start of string
  • (\d{1,2}\/\d{1,2})\s{2,} Capture group 1 Match 1,2 digits / 1,2 digits and 2 or more whitespace chars
  • (\S.*?)\s{2,} Capture group 2 Match at least a single non whitespace chars and as least as possible characters until the next occurrence of 2 or more whitespace chars
  • (\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 3 Match the number format
  • (?: Non capture group
    • \s{2,} Match 2 or more whitespace chars
    • (\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 4, match the number format
  • )? Close the non capture group and make it optional

See a rubular regex demo

  • Related