I'm currently parsing data from PDFs and I'd like to get the name and amount in a simple format: [NAME] [AMOUNT]
NAME LAST
7 494 25 7 494 25 199 44
NAME LAST
4 488 00 4 488 00 109 07
NAME MIDDLE LAST
7 854 00 7 854 00 298 25
NAME LAST
494 23 494 23 12 01
NAME MIDDLE LAST
4 301 56 4 301 56 112 61
NAME M LAST
13 359 25 13 359 25 130 54
This data means the following:
[NAME] [M?] [LAST]
[TOTAL WAGES] [PIT WAGES] [PIT WITHHELD]
NAME LAST $7,494.25 $7,494.25 $199.44
NAME LAST $4,488.00 $4,488.00 $109.07
NAME MIDDLE LAST $7,854.00 $7,854.00 $298.25
NAME LAST $494.23 $494.23 $12.01
NAME MIDDLE LAST $4,301.56 $4,301.56 $112.61
NAME M LAST $13,359.25 $13,359.25 $130.54
I'd like a regex to detect the duplicate group of numbers so that it parses to this:
NAME LAST $7,494.25
NAME LAST $4,488.00
NAME MIDDLE LAST $7,854.00
NAME LAST $494.23
NAME MIDDLE LAST $4,301.56
NAME M LAST $13,359.25
Hopefully, that makes sense. Thanks
CodePudding user response:
Assuming that no-one in your organisation is making more than $1M or less than $1, this regex will do what you want:
*([a-z][a-z ] )\R ((\d )(?: (\d ))? (\d )) (?=\2).*
It looks for
- some number of spaces
- names (simplistically) with
[a-z][a-z ]
(captured in group 1) - newline characters (
\R
) - 2 or 3 sets of digits separated by spaces
((\d )(?: (\d ))? (\d ))
(captured overall in group 2, with individual groups of digits captured in groups 3, 4 and 5) - a space, followed by an assertion that group 2 is repeated
(?=\2)
- characters to match the rest of the string to end of line (may not be required, dependent on your application) (
.*
)
You can replace that with
$1 \$$3$4.$5
to get the following output for your sample data:
NAME LAST $7494.25
NAME LAST $4488.00
NAME MIDDLE LAST $7854.00
NAME LAST $494.23
NAME MIDDLE LAST $4301.56
NAME M LAST $13359.25
If you're using JavaScript, you need a couple of minor changes. In the regex, replace \R
with [\r\n]
as JavaScript doesn't recognise \R
. In the substitution, replace \$
with $$
.
If your regex flavour supports conditional replacements, you can add a ,
between the thousands and hundreds by checking if group 4 was part of the match:
$1 \$$3${4: ,}$4.$5
In this case the output is:
NAME LAST $7,494.25
NAME LAST $4,488.00
NAME MIDDLE LAST $7,854.00
NAME LAST $494.23
NAME MIDDLE LAST $4,301.56
NAME M LAST $13,359.25