Regex, How to find a repeating group of random numbers?

Time:06-21

I'm currently parsing data from PDFs and I'd like to get the name and amount in a simple format: [NAME] [AMOUNT]

 NAME LAST
7 494 25 7 494 25 199 44
 NAME LAST
4 488 00 4 488 00 109 07
 NAME MIDDLE LAST
7 854 00 7 854 00 298 25
 NAME LAST
494 23 494 23 12 01
 NAME MIDDLE LAST
4 301 56 4 301 56 112 61
 NAME M LAST
13 359 25 13 359 25 130 54

This data means the following:
[NAME] [M?] [LAST]
[TOTAL WAGES] [PIT WAGES] [PIT WITHHELD]
NAME LAST $7,494.25 $7,494.25 $199.44
NAME LAST $4,488.00 $4,488.00 $109.07
NAME MIDDLE LAST $7,854.00 $7,854.00 $298.25
NAME LAST $494.23 $494.23 $12.01
NAME MIDDLE LAST $4,301.56 $4,301.56 $112.61
NAME M LAST $13,359.25 $13,359.25 $130.54

I'd like a regex to detect the duplicate group of numbers so that it parses to this:
NAME LAST $7,494.25
NAME LAST $4,488.00
NAME MIDDLE LAST $7,854.00
NAME LAST $494.23
NAME MIDDLE LAST $4,301.56
NAME M LAST $13,359.25

Hopefully, that makes sense. Thanks

CodePudding user response:

Assuming that no-one in your organisation is making more than $1M or less than $1, this regex will do what you want:

 *([a-z][a-z ]+)\R+((\d+)(?: (\d+))? (\d+)) (?=\2).*

It looks for

  • some number of spaces ( *)
  • names (simplistically) with [a-z][a-z ]+ (captured in group 1); this needs the case-insensitive flag, since the sample names are upper case
  • newline characters (\R+)
  • 2 or 3 sets of digits separated by spaces ((\d+)(?: (\d+))? (\d+)) (captured overall in group 2, with individual groups of digits captured in groups 3, 4 and 5)
  • a space, followed by an assertion that group 2 is repeated ((?=\2))
  • characters to match the rest of the string to end of line (.*); this may not be required, depending on your application
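
To see what each group captures, here is a minimal sketch that runs the pattern against one record from the question. It is written in JavaScript, which has no \R, so [\r\n]+ stands in for \R+; the variable names are only illustrative.

```javascript
// One record from the question: a name line followed by a numbers line.
const record = " NAME M LAST\n13 359 25 13 359 25 130 54";

// [\r\n]+ replaces \R+ here, and the i flag lets [a-z] match upper-case names.
const pattern = / *([a-z][a-z ]+)[\r\n]+((\d+)(?: (\d+))? (\d+)) (?=\2).*/i;

const match = pattern.exec(record);

console.log(match[1]);                     // NAME M LAST
console.log(match[2]);                     // 13 359 25
console.log(match[3], match[4], match[5]); // 13 359 25
```

For an amount under $1,000 such as "494 23", the optional group does not take part in the match, so match[4] is undefined.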

You can replace that with

$1 \$$3$4.$5

to get the following output for your sample data:

NAME LAST $7494.25
NAME LAST $4488.00
NAME MIDDLE LAST $7854.00
NAME LAST $494.23
NAME MIDDLE LAST $4301.56
NAME M LAST $13359.25

Demo on regex101

If you're using JavaScript, you need a couple of minor changes. In the regex, replace \R with [\r\n] as JavaScript doesn't recognise \R. In the substitution, replace \$ with $$.

Demo on regex101
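
As a rough sketch of the JavaScript variant, the snippet below applies the adjusted pattern and substitution to a few of the sample records. The g and i flags are assumptions on my part (global replace, case-insensitive names), and the variable names are illustrative.

```javascript
// A few records from the question, joined with newlines.
const input = [
  " NAME LAST",
  "7 494 25 7 494 25 199 44",
  " NAME MIDDLE LAST",
  "7 854 00 7 854 00 298 25",
  " NAME LAST",
  "494 23 494 23 12 01",
].join("\n");

// \R+ becomes [\r\n]+; g replaces every record, i matches upper-case names.
const pattern = / *([a-z][a-z ]+)[\r\n]+((\d+)(?: (\d+))? (\d+)) (?=\2).*/gi;

// In a JavaScript replacement string, $$ produces a literal dollar sign.
const output = input.replace(pattern, "$1 $$$3$4.$5");

console.log(output);
// NAME LAST $7494.25
// NAME MIDDLE LAST $7854.00
// NAME LAST $494.23
```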

If your regex flavour supports conditional replacements, you can add a , between the thousands and hundreds by checking if group 4 was part of the match:

$1 \$$3${4:+,}$4.$5

In this case the output is:

NAME LAST $7,494.25
NAME LAST $4,488.00
NAME MIDDLE LAST $7,854.00
NAME LAST $494.23
NAME MIDDLE LAST $4,301.56
NAME M LAST $13,359.25

Demo on regex101
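
JavaScript's replacement strings have no conditional syntax, so one way to get the same result there is a replacer function. This is only a sketch: formatRecord is a hypothetical helper name, and the flags are the same assumptions as before (g and i).

```javascript
const input = [
  " NAME LAST",
  "494 23 494 23 12 01",
  " NAME M LAST",
  "13 359 25 13 359 25 130 54",
].join("\n");

const pattern = / *([a-z][a-z ]+)[\r\n]+((\d+)(?: (\d+))? (\d+)) (?=\2).*/gi;

// Hypothetical helper: adds the comma only when group 4 took part in the match.
// Arguments are the whole match followed by capture groups 1 to 5.
function formatRecord(_match, name, _amount, first, second, cents) {
  const dollars = second === undefined ? first : `${first},${second}`;
  return `${name} $${dollars}.${cents}`;
}

const output = input.replace(pattern, formatRecord);

console.log(output);
// NAME LAST $494.23
// NAME M LAST $13,359.25
```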
