I have a set of pdf invoices that need to parsed into respectove columns/groups like below:
To add text,
Data
28/06/22 7694759.0 Service Rear Lift 660L Cardboard 1 $5.00 $5.00
30/06/22 7710113.0 Service Rear Lift 240L Co-Mingled Bin Empty 1 $10.25 $10.25
28/06/22 7694756.0 Service Rear Lift 660L Putrescible 2 $28.50 $57.00
30/06/22 7714583.0 Service Rear Lift 240L Putrescible 1 $13.91 $13.91
Site: 111947.021 Rental Charge (H8.0 x 1 / RI) GRN (x 5) 5 $10.00 $50.00
21/06/22 7648522.0 Futile Service Front Lift 1.5M mick-email 1 20 20
28/06/22 7686945.0 Removal Front Lift 4.5M Putrescible full on removal 2 100 200
03/06/22 7537009.0 Service Rear Lift 240L Putrescible JENNIFER 1 13.34 1.364
to be parsed into
Date Site Description Description_2 Qty Price Total
28/06/22 7694759.0 Service Rear Lift 660L Cardboard 1 5 5
30/06/22 7710113.0 Service Rear Lift 240L Co-Mingled Bin Empty 1 10.25 10.25
28/06/22 7694756.0 Service Rear Lift 660L Putrescible 2 28.5 57
30/06/22 7714583.0 Service Rear Lift 240L Putrescible 1 13.91 13.91
111947.021 Rental Charge (H8.0 x 1 / RI) GRN (x 5) 5 10 50
21/06/22 7648522.0 Futile Service Front Lift 1.5M mick-email 1 20 20
28/06/22 7686945.0 Removal Front Lift 4.5M Putrescible full on removal 2 100 200
03/06/22 7537009.0 Service Rear Lift 240L Putrescible JENNIFER 1 13.34 1.364
How can I perform the grouping in regex
?
CodePudding user response:
You can use the following regex:
^([\d\/] )?\s (\d \.\d )\s (\S (?:\s\S )*)\s (\S (?:\s\S )*)?\s*([\d\.] )\s ([\d\.] )\s ([\d\.] )$
Regex Explanation:
^
: start of string([\d\/] )?
: Group 1 (the date) - optional\s
: spaces(\d \.\d )
: Group 2 (the site)\s
: spaces(\S (?:\s\S )*)
: Group 3 (the description)\s
: spaces(\S (?:\s\S )*)?
: Group 4 (the description 2) - optional\s*
: spaces - optional([\d\.] )
: Group 5 (the quantity)\s
: spaces([\d\.] )
: Group 6 (the price)\s
: spaces([\d\.] )
: Group 7 (the total)$
: end of string
You can access the single groups by their group number.
Check the demo here.