Using C#, I am attempting to extract individual invoice line items from a block of text containing ALL the line items. For each line item, I want to separate and capture the Line Item Code, Line Item Description, and Line Item Dollar Amount. The issue is that many of the line item descriptions include decimal amounts similar to dollar amounts, so the regex I am using is capturing several entire line items into one line item description. How can I alter my regex statement to include these decimal numbers in the description, while still separating prices into another match group? I am also open to other optimization suggestions
Here is the block of line items that is giving me trouble:
1244 Drayage Charge MEDU2265085
1,875.00
4083 Chassis MEDU2265085 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MEDU2265085
250.00
1248 Truck Waiting & Over Time MEDU2265085 3.5*120
420.00
1244 Drayage Charge MEDU3325790
1,875.00
4083 Chassis MEDU3325790 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MEDU3325790
250.00
1248 Truck Waiting & Over Time MEDU3325790 2.38*120
285.60
1244 Drayage Charge MSCU3870551
1,875.00
4083 Chassis MSCU3870551 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MSCU3870551
250.00
1248 Truck Waiting & Over Time MSCU3870551 3.5*120
420.00
And here is my best attempt at a regex pattern:
(?<LINE_ITEM_CODE>[0-9]{4})[\r\s\n](?<LINE_ITEM_DESCRIPTION>[A-Za-z0-9\r\s\n\-\%\&\*\.]*)[\r\n\s](?<LINE_ITEM_AMOUNT>[0-9\,]{1,7}.[0-9]{2})
If you punch these in over at regexr.com or regexstorm.net, you'll see that several of the line items are being captured as a single line item description. The alternative I had been using previously did not accommodate the 3.5, 2.38 etc. How can I target the prices while still grouping the other decimals into the description?
I'm open to alternative solutions
CodePudding user response:
You can use
(?m)^(?<LINE_ITEM_CODE>\d{4})\s (?<LINE_ITEM_DESCRIPTION>.*?)\r?\n(?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2})
See the regex demo.
Details:
(?m)^
- a multiline flag that makes^
match start of a line(?<LINE_ITEM_CODE>\d{4})
- Group "LINE_ITEM_CODE": four digits\s
- one or more whitespaces (including newlines)(?<LINE_ITEM_DESCRIPTION>.*?)
- Group "LINE_ITEM_DESCRIPTION": any zero or more chars other than newline chars as few as possible\r?\n
- CRLF or LF(?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2})
- Group "LINE_ITEM_AMOUNT": one to three digits and then zero or more repetitions of a comma and three digits and then a dot and two digits. `