need to find python regex to get only the last digit

Time:08-16

I have a huge PDF of invoices, one per page, all very basic text. I need to create a regex or two so that, when I split the PDF, I get the customer number and the invoice number to use in the file name. I am currently using Python 3 and PyPDF2.

Example text from two of the pages:

Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company:  (Multiple Companies) Printed by Robert S on 8/11/2022   1:26:46PM
Donna Contact Cust# Name: Customer A  1234
Customer A Invoice Date Invoice Name 8/12/2015  241849
Item Description Qty Price Extended Price
Credit ($810.00)  1 ($810.00) 1
Due Paid Total Total Taxes Subtotal
($810.00) ($810.00) $0.00 ($810.00)
Balance: ($810.00) $0.00 $0.00 
8/11/2022   1:26:46PM Page 1 of 340977

Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company:  (Multiple Companies) Printed by Robert S on 8/11/2022   1:26:46PM
Customer B Cust# Name: Customer B  45678
Customer B Invoice Date Invoice Name 8/12/2015  241850
Item Description Qty Price Extended Price
credit ($49.99)  1 ($49.99) 1
Due Paid Total Total Taxes Subtotal
($49.99) ($49.99) $0.00 ($49.99)
Balance: ($49.99) $0.00 $0.00 
8/11/2022   1:26:46PM Page 2 of 340977

Currently I have these two regexes, which roughly match each value, but I do not know how to keep only the last group's match from each. Note: the firstmatch regex breaks if the customer name contains a digit, which is an edge case but not uncommon in the data.

firstmatch=r"(Name:)(\D*)(\d+)"
secondmatch=r"(Name )(\d*.\d*.\d*..)(\d*)"

Each invoice is on its own page. From the first page I would like the regex to pull 1234 and 241849, and from the second page 45678 and 241850.

Answer:

You could get both values using a capture group that matches the last digits on the line.

For the first pattern:

\bName:.*?\b(\d+)[^\d\n]*$

Explanation

  • \bName: Match Name: preceded by a word boundary
  • .*? Match any character except a newline, as few times as possible
  • \b(\d+) A word boundary, then capture one or more digits in group 1
  • [^\d\n]* Optionally match any characters except digits or a newline
  • $ End of the string (or end of the line when the multiline flag, re.MULTILINE in Python, is set)
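
As a minimal sketch of how the first pattern (with \d+ for one or more digits) could be used from Python: the sample text is copied from the question, while the names PAGE_1, CUSTOMER_RE and customer_number are made up for illustration. The re.MULTILINE flag is an assumption about how the page text is searched; it is needed here because the line of interest sits in the middle of a multi-line string.

```python
import re

# Sample page text copied from the question (first few lines only).
PAGE_1 = """Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company:  (Multiple Companies) Printed by Robert S on 8/11/2022   1:26:46PM
Donna Contact Cust# Name: Customer A  1234
Customer A Invoice Date Invoice Name 8/12/2015  241849
Item Description Qty Price Extended Price
"""

# re.MULTILINE lets $ match at the end of every line,
# not only at the very end of the text.
CUSTOMER_RE = re.compile(r"\bName:.*?\b(\d+)[^\d\n]*$", re.MULTILINE)


def customer_number(page_text):
    """Return the last run of digits on the 'Name:' line, or None if absent."""
    match = CUSTOMER_RE.search(page_text)
    return match.group(1) if match else None


print(customer_number(PAGE_1))
# A digit inside the customer name is skipped, because only the
# last digits on the line can satisfy the rest of the pattern.
print(customer_number("Cust# Name: Customer 7 A  1234"))
```

Because the capture group must be followed only by non-digits up to the end of the line, a number inside the customer name does not get picked up, which covers the edge case mentioned in the question.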

For the second pattern you can make it a bit more specific, where [^\S\n]+ matches one or more whitespace characters other than a newline:

\bName[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$
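
A short sketch of the second pattern in Python, again with re.MULTILINE assumed so that $ can match at the end of the invoice line. The names SAMPLE_LINES, INVOICE_RE and invoice_number are hypothetical; the sample lines come from the question.

```python
import re

# Two adjacent lines from the first sample page in the question.
SAMPLE_LINES = (
    "Donna Contact Cust# Name: Customer A  1234\n"
    "Customer A Invoice Date Invoice Name 8/12/2015  241849\n"
    "Item Description Qty Price Extended Price\n"
)

# "Name", horizontal whitespace, a d/d/d date, horizontal whitespace,
# then the invoice number captured in group 1.
INVOICE_RE = re.compile(
    r"\bName[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$", re.MULTILINE
)


def invoice_number(page_text):
    """Return the digits that follow the invoice date, or None if absent."""
    match = INVOICE_RE.search(page_text)
    return match.group(1) if match else None


print(invoice_number(SAMPLE_LINES))
```

The "Name:" on the customer line is not matched by this pattern, since "Name" there is followed by a colon rather than whitespace and a date.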

Or, if the two lines directly follow each other, you can use a single pattern with two capture groups and match the newline at the end of the first line:

\bName:.*?\b(\d+)[^\d\n]*\n\b.*?Name[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$
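
One way the combined pattern might be turned into a file name is sketched below. It assumes the text of each page has already been extracted into a string (for example with PyPDF2), and the "customer_invoice.pdf" naming scheme, along with the names PAGE_2, BOTH_RE and invoice_filename, is purely illustrative.

```python
import re

# Part of the second sample page from the question.
PAGE_2 = """Company:  (Multiple Companies) Printed by Robert S on 8/11/2022   1:26:46PM
Customer B Cust# Name: Customer B  45678
Customer B Invoice Date Invoice Name 8/12/2015  241850
Item Description Qty Price Extended Price
"""

# Group 1: customer number at the end of the "Name:" line.
# Group 2: invoice number after the date on the very next line.
BOTH_RE = re.compile(
    r"\bName:.*?\b(\d+)[^\d\n]*\n"
    r"\b.*?Name[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$",
    re.MULTILINE,
)


def invoice_filename(page_text):
    """Build a hypothetical '<customer>_<invoice>.pdf' name, or None if no match."""
    match = BOTH_RE.search(page_text)
    if match is None:
        return None
    customer, invoice = match.groups()
    return f"{customer}_{invoice}.pdf"


print(invoice_filename(PAGE_2))
```

Returning None when nothing matches makes it easy to log or skip pages whose layout differs, rather than writing a file with a wrong name.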
