I'm trying to extract unique invoice ids from strings like this:
1) Payment of invoice nr.2021-3-5450
2) Invoice 2021 3 27 has been paid
Words can change, but the Invoice id format is always:
- YEAR-MONTH-CUSTOMER_ID, or
- YEAR MONTH CUSTOMER_ID
Customer_ID can be from 1 to 9999.
I have tried this:
m = re.search(r"\d+", s)
But it only returns 2021. Is there a way that I can capture all numbers in the above formats?
Answer:
You can try the following pattern out in a regex playground.
Regex: ([\d]{4})[- ](0?[1-9]|1[0-2])[- ](\d{1,4})\b
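Explanation: This matches the year as a 4-digit number, the month as an integer from 1 to 12 (with an optional leading zero), and the customer id as 1 to 4 digits, so 0 is accepted as well as 1 to 9999. The values can be separated by either dashes or spaces. The trailing \b requires the customer id to end at a word boundary, so an id followed directly by further digits or letters is not matched. The groups are captured as (year, month, customer_id), in that order.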
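Two things the pattern does not enforce: \d{1,4} also accepts a customer id of 0, and the two separators may differ (for example 2021-3 27). If those cases should be rejected, a stricter variant might look like the sketch below. The name strict_invoice_re and the named groups are illustrative and not part of the original answer, and whether mixed separators should really be rejected is an assumption.

```python
import re

# Hypothetical stricter variant of the pattern (an assumption, not the original answer)
strict_invoice_re = re.compile(
    r'(?<!\d)(?P<year>\d{4})'              # 4-digit year, not preceded by a digit
    r'(?P<sep>[- ])'                       # first separator: dash or space
    r'(?P<month>0?[1-9]|1[0-2])'           # month 1-12, optional leading zero
    r'(?P=sep)'                            # the same separator again
    r'(?P<customer_id>(?!0+\b)\d{1,4})\b'  # 1-4 digits, not all zeros
)

samples = (
    'Payment of invoice nr.2021-3-5450',
    'Invoice 2021 3 27 has been paid',
    'Invoice 2021-3 27',   # mixed separators
    'Invoice 2021-3-0',    # customer id 0
)
for text in samples:
    m = strict_invoice_re.search(text)
    # The first two samples match; the last two give None
    print(text, '->', m.group('year', 'month', 'customer_id') if m else None)
```

The backreference (?P=sep) is what ties the second separator to the first, and the negative lookahead (?!0+\b) rejects ids made up only of zeros while still allowing zero-padded ids such as 0027.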
Python demo:
import re
from typing import NamedTuple, Optional

invoice_re = re.compile(r'([\d]{4})[- ](0?[1-9]|1[0-2])[- ](\d{1,4})\b')

# NamedTuple that contains the invoice data
class Invoice(NamedTuple):
    year: int
    month: int
    customer_id: int

def parse_invoice(invoice: str) -> Optional[Invoice]:
    """Parse an invoice, and return a tuple of (year, month, customer_id)"""
    result = invoice_re.search(invoice)
    return Invoice(*map(int, result.groups())) if result else None

s1 = 'Payment of invoice nr.2021-3-5450'
s2 = 'Invoice 2021 3 27 has been paid'
print(parse_invoice(s1))  # Invoice(year=2021, month=3, customer_id=5450)
print(parse_invoice(s2))  # Invoice(year=2021, month=3, customer_id=27)
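Since the question asks for unique invoice ids, the same pattern can also be run over many strings with finditer and the results de-duplicated. The following is a minimal sketch under that reading of "unique". The helper unique_invoice_ids and the third sample message are made up for illustration.

```python
import re

# Same pattern as in the answer above
invoice_re = re.compile(r'([\d]{4})[- ](0?[1-9]|1[0-2])[- ](\d{1,4})\b')

def unique_invoice_ids(texts):
    """Return the distinct (year, month, customer_id) tuples in first-seen order."""
    seen = {}  # dict used as an ordered set: keys keep insertion order
    for text in texts:
        # finditer yields every match in the string, not just the first one
        for match in invoice_re.finditer(text):
            key = tuple(int(g) for g in match.groups())
            seen.setdefault(key, None)
    return list(seen)

messages = [
    'Payment of invoice nr.2021-3-5450',
    'Invoice 2021 3 27 has been paid',
    'Reminder: invoice 2021-03-5450 is overdue',  # hypothetical duplicate of the first id
]
print(unique_invoice_ids(messages))
# [(2021, 3, 5450), (2021, 3, 27)]
```

Converting the groups to int before comparing means 2021-03-5450 and 2021-3-5450 count as the same invoice. If the zero-padded and unpadded forms should be treated as different ids, compare the raw strings instead.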