In my homework, I need to extract the first name, last name, ID code, phone number, date of birth and address of a person from a given string using Regex. The order of the parameters always remains the same. Each parameter requires a separate pattern.
Requirements are as follows:
- Both first and last names always begin with a capital letter followed by at least one lowercase letter.
- ID code is always 11 characters long and consists only of digits.
- The phone number itself is a sequence of 7 or 8 digits. It might be separated from the area code by whitespace, but not necessarily. It is also possible that there is no area code at all.
- Date of birth is formatted as dd-MM-YYYY.
- Address is everything else that remains.
I got the following patterns for each parameter:
import re
str1 = "HeinoPlekk69712047623 3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
first_name_pattern = r"^[A-Z][a-z]+"
last_name_pattern = r"[A-Z][a-z]+(?=[0-9])"
id_code_pattern = r"\d{11}(?=\ )"
phone_number_pattern = r"\ \d{3}?\s*\d{7,8}"
date_pattern = r"\d{1,2}\-\d{1,2}\-\d{1,4}"
address_pattern = r"[A-Z][a-z]*\s.*$"
first_name_match = re.findall(first_name_pattern, str1)
last_name_match = re.findall(last_name_pattern, str1)
id_code_match = re.findall(id_code_pattern, str1)
phone_number_match = re.findall(phone_number_pattern, str1)
date_match = re.findall(date_pattern, str1)
address_match = re.findall(address_pattern, str1)
So, given "HeinoPlekk69712047623 3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti", I get ['Heino'] ['Plekk'] ['69712047623'] [' 37256887364'] ['12-09-2020'] ['Tartu mnt 183,Tallinn,16881,Eesti'], which suits me perfectly.
The problem starts when the area code is missing: id_code_pattern can no longer find the ID code because of the (?=\ ) lookahead. If I add an alternative with |\d{11}, there is another problem, because now it finds both the ID code and the phone number (69712047623 and 37256887364). I also do not understand how to improve phone_number_pattern so that it finds only the 7 or 8 digits of the phone number.
CodePudding user response:
A single expression with some well-crafted capture groups will help you immensely:
import re
str1 = "HeinoPlekk69712047623 3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
pattern = r"^(?P<first_name>[A-Z][a-z]+)(?P<last_name>[A-Z][a-z]+)(?P<id_code>\d{11})(?P<phone>(?:\ \d{3})?\s*\d{7,8})(?P<dob>\d{1,2}\-\d{1,2}\-\d{1,4})(?P<address>.*)$"
print(re.match(pattern, str1).groupdict())
Result:
{'first_name': 'Heino', 'last_name': 'Plekk', 'id_code': '69712047623', 'phone': ' 37256887364', 'dob': '12-09-2020', 'address': 'Tartu mnt 183,Tallinn,16881,Eesti'}
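If you want the phone group to hold only the 7 or 8 digits, and the match to survive a missing area code, a stricter variant along these lines might work. This is a sketch under assumptions, not a definitive solution: the area code is taken to be exactly three digits (as in your own pattern), the date is pinned to exactly dd-MM-YYYY so that a 7-digit phone number cannot swallow the first digit of the day, and the names `area_code`, `PERSON_RE`, `ID_ONLY_RE` and `parse_person` are made up for illustration. Since the homework asks for separate patterns, `ID_ONLY_RE` shows one possible standalone ID pattern that uses a lookbehind on the last name instead of a lookahead on the space.

```python
import re

# Combined pattern. The area code (assumed to be exactly 3 digits) gets its
# own optional group, so "phone" contains only the 7 or 8 subscriber digits.
# The strict \d{2}-\d{2}-\d{4} date forces the regex engine to backtrack
# until the digits before the date split up consistently.
PERSON_RE = re.compile(
    r"^(?P<first_name>[A-Z][a-z]+)"
    r"(?P<last_name>[A-Z][a-z]+)"
    r"(?P<id_code>\d{11})"
    r"\s?(?:(?P<area_code>\d{3})\s?)?"
    r"(?P<phone>\d{7,8})"
    r"(?P<dob>\d{2}-\d{2}-\d{4})"
    r"(?P<address>.*)$"
)

# Standalone ID pattern: 11 digits directly preceded by a lowercase letter
# (the end of the last name), so it does not depend on a following space.
ID_ONLY_RE = re.compile(r"(?<=[a-z])\d{11}")


def parse_person(text):
    """Return a dict of the named groups, or None if the text does not match."""
    match = PERSON_RE.match(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    with_area = "HeinoPlekk69712047623 3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
    without_area = "HeinoPlekk697120476235688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
    print(parse_person(with_area))
    print(parse_person(without_area))
    print(ID_ONLY_RE.findall(without_area))
```

When the area code is absent, `area_code` comes back as `None` and the other groups are unaffected. The split stays unambiguous only because the day is always two digits; with `\d{1,2}` for the day, a 7-digit phone number followed by a two-digit day could also be read as an 8-digit number followed by a one-digit day.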