Given a string, extract all the necessary information about the person

Time:10-25

In my homework, I need to extract the first name, last name, ID code, phone number, date of birth and address of a person from a given string using Regex. The order of the parameters always remains the same. Each parameter requires a separate pattern.

Requirements are as follows:

  1. Both first and last names always begin with a capital letter followed by at least one lowercase letter.
  2. The ID code is always exactly 11 digits long.
  3. The phone number itself consists of 7 or 8 digits. It may be separated from the area code by whitespace, but not necessarily. It is also possible that there is no area code at all.
  4. Date of birth is formatted as dd-MM-YYYY
  5. Address is everything else that remains.

I got the following patterns for each parameter:

import re

str1 = "HeinoPlekk69712047623 3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
first_name_pattern = r"^[A-Z][a-z]+"
last_name_pattern = r"[A-Z][a-z]+(?=[0-9])"
id_code_pattern = r"\d{11}(?=\ )"
phone_number_pattern = r"\ \d{3}?\s*\d{7,8}"
date_pattern = r"\d{1,2}\-\d{1,2}\-\d{1,4}"
address_pattern = r"[A-Z][a-z]*\s.*$"

first_name_match = re.findall(first_name_pattern, str1)
last_name_match = re.findall(last_name_pattern, str1)
id_code_match = re.findall(id_code_pattern, str1)
phone_number_match = re.findall(phone_number_pattern, str1)
date_match = re.findall(date_pattern, str1)
address_match = re.findall(address_pattern, str1)

So, given "HeinoPlekk69712047623 3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti", I get ['Heino'], ['Plekk'], ['69712047623'], [' 37256887364'], ['12-09-2020'] and ['Tartu mnt 183,Tallinn,16881,Eesti'], which is exactly what I want.

The problem starts when the area code is missing: id_code_pattern no longer finds the ID code, because the lookahead (?=\ ) has nothing to match. Adding an alternative, |\d{11}, creates another problem, because the pattern then matches both the ID code and the area code plus phone number (69712047623 and 37256887364). I also do not understand how to improve phone_number_pattern so that it captures only the 7 or 8 digits of the phone number itself.
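
A minimal reproduction of both symptoms, as a sketch. The string without_area_code is a made-up variant of the sample record with the area code and its leading space removed, and the variable names are only for illustration:

```python
import re

# The original ID pattern, and the variant with the "|\d{11}" alternative added.
id_code_pattern = r"\d{11}(?=\ )"
id_code_or_pattern = r"\d{11}(?=\ )|\d{11}"

with_area_code = "HeinoPlekk69712047623 3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
# Hypothetical input: the same record without the area code.
without_area_code = "HeinoPlekk697120476235688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"

# No run of 11 digits is followed by a space, so the lookahead never succeeds.
missing = re.findall(id_code_pattern, without_area_code)
print(missing)  # []

# The fallback alternative also matches the 11 digits of area code + phone number.
doubled = re.findall(id_code_or_pattern, with_area_code)
print(doubled)  # ['69712047623', '37256887364']
```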

CodePudding user response:

A single expression with some well-crafted capture groups will help you immensely:

import re
str1 = "HeinoPlekk69712047623 3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
pattern = r"^(?P<first_name>[A-Z][a-z]+)(?P<last_name>[A-Z][a-z]+)(?P<id_code>\d{11})(?P<phone>(?:\ \d{3})?\s*\d{7,8})(?P<dob>\d{1,2}\-\d{1,2}\-\d{1,4})(?P<address>.*)$"

print(re.match(pattern, str1).groupdict())


Result:

{'first_name': 'Heino', 'last_name': 'Plekk', 'id_code': '69712047623', 'phone': ' 37256887364', 'dob': '12-09-2020', 'address': 'Tartu mnt 183,Tallinn,16881,Eesti'}
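
If only the 7 or 8 digits of the phone number are wanted, one possible refinement is to give the area code its own optional group. The sketch below assumes the area code, when present, is always a space followed by three digits, as in the sample string. It also tightens the date to exactly dd-MM-YYYY, which lets the regex engine backtrack from 8 to 7 phone digits when needed. The names PERSON_RE, parse_person and area_code, and the second sample string, are made up for illustration:

```python
import re

PERSON_RE = re.compile(
    r"^(?P<first_name>[A-Z][a-z]+)"
    r"(?P<last_name>[A-Z][a-z]+)"
    r"(?P<id_code>\d{11})"
    r"(?:\ (?P<area_code>\d{3}))?"   # optional area code, kept out of "phone"
    r"\s*(?P<phone>\d{7,8})"         # only the 7 or 8 digits of the number
    r"(?P<dob>\d{2}-\d{2}-\d{4})"    # strict dd-MM-YYYY
    r"(?P<address>.*)$"
)


def parse_person(text):
    """Return the named groups as a dict, or None if the text does not match."""
    match = PERSON_RE.match(text)
    return match.groupdict() if match else None


str1 = "HeinoPlekk69712047623 3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
# Hypothetical record: no area code and a 7-digit phone number.
str2 = "HeinoPlekk69712047623568873612-09-2020Tartu mnt 183,Tallinn,16881,Eesti"

print(parse_person(str1))
print(parse_person(str2))
```

Returning None instead of calling .groupdict() directly avoids an AttributeError when the input does not match at all. With the looser \d{1,2} day part, the 7-digit case would instead be read as an 8-digit phone number followed by a one-digit day.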