Unable to extract date of birth from a given random format

I have a set of text files from which I need to extract the date of birth. The code below extracts it from most of the files, but it fails when the date is laid out as in the sample below. How can I extract the DOB? The data is very non-uniform and broken up.

Code:

import re
string = """ This is python to extract date
D
.O.B.
: 
14 
J
u
n
e 

199
1
work in a team or as individual 
contributor.
And Name is: Zon; DOB: 12/23/
         1955  11/15/2014   11:53 AM"""

pattern = re.findall(r'.*?D.O.B.*?:\s+([\d]{1,2}\s(?:JAN|NOV|OCT|DEC|June)\s[\d]{4})', string)
pattern2 = re.findall(r'.*?DOB.*?:\s+([\d/]+)', string)
print(pattern)
print(pattern2)

Expected Output:

['14 June 1991']
['12/23/1955']

CodePudding user response：

Working with date time is always a nightmare for developers for many reasons. In your case, you are trying to extract the date of birth, which is specified with a prefix of DOB with or without separators.

I suggest not maintaining a lot of regexes in your code, since you said the date formats can vary. You can use a good library such as python-dateutil; install it from PyPI with pip install python-dateutil.

All you have to do is find a good candidate section of the text and let the library parse it. For example, in your case, find the section of text that contains the date like this:

import re
from dateutil.parser import parse

in_str = """DOB: 14 June 1991
work in a team or as individual 
contributor"""

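# find DOB-prefixed strings, capturing only the date part after the prefix
# (parse() rejects the "DOB:" prefix itself unless fuzzy parsing is enabled)
candidates = re.findall(r"D\.?O\.?B\.?:\s*(.*\d{4})\b", in_str)

# parse the dates from the candidates
parsed_dates = [parse(dt) for dt in candidates]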

print(parsed_dates)

This will give you an output like

[datetime.datetime(1991, 6, 14, 0, 0)]

From here, you can manipulate or process them easily. Isolating the date-containing sections first is not strictly necessary for the date parser to work, but it keeps your own work to a minimum.

CodePudding user response：

For the first pattern, you can add matching optional whitespace chars between the single characters.

\bD\s*\.\s*O\s*\.\s*B[^:]*:\s+(\d{1,2}\s*(?:JAN|NOV|OCT|DEC|J\s*u\s*n\s*e)(?:\s*\d){4})

Then in the match, remove the newlines.
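
As a minimal sketch of that clean-up step (the function name and the shortened sample are made up for illustration, and the pattern assumes one or more whitespace characters after the colon), the match can be compacted and then given single spaces again wherever a digit meets a letter:

```python
import re

# Pattern for the spelled-out date, assuming `\s+` after the colon.
DOB_WORDS = re.compile(
    r"\bD\s*\.\s*O\s*\.\s*B[^:]*:\s+"
    r"(\d{1,2}\s*(?:JAN|NOV|OCT|DEC|J\s*u\s*n\s*e)(?:\s*\d){4})"
)

def extract_spelled_dobs(text):
    """Return the matches with line breaks removed and single spaces restored."""
    cleaned = []
    for raw in DOB_WORDS.findall(text):
        compact = re.sub(r"\s+", "", raw)  # e.g. "14June1991"
        # Re-insert one space wherever a digit meets a letter or vice versa.
        spaced = re.sub(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)", " ", compact)
        cleaned.append(spaced)
    return cleaned

# Shortened, hypothetical version of the broken input from the question.
sample = "D\n.O.B.\n: \n14 \nJ\nu\nn\ne \n\n199\n1\nwork in a team"
print(extract_spelled_dobs(sample))  # ['14 June 1991']
```

The month alternation is kept exactly as in the pattern above, so other month names would still have to be added in the same whitespace-tolerant form.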

For the second pattern, you can match optional whitespace chars around the / and then remove the whitespace chars from the matches.

\bDOB.*?:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b

For example

import re

pattern = r"\bDOB.*?:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b"

s = (" This is python to extract date\n"
            "D\n"
            ".O.B.\n"
            ": \n"
            "14 \n"
            "J\n"
            "u\n"
            "n\n"
            "e \n\n"
            "199\n"
            "1\n"
            "work in a team or as individual \n"
            "contributor.\n"
            "And Name is: Zon; DOB: 12/23/\n"
            "         1955  11/15/2014   11:53 AM")

# strip every run of whitespace (including the line break) from each match
res = [re.sub(r"\s+", "", m) for m in re.findall(pattern, s)]
print(res)

Output

['12/23/1955']

If no other colon may occur between DOB and the colon that precedes the "date" part, you can also use a negated character class that excludes the colon instead of .*?

\bDOB[^:]*:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b
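
To illustrate the difference, here is a small sketch on a made-up line (not taken from the question's data) where the only date follows a second colon. The lazy .*? is free to run past the first colon, while [^:]* is not:

```python
import re

DATE = r"(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b"
lazy_dot = re.compile(r"\bDOB.*?:\s+" + DATE)
negated = re.compile(r"\bDOB[^:]*:\s+" + DATE)

# Hypothetical input: the date after "Updated:" is not a date of birth.
text = "DOB: n/a; Updated: 11/15/2014"

print(lazy_dot.findall(text))  # ['11/15/2014']  (false positive)
print(negated.findall(text))   # []
```

Note also that [^:]* matches newlines, whereas . does not unless re.DOTALL is set, so the negated class copes with a line break between DOB and the colon.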

CodePudding user response：

I agree with @Kris that you should keep as few regexes to maintain as possible, and make them as simple as possible. You should also, as he suggested, divide your problem into 2 steps:

1/ extracting candidates
2/ parsing (using, for example dateutil.parser.parse)

step 1: extracting candidates

One solution for making regex patterns simpler is to manipulate the input string (if possible).

For example, in your case, the difficulty comes from the varying newlines and spaces. Taking your example again:

import re

s1 = """ This is python to extract date
D
.O.B.
: 
14 
J
u
n
e 

199
1
work in a team or as individual 
contributor.
And Name is: Zon; DOB: 12/23/
         1955  11/15/2014   11:53 AM"""

You can create s2, a copy of s1 with the newlines and spaces removed:

s2 = s1.replace("\n", "").replace(" ", "")

Then your pattern becomes simpler:

# group names must be valid Python identifiers, hence date_of_birth with underscores
pattern = re.compile(r"D\.?O\.?B\.?:(?P<date_of_birth>(.*?)(\d{4}))")

(see pattern explanation below)

Match the pattern with your simplified string:

matches = [m.group('date_of_birth') for m in pattern.finditer(s2)]

You get:

>>> print(matches)
['14June1991', '12/23/1955']

step 2: parsing candidates to date objects

@Kris's suggestion works very well:

import dateutil.parser
dobs = [dateutil.parser.parse(m) for m in matches]

You get your expected result:

>>> print(dobs)
[datetime.datetime(1991, 6, 14, 0, 0), datetime.datetime(1955, 12, 23, 0, 0)]

You can then use strftime if you want to make all your dates as pretty, standardized strings:

dobs_pretty = [d.strftime('%Y-%m-%d') for d in dobs]
>>> print(dobs_pretty)
['1991-06-14', '1955-12-23']

Pattern explanation

D\.?O\.?B\.?: you look for "DOB", with or without periods (hence the ? operator), followed by a colon
(?P<date_of_birth>(.*?)(\d{4})): you capture everything to the right of "DOB:" up to and including the first 4 consecutive digits (representing the year). The lazy (.*?) captures as little as possible "up until" (\d{4}) (the 4 consecutive digits)
?P<date_of_birth> allows you to name the captured group, making retrieving the date much easier. You simply put the group name in the group() method: m.group('date_of_birth'). Python only accepts group names that are valid identifiers, so the name uses underscores rather than hyphens
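
For reference, the two steps can be combined into one standard-library-only sketch. This swaps dateutil for datetime.strptime, which is a different technique from the one used above. The function name and the KNOWN_FORMATS list are assumptions made for illustration, and %B relies on English month names in the active locale:

```python
import re
from datetime import datetime

DOB_PATTERN = re.compile(r"D\.?O\.?B\.?:(?P<date_of_birth>.*?\d{4})")

# Hypothetical list of layouts; extend it to cover the formats in your files.
KNOWN_FORMATS = ("%d%B%Y", "%d%b%Y", "%m/%d/%Y")

def extract_dobs(text):
    """Step 1: strip whitespace and find candidates. Step 2: parse them."""
    compact = re.sub(r"\s+", "", text)
    dobs = []
    for m in DOB_PATTERN.finditer(compact):
        candidate = m.group("date_of_birth")
        for fmt in KNOWN_FORMATS:
            try:
                dobs.append(datetime.strptime(candidate, fmt).date())
                break  # stop at the first format that fits
            except ValueError:
                continue  # try the next format
    return dobs

sample = (
    "D\n.O.B.\n: \n14 \nJ\nu\nn\ne \n\n199\n1\n"
    "work in a team or as individual \ncontributor.\n"
    "And Name is: Zon; DOB: 12/23/\n         1955  11/15/2014   11:53 AM"
)
print(extract_dobs(sample))
# [datetime.date(1991, 6, 14), datetime.date(1955, 12, 23)]
```

Candidates that fit none of the listed formats are silently skipped here; logging them instead may be preferable when the input is as irregular as described.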