I'm trying to parse an email signature that can contain multiple phone numbers in different formats. I've come up with this regex:
(?<![%_])\b(\ ?\d{1,}[\s.-]?\(?\d{1,}\)?[\s.-]?\(?\d{1,}\)?[\s.-]?\d{1,}[\s.-]?\d{1,}[\s.-]?\d{3,})
This matches almost any phone number. (There are so many digit classes because a number can arrive like this: 86 21 3387 6 532.) I need two things, though.
1. I don't know how to force the match to come from a single line. For example:
Newtown Square, PA 19073
360.751.1471 Some Other Text Here
The regex matches 19073 360.751.1471 instead of 360.751.1471.
2. There are different types of numbers, and I want to catch each type with a different regex, i.e. Cell, Fax, Office, Mobile, etc. Let's focus on Mobile. It can come in different formats each time, like the following (I want to catch them all with the same regex):
M: 360.123.1471
M 360.123.1471
(M): 360.123.1471
(M):360.123.1471
M:360.123.1471
Every prefix can be either lower or upper case.
I need to capture them by prefix because I get signatures such as this:
Jane Doe
Ocean Export Agent
Some Company, Inc.
Celebrating 100 years!
p:
410-123-3 123 m: 410-123-1234
a:
111 Cromwell Park Drive, Glen Burnie, MD 21061
w:
website.com e: [email protected]
And I want to capture the mobile number out of it.
How can I change my regex to solve both cases?
CodePudding user response:
Your regex is catching the newline as a whitespace character, because \s also matches newlines. You can swap it for a literal space character, which won't match a newline.
Here's how it looks fixed (each \s replaced with a literal space):
(?<![%_])\b(\ ?\d{1,}[ .-]?\(?\d{1,}\)?[ .-]?\(?\d{1,}\)?[ .-]?\d{1,}[ .-]?\d{1,}[ .-]?\d{3,})
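To check the effect, here is a minimal sketch that runs both patterns against the two-line sample from the question. The names ORIGINAL, FIXED and sample are only illustrative. Note that a literal space does not cover tabs, and two numbers separated by a space on the same line can still be merged.

```python
import re

# Pattern from the question: \s in the separator classes also matches '\n'.
ORIGINAL = re.compile(
    r'(?<![%_])\b(\ ?\d{1,}[\s.-]?\(?\d{1,}\)?[\s.-]?\(?\d{1,}\)?'
    r'[\s.-]?\d{1,}[\s.-]?\d{1,}[\s.-]?\d{3,})'
)
# Fixed pattern: a literal space instead of \s, so a match cannot cross a newline.
FIXED = re.compile(
    r'(?<![%_])\b(\ ?\d{1,}[ .-]?\(?\d{1,}\)?[ .-]?\(?\d{1,}\)?'
    r'[ .-]?\d{1,}[ .-]?\d{1,}[ .-]?\d{3,})'
)

sample = "Newtown Square, PA 19073\n360.751.1471 Some Other Text Here"

original_match = ORIGINAL.search(sample).group(1)
fixed_match = FIXED.search(sample).group(1)

print(repr(original_match))  # spans both lines (contains '\n')
print(repr(fixed_match))     # stays on the second line
```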
CodePudding user response:
When a regex gets too complex because it tries to match too many cases, it often helps to support it with an algorithmic approach.
Here, the phone number in its different formats (using dot or dash as separator) is a good candidate for a regex.
But once you add the markers (in variations such as prefix and suffix, or upper and lower case), the regex gets more and more complex.
Line breaks in multi-line text can also be hard for a regex to cover.
Use the markers for location, the regex for extraction
When we scan a text for phone numbers, we can first use markers (as in your case) to locate them. In the next step we can parse the phone number using a regex. The phone number is located after the marker if the marker is used as a prefix, or before it if the marker is used as a suffix.
See following approach:
- a set of marker strings used to narrow down the location
- a regex for the phone-number to extract it
import re
text = '''M: 360.751.0001
M 360.751.0002
(M): 360.751.0003
(M):360.751.0004
M:360.751.0005
(Mobile): 360.751.0006
(Mobile):360.751.0007
(Mobile) 360.751.0008
Mobile: 360.751.0009
Mobile:360.751.0010
Mobile 360.751.0011
360.751.0012 Mobile
360.751.0013 (M)
360.751.0014 M'''
email_signature = '''
Jane Doe
Ocean Export Agent
Some Company, Inc.
Celebrating 100 years!
p:
410-123-3 001 m: 410-123-0002
a:
111 Cromwell Park Drive, Glen Burnie, MD 21061
w:
website.com e: [email protected]
'''
# leading space for a suffix, trailing space for a prefix
phone_markers = {' M', 'M ', 'M:', '(M)', ' Mobile', 'Mobile ', 'Mobile:', '(Mobile)'}
def find_phone_numbers_marked(text, phone_markers):
    phone_numbers = []
    for line in text.split('\n'):
        found_marked = [(marker, line.find(marker)) for marker in
                        phone_markers if line.find(marker) >= 0]
        for marker, position in found_marked:
            if marker.startswith(' '):
                text_marked = line[:position]  # text before marker
            else:
                text_marked = line[position:]  # text after marker
            found_numbers = re.findall(r'\d{1,3}[.-]\d{1,3}[.-]\d{1,4}', text_marked)
            print(line, f"Marker '{marker}' at position {position}, found number:", found_numbers)
            phone_numbers.extend(found_numbers)
    return phone_numbers
total_lines = len(text.split('\n'))
print(f"== Searching in {total_lines} lines ..")
result = find_phone_numbers_marked(text, phone_markers)
print(f"== Found {len(result)} numbers:", result)
total_lines = len(email_signature.split('\n'))
print(f"== Searching in {total_lines} lines ..")
phone_markers.update([m.lower() for m in phone_markers]) # also include lowercase version of markers
result = find_phone_numbers_marked(email_signature, phone_markers)
print("== Found:", result)
Output:
== Searching in 14 lines ..
M: 360.751.0001 Marker 'M:' at position 0, found number: ['360.751.0001']
M 360.751.0002 Marker 'M ' at position 0, found number: ['360.751.0002']
(M): 360.751.0003 Marker '(M)' at position 0, found number: ['360.751.0003']
(M):360.751.0004 Marker '(M)' at position 0, found number: ['360.751.0004']
M:360.751.0005 Marker 'M:' at position 0, found number: ['360.751.0005']
(Mobile): 360.751.0006 Marker '(Mobile)' at position 0, found number: ['360.751.0006']
(Mobile):360.751.0007 Marker '(Mobile)' at position 0, found number: ['360.751.0007']
(Mobile) 360.751.0008 Marker '(Mobile)' at position 0, found number: ['360.751.0008']
Mobile: 360.751.0009 Marker 'Mobile:' at position 0, found number: ['360.751.0009']
Mobile:360.751.0010 Marker 'Mobile:' at position 0, found number: ['360.751.0010']
Mobile 360.751.0011 Marker 'Mobile ' at position 0, found number: ['360.751.0011']
360.751.0012 Mobile Marker ' Mobile' at position 12, found number: ['360.751.0012']
360.751.0012 Mobile Marker ' M' at position 12, found number: ['360.751.0012']
360.751.0013 (M) Marker '(M)' at position 13, found number: []
360.751.0014 M Marker ' M' at position 12, found number: ['360.751.0014']
== Found 14 numbers: ['360.751.0001', '360.751.0002', '360.751.0003', '360.751.0004', '360.751.0005', '360.751.0006', '360.751.0007', '360.751.0008', '360.751.0009', '360.751.0010', '360.751.0011', '360.751.0012', '360.751.0012', '360.751.0014']
== Searching in 12 lines ..
410-123-3 001 m: 410-123-0002 Marker 'm:' at position 15, found number: ['410-123-0002']
410-123-3 001 m: 410-123-0002 Marker ' m' at position 14, found number: ['410-123-3']
111 Cromwell Park Drive, Glen Burnie, MD 21061 Marker ' M' at position 37, found number: []
website.com e: [email protected] Marker 'm ' at position 10, found number: []
== Found: ['410-123-0002', '410-123-3']
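Not yet perfect: 360.751.0012 is reported twice (both ' Mobile' and ' M' match), 360.751.0013 is missed (the suffix '(M)' has no leading space, so only the text after it is searched), and in the signature ' m' picks up the partial number 410-123-3. The markers and the regex could be refined further.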
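One possible refinement for the prefix case only is to fold the marker into a single case-insensitive regex. This is a different technique from the marker scan above, sketched under assumptions: MOBILE_PREFIX and find_mobile_numbers are hypothetical names, the number part reuses the simple dot/dash pattern from the code above, and suffix markers are not handled.

```python
import re

# Hypothetical pattern (not part of the answer's code): an "m" or "mobile"
# marker as a whole word, optionally closed by ')' and/or followed by ':',
# then the number. [ \t]* instead of \s* keeps marker and number on one line.
MOBILE_PREFIX = re.compile(
    r'\b(?:mobile|m)\b\)?[ \t]*:?[ \t]*(\d{1,3}[.-]\d{1,3}[.-]\d{1,4})',
    re.IGNORECASE,
)

def find_mobile_numbers(text):
    """Return the numbers that directly follow a mobile marker."""
    return MOBILE_PREFIX.findall(text)

samples = '''M: 360.123.1471
M 360.123.1471
(M): 360.123.1471
(M):360.123.1471
M:360.123.1471'''

signature = '''p:
410-123-3 123 m: 410-123-1234
a:
111 Cromwell Park Drive, Glen Burnie, MD 21061
w:
website.com'''

print(find_mobile_numbers(samples))
print(find_mobile_numbers(signature))
```

The word boundaries around the marker keep the "M" in "MD" and the "m" in "Cromwell" or "website.com" from being treated as markers. The same idea could be repeated with other marker words for Fax, Office and so on.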