I'm trying to parse an email signature that can contain multiple phone numbers in different formats. I've managed to come up with this regex:

(?<![%_])\b(\ ?\d{1,}[\s.-]?\(?\d{1,}\)?[\s.-]?\(?\d{1,}\)?[\s.-]?\d{1,}[\s.-]?\d{1,}[\s.-]?\d{3,})

This matches almost any phone number. (There are so many digit classes because a number can arrive like this: 86 21 3387 6 532.) I need two things, though.

1. I don't know how to force the match to come from a single line. For example:

Newtown Square, PA 19073
360.751.1471 Some Other Text Here

The regex matches 19073 360.751.1471 instead of 360.751.1471.

2. There are different types of numbers, and I want to catch each type with a different regex, i.e. Cell, Fax, Office, Mobile, etc. Let's focus on Mobile. It can come in a different format each time, like the following (I want to catch them all with the same regex):

M: 360.123.1471
M 360.123.1471
(M): 360.123.1471
(M):360.123.1471
M:360.123.1471

Every prefix can be either lower or upper case.

The reason I need to capture them by the prefix is that I get signatures such as this:

Jane Doe
Ocean Export Agent
Some Company, Inc.
Celebrating 100 years!
p:
410-123-3 123  m: 410-123-1234
a:
111 Cromwell Park Drive, Glen Burnie, MD 21061
w:
website.com   e: [email protected]

And I want to capture the mobile number out of it.

How can I change my regex to solve both cases?

CodePudding user response：

Your regex matches across the newline because \s matches any whitespace character, including line breaks. You can swap each \s for a literal space character, which won't match a newline.

Here's how it looks fixed (each \s replaced with a literal space):

(?<![%_])\b(\ ?\d{1,}[ .-]?\(?\d{1,}\)?[ .-]?\(?\d{1,}\)?[ .-]?\d{1,}[ .-]?\d{1,}[ .-]?\d{3,})

Watch it in action here.
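
As a quick check of the difference, the two patterns can be compared on the two-line example from the question. This is only a sketch: the variable names (original, fixed, spanning, single_line) are made up for illustration, and the fixed pattern is derived by a plain string replacement of \s.

```python
import re

# The pattern from the question; \s also matches '\n'.
original = (r"(?<![%_])\b(\ ?\d{1,}[\s.-]?\(?\d{1,}\)?[\s.-]?\(?\d{1,}\)?"
            r"[\s.-]?\d{1,}[\s.-]?\d{1,}[\s.-]?\d{3,})")
# The same pattern with every \s swapped for a literal space.
fixed = original.replace(r"\s", " ")

text = "Newtown Square, PA 19073\n360.751.1471 Some Other Text Here"

spanning = re.search(original, text).group(1)   # crosses the line break
single_line = re.search(fixed, text).group(1)   # stays on one line

print(repr(spanning))
print(repr(single_line))
```

With the literal space, the ZIP code on the first line no longer has enough digits after it on the same line to satisfy the pattern, so the first match starts on the second line.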

CodePudding user response：

When a regex gets too complex because it tries to match too many cases, it often needs support from an algorithmic approach.

Here, the phone number in its different formats (using a dot or dash as separator) might be a good candidate for a regex.

But once the markers are added (in variations such as prefix and suffix, or upper-case and lower-case), the regex gets more and more complex.

Also, the line breaks in a multi-line text might be hard for a regex to cover.
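
To illustrate how the pattern grows once the markers are folded in, here is a minimal single-regex sketch for the Mobile prefix variants from the question. The name mobile_re and the 3-3-4 digit layout are assumptions made for this illustration; numbers with stray spaces (such as 410-123-3 123) and markers used as a suffix are not covered.

```python
import re

# Assumption: a mobile number is 3-3-4 digits, optionally separated by space, dot or dash.
mobile_re = re.compile(
    r"\(?\bm(?:obile)?\b\)?"            # M, (M), Mobile, (Mobile) as a whole word
    r"[ :]*"                            # optional colon and spaces, but no newline
    r"(\d{3}[ .-]?\d{3}[ .-]?\d{4})",   # the number itself (captured)
    re.IGNORECASE,                      # prefix may be lower or upper case
)

samples = """M: 360.123.1471
M 360.123.1471
(M): 360.123.1471
(M):360.123.1471
M:360.123.1471"""

signature = """Jane Doe
Some Company, Inc.
p:
410-123-3 123  m: 410-123-1234
a:
111 Cromwell Park Drive, Glen Burnie, MD 21061"""

print(mobile_re.findall(samples))
print(mobile_re.findall(signature))
```

The word boundaries keep the "m" in words like "Company" or the "M" in "MD" from being taken as a marker, but every further variation (suffix markers, other labels, irregular spacing) adds another branch to the pattern.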

Use the markers for location, the regex for extraction

When we scan a text for phone numbers, we can use markers (as in your case) to locate them first. In the next step we can extract the phone number with a regex. The phone number is located after the marker if the marker is used as a prefix, or before it if the marker is used as a suffix.

See the following approach, which uses:

a set of marker strings used to narrow down the location
a regex for the phone-number to extract it

import re

text = '''M: 360.751.0001
M 360.751.0002
(M): 360.751.0003
(M):360.751.0004
M:360.751.0005
(Mobile): 360.751.0006
(Mobile):360.751.0007
(Mobile) 360.751.0008
Mobile: 360.751.0009
Mobile:360.751.0010
Mobile 360.751.0011
360.751.0012 Mobile
360.751.0013 (M)
360.751.0014 M'''

email_signature = '''
Jane Doe
Ocean Export Agent
Some Company, Inc.
Celebrating 100 years!
p:
410-123-3 001  m: 410-123-0002
a:
111 Cromwell Park Drive, Glen Burnie, MD 21061
w:
website.com   e: [email protected]
'''

# leading space for a suffix, trailing space for a prefix
phone_markers = {' M', 'M ', 'M:', '(M)', ' Mobile', 'Mobile ', 'Mobile:', '(Mobile)'}

def find_phone_numbers_marked(text, phone_markers):
    phone_numbers = []
    for line in text.split('\n'):
        # collect every marker present in this line together with its position
        found_marked = [(marker, line.find(marker)) for marker in phone_markers
                        if line.find(marker) >= 0]
        for marker, position in found_marked:
            if marker.startswith(' '): 
                text_marked = line[:position]  # text before marker
            else:
                text_marked = line[position:]  # text after marker
            found_numbers = re.findall(r'\d{1,3}[.-]\d{1,3}[.-]\d{1,4}', text_marked)
            print(line, f"Marker '{marker}' at position {position}, found number:", found_numbers)
            phone_numbers.extend(found_numbers)
    return phone_numbers
 

total_lines = len(text.split('\n'))
print(f"== Searching in {total_lines} lines ..")
result = find_phone_numbers_marked(text, phone_markers)
print(f"== Found {len(result)} numbers:", result)

total_lines = len(email_signature.split('\n'))
print(f"== Searching in {total_lines} lines ..")
phone_markers.update([m.lower() for m in phone_markers])  # also include lowercase version of markers
result = find_phone_numbers_marked(email_signature, phone_markers)
print("== Found:", result)

Output:

== Searching in 14 lines ..
M: 360.751.0001 Marker 'M:' at position 0, found number: ['360.751.0001']
M 360.751.0002 Marker 'M ' at position 0, found number: ['360.751.0002']
(M): 360.751.0003 Marker '(M)' at position 0, found number: ['360.751.0003']
(M):360.751.0004 Marker '(M)' at position 0, found number: ['360.751.0004']
M:360.751.0005 Marker 'M:' at position 0, found number: ['360.751.0005']
(Mobile): 360.751.0006 Marker '(Mobile)' at position 0, found number: ['360.751.0006']
(Mobile):360.751.0007 Marker '(Mobile)' at position 0, found number: ['360.751.0007']
(Mobile) 360.751.0008 Marker '(Mobile)' at position 0, found number: ['360.751.0008']
Mobile: 360.751.0009 Marker 'Mobile:' at position 0, found number: ['360.751.0009']
Mobile:360.751.0010 Marker 'Mobile:' at position 0, found number: ['360.751.0010']
Mobile 360.751.0011 Marker 'Mobile ' at position 0, found number: ['360.751.0011']
360.751.0012 Mobile Marker ' Mobile' at position 12, found number: ['360.751.0012']
360.751.0012 Mobile Marker ' M' at position 12, found number: ['360.751.0012']
360.751.0013 (M) Marker '(M)' at position 13, found number: []
360.751.0014 M Marker ' M' at position 12, found number: ['360.751.0014']
== Found 14 numbers: ['360.751.0001', '360.751.0002', '360.751.0003', '360.751.0004', '360.751.0005', '360.751.0006', '360.751.0007', '360.751.0008', '360.751.0009', '360.751.0010', '360.751.0011', '360.751.0012', '360.751.0012', '360.751.0014']
== Searching in 12 lines ..
410-123-3 001  m: 410-123-0002 Marker 'm:' at position 15, found number: ['410-123-0002']
410-123-3 001  m: 410-123-0002 Marker ' m' at position 14, found number: ['410-123-3']
111 Cromwell Park Drive, Glen Burnie, MD 21061 Marker ' M' at position 37, found number: []
website.com   e: [email protected] Marker 'm ' at position 10, found number: []
== Found: ['410-123-0002', '410-123-3']

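This is not yet perfect (360.751.0012 is reported twice, 360.751.0013 with its suffixed (M) is missed, and the fragment 410-123-3 is picked up by mistake), but the markers and the regex can be refined.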
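
One possible refinement, sketched under a few assumptions: the markers are expressed as a small case-insensitive regex instead of a set of strings, prefix and suffix positions get separate patterns, and duplicates are dropped. The names NUMBER, MARKER, PREFIXED, SUFFIXED and find_mobile_numbers are hypothetical, and the stricter 3-3-4 digit layout is an assumption that may not fit every signature.

```python
import re

NUMBER = r"\d{3}[.-]\d{3}[.-]\d{4}"      # assumption: strict 3-3-4 layout
MARKER = r"\(?\b(?:mobile|m)\b\)?"       # M, (M), Mobile, (Mobile) as whole words

# marker before the number, e.g. "M: 360.751.0001"
PREFIXED = re.compile(rf"{MARKER}[ :]*({NUMBER})", re.IGNORECASE)
# marker after the number, e.g. "360.751.0013 (M)"; a marker followed by a
# colon or a word character is rejected, since it belongs to the next item
SUFFIXED = re.compile(rf"({NUMBER}) +{MARKER}(?![\w:])", re.IGNORECASE)

def find_mobile_numbers(text):
    """Return the marked numbers in order of appearance, without duplicates."""
    numbers = []
    for line in text.splitlines():       # line by line, so no match spans lines
        for pattern in (PREFIXED, SUFFIXED):
            for number in pattern.findall(line):
                if number not in numbers:
                    numbers.append(number)
    return numbers

print(find_mobile_numbers("M: 360.751.0001\n360.751.0012 Mobile\n360.751.0013 (M)"))
print(find_mobile_numbers("410-123-3 001  m: 410-123-0002\nGlen Burnie, MD 21061"))
```

Because the marker must be a whole word, "MD" and the "m" inside "website.com" no longer count as markers, and the strict digit layout keeps the fragment 410-123-3 out of the result.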