Use regex to extract recipient and sender from an email text in Python

Time:01-25

I am learning regular expressions and am getting pretty frustrated with this. I have the following text:

From: sender name
To: the recepient
Subject: well done!
Body: lorem ipsum lorem ipsum

I am trying to extract the text in the lines "From" and "To". I wrote the following regex:

(^From: [a-zA-Z]*) |(^To: [a-zA-Z]*) |(^Subject: [a-zA-Z])

and I'm matching it using this code:

regex = re.compile(pattern, flags=re.IGNORECASE | re.MULTILINE)
result = regex.match(text).groups() 

but this only matches the first line. I can't figure out what's wrong, nor do I seem to understand how to write regular expressions correctly.

CodePudding user response:

Trying to stay close to your approach, the pattern ^From: ([ a-zA-Z]*)\nTo: ([ a-zA-Z]*) results in:

>>> result
('sender name', 'the recepient')

Now, why doesn't your pattern work?

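  1. (^From: [a-zA-Z]*) can never capture the full sender name, because [a-zA-Z] does not allow whitespace: it stops at the first space and captures only From: sender.
  2. With the A|B pattern the engine matches either A or B, and re.match returns only a single match anchored at the start of the string, so it never goes on to look for your To: pattern after matching From:.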
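
To see both points in action, here is a small sketch that runs the pattern from the question against the sample text, first with re.match and then with re.finditer. The variable names first and all_matches are only for illustration.

```python
import re

text = ("From: sender name\n"
        "To: the recepient\n"
        "Subject: well done!\n"
        "Body: lorem ipsum lorem ipsum")

# The pattern from the question, including the space before each "|".
pattern = r"(^From: [a-zA-Z]*) |(^To: [a-zA-Z]*) |(^Subject: [a-zA-Z])"
regex = re.compile(pattern, flags=re.IGNORECASE | re.MULTILINE)

# re.match tries only at position 0 and returns a single match object.
first = regex.match(text).groups()
print(first)

# finditer keeps scanning, so each alternative gets its turn on its own line.
all_matches = [m.group() for m in regex.finditer(text)]
print(all_matches)
```

re.match fills only the first group, with 'From: sender', and leaves the other two as None. finditer does reach all three lines, but each match is still cut short at the first space (or, for Subject, after a single letter).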

CodePudding user response:

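You are using an alternation (|), which matches only one of the alternatives, and re.match returns a single match at the start of the string.

Also, the character class [a-zA-Z]* does not match spaces (and, because of the *, can match nothing at all), while [a-zA-Z] without a quantifier matches only a single character.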

You can use 2 capture groups with a newline in between, and match From: and To: followed by the rest of the line.

import re

text = ("From: sender name\n"
        "To: the recepient\n"
        "Subject: well done!\n"
        "Body: lorem ipsum lorem ipsum")
regex = re.compile(r"^(From: .*)\n(To: .*)", flags=re.IGNORECASE | re.MULTILINE)
print(regex.match(text).groups()) 

Output

('From: sender name', 'To: the recepient')
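
If you only want the values without the From: and To: prefixes, one possible variation is to move the capture groups behind the prefixes. The sketch below assumes the To: line directly follows the From: line, as in the sample text; the group names sender and recipient are my own choice.

```python
import re

text = ("From: sender name\n"
        "To: the recepient\n"
        "Subject: well done!\n"
        "Body: lorem ipsum lorem ipsum")

# The named groups start after "From: " / "To: ", so only the values are captured.
header_re = re.compile(r"^From: (?P<sender>.*)\nTo: (?P<recipient>.*)",
                       flags=re.IGNORECASE | re.MULTILINE)

# search (unlike match) also finds the pair if other lines come first.
match = header_re.search(text)
if match is None:
    raise ValueError("no From/To header pair found")

sender = match.group("sender")
recipient = match.group("recipient")
print(sender, "|", recipient)
```

Checking for None before calling group() avoids an AttributeError when the text contains no such header pair.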

CodePudding user response:

Your regex needs some work and there are multiple ways to get results, but if you want to keep a similar structure using the alternation character "|", here is a good start:

import re

example_text = """
From: sender name
To: the recepient
Subject: well done!
Body: lorem ipsum lorem ipsum
"""

pattern = re.compile(r'^From: (.+)|^To: (.+)|^Subject: (.+)', re.MULTILINE)
for match in pattern.finditer(example_text):
    print(match.group())

This will output:

From: sender name
To: the recepient
Subject: well done!

But you need to get an idea of expected input. Will there be spaces? What if there is no subject? I'll leave it to you to figure out what's best.
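
As one possible way to handle those cases, here is a minimal sketch that collects whichever of the three headers are present into a dict and trims extra spaces around the values. The helper name extract_headers and the sample string no_subject are made up for illustration, and it assumes each header sits on a single line.

```python
import re

# One header per line: name, colon, optional spaces/tabs, then the value.
# [ \t]* is used instead of \s* so the match never crosses a newline.
HEADER_RE = re.compile(r"^(From|To|Subject):[ \t]*(.*?)[ \t]*$",
                       flags=re.IGNORECASE | re.MULTILINE)

def extract_headers(text):
    """Return a dict mapping lowercased header names to their trimmed values."""
    return {m.group(1).lower(): m.group(2) for m in HEADER_RE.finditer(text)}

# Hypothetical input with no Subject line and stray spaces around the recipient.
no_subject = "From: sender name\nTo:   the recepient  \nBody: lorem ipsum"
headers = extract_headers(no_subject)
print(headers)
print(headers.get("subject", "(no subject)"))
```

A missing header simply does not show up in the dict, so dict.get with a default covers the "no subject" case. If a header appears more than once, the last occurrence wins in this sketch.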
