Python Regex to Extract Email Information

Time:10-04

I have the data below. From it, I want to retrieve only the message body and remove everything related to the “Forwarded” header.

---------------------- Forwarded by Phillip K Allen/HOU/ECT on 03/21/2000 
01:24 PM ---------------------------
 
 
Stephane Brodeur
03/16/2000 07:06 AM
To: Phillip K Allen/HOU/ECT@ECT
cc:  
Subject: Maps
 
As requested by John, here's the map and the forecast...
Call me if you have any questions (403) 974-6756.

What I have tried so far is the regular expression below:

matchObjj = re.search(r'(---.*?)Subject:', tmp_text, re.DOTALL)

When I print everything after the match with the command below:

print( tmp_text[matchObjj.span()[1]:])

I get the following output:

Maps
 
As requested by John, here's the map and the forecast...
Call me if you have any questions (403) 974-6756.

So the issue is that the regex does not strip the complete “Subject:” line. Only the header name Subject: is removed, while the actual subject text (“Maps” in this case) is still there. I want the regex to match up to the end of the Subject line so that the whole line is removed. Please share your thoughts.

CodePudding user response:

The simplest way should be to change your regex to this:

r'(---.*?)Subject:[^\n]*\n'

This will make your match extend all the way to the next newline, making the end of its span the start of the next line.
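As a rough, self-contained sketch of that change (the sample text is copied from the question; using match.end() instead of span()[1] and the fallback for a missing match are my own additions):

```python
import re

tmp_text = (
    "---------------------- Forwarded by Phillip K Allen/HOU/ECT on 03/21/2000 \n"
    "01:24 PM ---------------------------\n"
    " \n"
    " \n"
    "Stephane Brodeur\n"
    "03/16/2000 07:06 AM\n"
    "To: Phillip K Allen/HOU/ECT@ECT\n"
    "cc:  \n"
    "Subject: Maps\n"
    " \n"
    "As requested by John, here's the map and the forecast...\n"
    "Call me if you have any questions (403) 974-6756.\n"
)

# [^\n]*\n consumes the rest of the Subject line, including its newline
match = re.search(r'(---.*?)Subject:[^\n]*\n', tmp_text, re.DOTALL)

# Fall back to the full text if no forwarded header is found
body = tmp_text[match.end():] if match else tmp_text
print(body)
```

Note that the nearly empty line after the Subject line is still part of body here; calling body.strip() is one way to drop it.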

CodePudding user response:

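You can do this without regex by splitting the text into a list of lines with splitlines and slicing this list starting two lines after the Subject line (which skips the Subject line itself and the blank line that follows it):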

text = '''---------------------- Forwarded by Phillip K Allen/HOU/ECT on 03/21/2000 
01:24 PM ---------------------------
 
 
Stephane Brodeur
03/16/2000 07:06 AM
To: Phillip K Allen/HOU/ECT@ECT
cc:  
Subject: Maps
 
As requested by John, here's the map and the forecast...
Call me if you have any questions'''

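data = text.splitlines()
# Index of the first line that starts with the Subject header
slice_idx = [i for i, s in enumerate(data) if s.startswith('Subject: ')][0]
# Skip the Subject line and the blank line after it
body = '\n'.join(data[slice_idx + 2:])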

output:

As requested by John, here's the map and the forecast...
Call me if you have any questions

CodePudding user response:

There may be extra whitespace after the Subject line, such as spaces, blank lines, or a tab (\t) in your case. You can consume it by also matching one or more whitespace characters after the line, e.g.

regexEquation = r"(---.*?)Subject:[^\n]*(\s)+"

You can get help for matching more spaces from here or here.

output:

As requested by John, here's the map and the forecast...
Call me if you have any questions (403) 974-6756.
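For reference, a small sketch of how the whitespace-consuming idea could be used with re.sub to delete the forwarded header in place. The names HEADER and strip_forward_header are hypothetical, and I use -{3,} and \s* here rather than the exact pattern above, so treat it as a variation under those assumptions:

```python
import re

# Three or more dashes, then (lazily) everything up to the Subject line,
# then the rest of that line and any whitespace that follows it
HEADER = re.compile(r'-{3,}.*?Subject:[^\n]*\s*', re.DOTALL)


def strip_forward_header(text):
    """Remove the first forwarded-message header block, if there is one."""
    return HEADER.sub('', text, count=1)


sample = "hi\n--- Forwarded by X ---\nTo: Y\nSubject: Maps\n \nBody line\n"
print(strip_forward_header(sample))
```

Any text before the dashed line is kept, and text without a matching header is returned unchanged.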