Cleaning email for text analysis python

I've got some text:

text = From: Mark Twain <[email protected]>
Sent: Wednesday, February 2, 2022 1:33 PM
To: Edgar Allen Poe <[email protected]>
Cc: Edgar Allen Poe1 <[email protected]>
Subject: Request

 

This is the body of the first mail. That is
some further text.


Od: Edgar Allen Poe <[email protected]>
Wysłano: środa, 2 lutego 2022 17:49
Do: Edgar Allen Poe <[email protected]>
DW: Edgar Allen Poe <[email protected]>
Temat: RE: issue 43165

Here is the second email.

Signature. Yours sincerly. 
Best Regards



From: Mark Twain <[email protected]>
Sent: Wednesday, February 2, 2022 1:33 PM
To: Edgar Allen Poe1 <[email protected]>
Subject: Without carbon copy
 

This is the last mail without Cc.
Kind regards

Mark

I'm trying to parse out the individual messages within it. Ultimately I'd like a list or dictionary that holds the From and To fields and the message body of each mail, so that I can run some analysis on them.

The problem: the text contains two languages (English and Polish, e.g. From/Od). How do I have to change the regex so that it recognizes both languages?

The code below is from "Cleaning email chain for text analysis python":

import re
from pprint import pprint

groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)

pprint(emails)

How do I have to change the following part?

groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)

I thought something like the following should work, but only two entries are created:

groups = re.findall(r'^(From:(.*?)|Od:(.*?))(Sent:(.*?)|Wysłano:(.*?))(To:(.*?)|Do:(.*?))(Subject:(.*?)$(.*?)(?=^From:|\Z)|Temat:(.*?)$(.*?)(?=^Od|\Z))', text, flags=re.DOTALL|re.M)

emails = []

for g in groups:
    d = {}
    d['From'] = g[0].strip()
    d['Sent'] = g[1].strip()
    d['To'] = g[2].strip()   
    d['Subject'] = g[3].strip()
    d['Body'] = g[4].strip()
    emails.append(d)

df = pd.DataFrame(emails)
df

CodePudding user response：

Since every header comes in several variants (language and casing, like Dw and DW), I'd suggest listing the variants for each header and joining them with | inside non-capturing groups, e.g. (?:PATTERN).
You also need to add a part for Sent, which is not included in the linked example. Since the Cc is absent in the last mail, I made that an optional group. The fields are referenced through named groups, e.g. (?P<some_name>PATTERN).
Note that the pattern expects one mail per string, so the mails are kept in a list. For compactness I omitted some parts of the mails in the code.

import re


pattern = r'^(?:From|Od):(?P<from>.*?)\n(?:Sent|Wysłano):(?P<sent>.*?)\n(?:To|Do):(?P<to>.*?)\n(?:(?:Cc|Dw|DW):(?P<cc>.*?)\n)?(?:Subject|Temat):(?P<subject>.*?)\n(?P<message>.*)$'

mails = ['''
From: Mark Twain <[email protected]>
Sent: Wednesday, February 2, 2022 1:33 PM
To: Edgar Allen Poe <[email protected]>
Cc: Edgar Allen Poe1 <[email protected]>
Subject: Request

This is the body of the first mail. That is
some further text.
''',
'''
Od: Edgar Allen Poe <[email protected]>
Wysłano: środa, 2 lutego 2022 17:49
...
Best Regards
''',
'''
From: Mark Twain <[email protected]>
...
This is the last mail without Cc.
Kind regards

Mark
''']

for mail in mails:
    matches = re.search(pattern, mail, flags=re.DOTALL|re.M)
    if matches is None:
        # re.search returns None when a mail lacks one of the required headers
        continue
    print('To:', matches.group('to'))
    print('From:', matches.group('from'))
    print('Subject:', matches.group('subject'))
    print('Msg:', matches.group('message')[:50])  # cutoff for demonstration
    print('***')

Output:

To:  Edgar Allen Poe <[email protected]>
From:  Mark Twain <[email protected]>
Subject:  Request
Msg:

This is the body of the first mail. That is
so
***
To:  Edgar Allen Poe <[email protected]>
From:  Edgar Allen Poe <[email protected]>
Subject:  RE: issue 43165
Msg:
Here is the second email.

Signature. Yours since
***
To:  Edgar Allen Poe1 <[email protected]>
From:  Mark Twain <[email protected]>
Subject:  Without carbon copy
Msg:

This is the last mail without Cc.
Kind regards

***
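
The pattern above works on mails that are already split into separate strings, while the question starts from one long text. As a rough sketch (not part of the original answer), the same bilingual alternations can be combined with the lookahead from the question's regex so that re.finditer splits the chain itself. The name split_mails is made up for illustration, and the sketch assumes that every mail lists its headers in the order From, Sent, To, optional Cc, Subject, and that no body line starts with "From:" or "Od:".

```python
import re

# Header keywords in both languages; extend the alternations as needed.
# Assumption: headers always appear in this order, with Cc being optional.
MAIL_PATTERN = re.compile(
    r'^(?:From|Od):(?P<from>[^\n]*)\n'
    r'(?:Sent|Wysłano):(?P<sent>[^\n]*)\n'
    r'(?:To|Do):(?P<to>[^\n]*)\n'
    r'(?:(?:Cc|Dw|DW):(?P<cc>[^\n]*)\n)?'
    r'(?:Subject|Temat):(?P<subject>[^\n]*)\n'
    # The body runs lazily up to the next From/Od header or the end of the text.
    r'(?P<message>.*?)(?=^(?:From|Od):|\Z)',
    flags=re.DOTALL | re.MULTILINE,
)


def split_mails(text):
    """Return one dict per mail found in text (hypothetical helper)."""
    mails = []
    for match in MAIL_PATTERN.finditer(text):
        # An absent optional group (cc) is None, so fall back to ''.
        mails.append({key: (value or '').strip()
                      for key, value in match.groupdict().items()})
    return mails
```

Called on the text from the question, split_mails(text) should return three dictionaries, with an empty 'cc' entry for the last mail.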

CodePudding user response：

Others have given you a regex answer, but I find that a single large regex becomes harder to maintain over time, especially if you want to add more fields to it. Instead of using a regex to match everything in one go, you could parse the text line by line. Here, I use io.StringIO to iterate over the lines of the text.

This approach also has the advantage of not requiring all your fields to be in the same order, which is necessary for the regex approach shown in the other answer.

import io

def parse_emails(text):
    all_emails = []
    current_email = {}
    for line in io.StringIO(text):
        line = line.strip()
        if line.startswith("From: ") or line.startswith("Od: "):
            # From indicates the start of a new message
            # If the current email is not an empty dict, append it to our list
            if current_email: 
                # Before appending, join the lines of the body into a single string
                current_email["body"] = "\n".join(current_email.get("body", [])).strip()
                all_emails.append(current_email)
                current_email = {}
            current_email["from"] = line.split(": ", 1)[1]
        elif line.startswith("To: ") or line.startswith("Do: "):
            current_email["to"] = line.split(": ", 1)[1]
        elif line.startswith("Sent: ") or line.startswith("Wysłano: "):
            current_email["sent"] = line.split(": ", 1)[1]
        elif line.startswith("Cc: ") or line.startswith("DW: "):
            current_email["cc"] = line.split(": ", 1)[1]
        elif line.startswith("Subject: ") or line.startswith("Temat: "):
            current_email["subject"] = line.split(": ", 1)[1]
        # You can add more fields if you need to
        else:
            # Append each line of the body to a list, because a list is cheaper to append 
            # to than creating a whole new string. As seen above, we join
            # the elements of the list into a single string before finalizing this email
            current_email.setdefault("body", []).append(line)


    # After the text has ended, append the last email to the list
    if current_email:
        current_email["body"] = "\n".join(current_email.get("body", [])).strip()
        all_emails.append(current_email)

    return all_emails

To test this, let's use your text:

text = """From: Mark Twain <[email protected]>
Sent: Wednesday, February 2, 2022 1:33 PM
To: Edgar Allen Poe <[email protected]>
Cc: Edgar Allen Poe1 <[email protected]>
Subject: Request

 

This is the body of the first mail. That is
some further text.


Od: Edgar Allen Poe <[email protected]>
Wysłano: środa, 2 lutego 2022 17:49
Do: Edgar Allen Poe <[email protected]>
DW: Edgar Allen Poe <[email protected]>
Temat: RE: issue 43165

Here is the second email.

Signature. Yours sincerly. 
Best Regards



From: Mark Twain <[email protected]>
Sent: Wednesday, February 2, 2022 1:33 PM
To: Edgar Allen Poe1 <[email protected]>
Subject: Without carbon copy
 

This is the last mail without Cc.
Kind regards

Mark"""

d = parse_emails(text)

print(d)

Which gives us our three emails, parsed into dictionaries with the correct keys:

[{'from': 'Mark Twain <[email protected]>',
  'sent': 'Wednesday, February 2, 2022 1:33 PM',
  'to': 'Edgar Allen Poe <[email protected]>',
  'cc': 'Edgar Allen Poe1 <[email protected]>',
  'subject': 'Request',
  'body': 'This is the body of the first mail. That is\nsome further text.'},
 {'from': 'Edgar Allen Poe <[email protected]>',
  'sent': 'środa, 2 lutego 2022 17:49',
  'to': 'Edgar Allen Poe <[email protected]>',
  'cc': 'Edgar Allen Poe <[email protected]>',
  'subject': 'RE: issue 43165',
  'body': 'Here is the second email.\n\nSignature. Yours sincerly.\nBest Regards'},
 {'from': 'Mark Twain <[email protected]>',
  'sent': 'Wednesday, February 2, 2022 1:33 PM',
  'to': 'Edgar Allen Poe1 <[email protected]>',
  'subject': 'Without carbon copy',
  'body': 'This is the last mail without Cc.\nKind regards\n\nMark'}]
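
If more languages or header spellings turn up, the chain of startswith checks in parse_emails can get long. One possible variation, sketched here under my own assumptions rather than taken from the code above, keeps the header labels in a lookup table and compares them case-insensitively, so that Dw and DW are both covered by one entry. HEADER_ALIASES and parse_emails_table are hypothetical names. Unlike parse_emails, this sketch skips any text that appears before the first From/Od line.

```python
# Hypothetical alias table: lowercased header label -> key in the result dict.
# Adding a language only means adding entries here.
HEADER_ALIASES = {
    "from": "from", "od": "from",
    "sent": "sent", "wysłano": "sent",
    "to": "to", "do": "to",
    "cc": "cc", "dw": "cc",
    "subject": "subject", "temat": "subject",
}


def _finish(email):
    # Join the collected body lines into a single string.
    email["body"] = "\n".join(email["body"]).strip()
    return email


def parse_emails_table(text):
    all_emails = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Split at the first colon only, so "RE: issue" stays in the value.
        label, sep, value = line.partition(":")
        key = HEADER_ALIASES.get(label.strip().lower()) if sep else None
        if key == "from":
            # A From/Od line starts a new mail; store the previous one first.
            if current is not None:
                all_emails.append(_finish(current))
            current = {"from": value.strip(), "body": []}
        elif current is None:
            continue  # ignore anything before the first From/Od line
        elif key is not None:
            current[key] = value.strip()
        else:
            current["body"].append(line)
    if current is not None:
        all_emails.append(_finish(current))
    return all_emails
```

As with parse_emails, a body line that happens to start with one of the known labels followed by a colon (for example "To: do list") would be treated as a header, so this is only suitable when the bodies do not contain such lines.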