I've got some text:
text = From: Mark Twain <[email protected]>
Sent: Wednesday, February 2, 2022 1:33 PM
To: Edgar Allen Poe <[email protected]>
Cc: Edgar Allen Poe1 <[email protected]>
Subject: Request
This is the body of the first mail. That is
some further text.
Od: Edgar Allen Poe <[email protected]>
Wysłano: środa, 2 lutego 2022 17:49
Do: Edgar Allen Poe <[email protected]>
DW: Edgar Allen Poe <[email protected]>
Temat: RE: issue 43165
Here is the second email.
Signature. Yours sincerly.
Best Regards
From: Mark Twain <[email protected]>
Sent: Wednesday, February 2, 2022 1:33 PM
To: Edgar Allen Poe1 <[email protected]>
Subject: Without carbon copy
This is the last mail without Cc.
Kind regards
Mark
I'm trying to parse out the individual messages. Ultimately I'd like a list or dictionary holding the From and To fields plus the message body, so that I can run some analysis on it.
The problem: the text mixes two languages (English and Polish, e.g. From/Od). How do I have to change the regex so that it recognizes both languages?
The code below is from "Cleaning email chain for text analysis python":
import re
from pprint import pprint
groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)
pprint(emails)
How do I have to change the following part?
groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text,
I thought something like the following should work, but only two entries are created:
groups = re.findall(r'^(From:(.*?)|Od:(.*?))(Sent:(.*?)|Wysłano:(.*?))(To:(.*?)|Do:(.*?))(Subject:(.*?)$(.*?)(?=^From:|\Z)|Temat:(.*?)$(.*?)(?=^Od|\Z))', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    d = {}
    d['From'] = g[0].strip()
    d['Sent'] = g[1].strip()
    d['To'] = g[2].strip()
    d['Subject'] = g[3].strip()
    d['Body'] = g[4].strip()
    emails.append(d)
df = pd.DataFrame(emails)
df
CodePudding user response:
Since every header comes in several variants (language and casing, like Dw and DW), I'd suggest listing the variants for each header and joining them with | inside a non-capturing group (e.g. (?:PATTERN)).
You also need to add a part for Sent, which is not included in the linked example. Since the Cc is absent from the last mail, I made it an optional group. The values are referenced via named groups (e.g. (?P<some_name>PATTERN)).
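As a minimal sketch of the "one list per header" idea, the pattern can be assembled from alias lists instead of being typed out by hand. The names ALIASES, header and HEADER_PATTERN are made up for this illustration, and the example.com addresses are placeholders; only the header part is matched here, not the body.

```python
import re

# Hypothetical alias lists; extend them with other languages or casings as needed.
ALIASES = {
    "from": ["From", "Od"],
    "sent": ["Sent", "Wysłano"],
    "to": ["To", "Do"],
    "cc": ["Cc", "Dw", "DW"],
    "subject": ["Subject", "Temat"],
}


def header(name, optional=False):
    """Return a regex fragment that matches one header line, e.g. 'From: ...'."""
    keywords = "|".join(re.escape(alias) for alias in ALIASES[name])
    # (?:...) groups the alternatives without capturing them;
    # (?P<name>...) captures the header value under a name.
    fragment = rf"(?:{keywords}):[ \t]*(?P<{name}>[^\n]*)\n"
    return f"(?:{fragment})?" if optional else fragment


HEADER_PATTERN = re.compile(
    "^" + header("from") + header("sent") + header("to")
    + header("cc", optional=True) + header("subject"),
    flags=re.M,
)

sample = (
    "Od: Edgar Allen Poe <poe@example.com>\n"
    "Wysłano: środa, 2 lutego 2022 17:49\n"
    "Do: Mark Twain <twain@example.com>\n"
    "Temat: RE: issue 43165\n"
    "Here is the second email.\n"
)
match = HEADER_PATTERN.search(sample)
print(match.groupdict())
```

Because the Cc fragment is wrapped in an optional group, its named group is simply None when the line is missing. Adding a third language then only means adding entries to the lists.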
For compactness I omitted some parts of the mails in the code.
import re
pattern = r'^(?:From|Od):(?P<from>.*?)\n(?:Sent|Wysłano):(?P<sent>.*?)\n(?:To|Do):(?P<to>.*?)\n(?:(?:Cc|Dw|DW):(?P<cc>.*?)\n)?(?:Subject|Temat):(?P<subject>.*?)\n(?P<message>.*)$'
mails = ['''
From: Mark Twain <[email protected]>
Sent: Wednesday, February 2, 2022 1:33 PM
To: Edgar Allen Poe <[email protected]>
Cc: Edgar Allen Poe1 <[email protected]>
Subject: Request
This is the body of the first mail. That is
some further text.
''',
'''
Od: Edgar Allen Poe <[email protected]>
Wysłano: środa, 2 lutego 2022 17:49
...
Best Regards
''',
'''
From: Mark Twain <[email protected]>
...
This is the last mail without Cc.
Kind regards
Mark
''']
for mail in mails:
    matches = re.search(pattern, mail, flags=re.DOTALL|re.M)
    print('To:', matches.group('to'))
    print('From:', matches.group('from'))
    print('Subject:', matches.group('subject'))
    print('Msg:', matches.group('message')[:50]) # cutoff for demonstration
    print('***')
Output:
To: Edgar Allen Poe <[email protected]>
From: Mark Twain <[email protected]>
Subject: Request
Msg:
This is the body of the first mail. That is
so
***
To: Edgar Allen Poe <[email protected]>
From: Edgar Allen Poe <[email protected]>
Subject: RE: issue 43165
Msg:
Here is the second email.
Signature. Yours since
***
To: Edgar Allen Poe1 <[email protected]>
From: Mark Twain <[email protected]>
Subject: Without carbon copy
Msg:
This is the last mail without Cc.
Kind regards
***
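The code above works on mails that are already split into separate strings. If the chain is still one string, as in the question, one possible way (a sketch, not tested against real mailbox exports) is to make the message part lazy and stop it with a lookahead at the next From:/Od: line, then walk the text with re.finditer. The name CHAIN_PATTERN and the example.com addresses are placeholders for this illustration.

```python
import re

# Same alternations as in the pattern above, but the body is lazy and ends
# at the next "From:"/"Od:" line or at the end of the text.
CHAIN_PATTERN = re.compile(
    r"^(?:From|Od):(?P<from>.*?)\n"
    r"(?:Sent|Wysłano):(?P<sent>.*?)\n"
    r"(?:To|Do):(?P<to>.*?)\n"
    r"(?:(?:Cc|Dw|DW):(?P<cc>.*?)\n)?"
    r"(?:Subject|Temat):(?P<subject>.*?)\n"
    r"(?P<message>.*?)(?=^(?:From|Od):|\Z)",
    flags=re.DOTALL | re.M,
)

text = (
    "From: Mark Twain <twain@example.com>\n"
    "Sent: Wednesday, February 2, 2022 1:33 PM\n"
    "To: Edgar Allen Poe <poe@example.com>\n"
    "Cc: Edgar Allen Poe1 <poe1@example.com>\n"
    "Subject: Request\n"
    "This is the body of the first mail.\n"
    "Od: Edgar Allen Poe <poe@example.com>\n"
    "Wysłano: środa, 2 lutego 2022 17:49\n"
    "Do: Mark Twain <twain@example.com>\n"
    "DW: Edgar Allen Poe1 <poe1@example.com>\n"
    "Temat: RE: issue 43165\n"
    "Here is the second email.\n"
    "From: Mark Twain <twain@example.com>\n"
    "Sent: Wednesday, February 2, 2022 1:33 PM\n"
    "To: Edgar Allen Poe1 <poe1@example.com>\n"
    "Subject: Without carbon copy\n"
    "This is the last mail without Cc.\n"
)

# groupdict() returns None for the optional cc group when it did not match,
# so fall back to an empty string before stripping.
emails = [
    {key: (value or "").strip() for key, value in m.groupdict().items()}
    for m in CHAIN_PATTERN.finditer(text)
]
for email in emails:
    print(email["from"], "->", email["to"], "|", email["subject"])
```

Note that this still relies on the headers appearing in the order From, Sent, To, (Cc), Subject, and a body line that itself starts with From: or Od: would be treated as the start of a new mail.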
CodePudding user response:
Others have given you a regex answer, but I find that maintaining a single large regex becomes harder over time, especially if you want to add more fields to it.
Instead of using a regex to match everything in one go, you could parse the text line by line. Here, I use io.StringIO to iterate over the lines of the text.
This approach also has the advantage of not requiring the fields to appear in a fixed order, which the regex approach shown in the other answer does require.
import io
def parse_emails(text):
    all_emails = []
    current_email = {}
    for line in io.StringIO(text):
        line = line.strip()
        if line.startswith("From: ") or line.startswith("Od: "):
            # From indicates the start of a new message
            # If the current email is not an empty dict, append it to our list
            if current_email:
                # Before appending, join the lines of the body into a single string
                # (.get avoids a KeyError if an email has no body lines at all)
                current_email["body"] = "\n".join(current_email.get("body", [])).strip()
                all_emails.append(current_email)
            current_email = {}
            current_email["from"] = line.split(": ", 1)[1]
        elif line.startswith("To: ") or line.startswith("Do: "):
            current_email["to"] = line.split(": ", 1)[1]
        elif line.startswith("Sent: ") or line.startswith("Wysłano: "):
            current_email["sent"] = line.split(": ", 1)[1]
        elif line.startswith("Cc: ") or line.startswith("DW: "):
            current_email["cc"] = line.split(": ", 1)[1]
        elif line.startswith("Subject: ") or line.startswith("Temat: "):
            current_email["subject"] = line.split(": ", 1)[1]
        # You can add more fields if you need to
        else:
            # Append each line of the body to a list, because a list is cheaper to append
            # to than creating a whole new string. As seen above, we join
            # the elements of the list into a single string before finalizing this email
            current_email.setdefault("body", []).append(line)
    # After the text has ended, append the last email to the list
    if current_email:
        current_email["body"] = "\n".join(current_email.get("body", [])).strip()
        all_emails.append(current_email)
    return all_emails
To test this, let's use your text:
text = """From: Mark Twain <[email protected]>
Sent: Wednesday, February 2, 2022 1:33 PM
To: Edgar Allen Poe <[email protected]>
Cc: Edgar Allen Poe1 <[email protected]>
Subject: Request
This is the body of the first mail. That is
some further text.
Od: Edgar Allen Poe <[email protected]>
Wysłano: środa, 2 lutego 2022 17:49
Do: Edgar Allen Poe <[email protected]>
DW: Edgar Allen Poe <[email protected]>
Temat: RE: issue 43165
Here is the second email.
Signature. Yours sincerly.
Best Regards
From: Mark Twain <[email protected]>
Sent: Wednesday, February 2, 2022 1:33 PM
To: Edgar Allen Poe1 <[email protected]>
Subject: Without carbon copy
This is the last mail without Cc.
Kind regards
Mark"""
d = parse_emails(text)
print(d)
Which gives us our three emails, parsed into dictionaries with the correct keys:
[{'from': 'Mark Twain <[email protected]>',
'sent': 'Wednesday, February 2, 2022 1:33 PM',
'to': 'Edgar Allen Poe <[email protected]>',
'cc': 'Edgar Allen Poe1 <[email protected]>',
'subject': 'Request',
'body': 'This is the body of the first mail. That is\nsome further text.'},
{'from': 'Edgar Allen Poe <[email protected]>',
'sent': 'środa, 2 lutego 2022 17:49',
'to': 'Edgar Allen Poe <[email protected]>',
'cc': 'Edgar Allen Poe <[email protected]>',
'subject': 'RE: issue 43165',
'body': 'Here is the second email.\n\nSignature. Yours sincerly.\nBest Regards'},
{'from': 'Mark Twain <[email protected]>',
'sent': 'Wednesday, February 2, 2022 1:33 PM',
'to': 'Edgar Allen Poe1 <[email protected]>',
'subject': 'Without carbon copy',
'body': 'This is the last mail without Cc.\nKind regards\n\nMark'}]
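If more languages or casings (such as Dw next to DW) show up later, the chain of startswith checks can be replaced by a lookup table. The following is only a sketch of that idea: HEADER_KEYS and parse_emails_table are hypothetical names, the example.com addresses are placeholders, and unlike the function above it deliberately stops looking for headers once the body of a mail has started.

```python
# Hypothetical lookup table: lower-cased header keyword -> dictionary key.
HEADER_KEYS = {
    "from": "from", "od": "from",
    "sent": "sent", "wysłano": "sent",
    "to": "to", "do": "to",
    "cc": "cc", "dw": "cc",
    "subject": "subject", "temat": "subject",
}


def parse_emails_table(text):
    """Variant of the line-by-line parser that is driven by HEADER_KEYS."""
    all_emails = []
    current = None
    for line in text.splitlines():
        keyword, sep, value = line.partition(":")
        key = HEADER_KEYS.get(keyword.strip().lower()) if sep else None
        if key == "from":
            # A From/Od line starts a new message.
            current = {"from": value.strip(), "body": []}
            all_emails.append(current)
        elif current is None:
            continue  # ignore anything before the first From/Od line
        elif key and key not in current and not current["body"]:
            # Header lines only count before the body has started.
            current[key] = value.strip()
        else:
            current["body"].append(line.strip())
    # Join the collected body lines into a single string per email.
    for email in all_emails:
        email["body"] = "\n".join(email["body"]).strip()
    return all_emails


sample = (
    "From: Mark Twain <twain@example.com>\n"
    "Sent: Wednesday, February 2, 2022 1:33 PM\n"
    "To: Edgar Allen Poe <poe@example.com>\n"
    "Subject: Request\n"
    "First body.\n"
    "To: be or not to be\n"
    "Od: Edgar Allen Poe <poe@example.com>\n"
    "Wysłano: środa, 2 lutego 2022 17:49\n"
    "Do: Mark Twain <twain@example.com>\n"
    "Dw: Edgar Allen Poe1 <poe1@example.com>\n"
    "Temat: RE: issue 43165\n"
    "Second body.\n"
)
parsed = parse_emails_table(sample)
print(len(parsed), "emails parsed")
```

Matching on the lower-cased keyword handles Dw, DW and dw with one table entry, and supporting another language only requires new entries in HEADER_KEYS. A body line that starts with From: or Od: would still be mistaken for a new mail, as in the original function.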