Good Morning,
I have downloaded an *.eml file from Gmail and want to extract the content of the email as text.
I used the following code:
import email
from email import policy
from email.parser import BytesParser
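# Raw string so the backslashes in the Windows path are not treated as escapes
filepath = r'Project\Data\Your GrabPay Wallet Statement for 15 Feb 2022.eml'
with open(filepath, 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)
# preferencelist should be a tuple; ('plain') without a comma is just a string
text = msg.get_body(preferencelist=('plain',)).get_content()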
I am unable to extract the content of the email. The length of text is 0.
When I attempted to open the *.eml using Word/Outlook, I could see the content.
When I use a normal file handler to open it:
fhandle = open(filepath)
print(fhandle)
print(fhandle.read())
I get
<_io.TextIOWrapper name='Project\Data\Your GrabPay Wallet Statement for 15 Feb 2022.eml' mode='r' encoding='cp1252'>
And the contents look something like the one below:
Content-Transfer-Encoding: base64
Content-Type: text/html; charset=UTF-8
PCFET0NUWVBFIGh0bWwgUFVCTElDICItLy9XM0MvL0RURCBYSFRNTCAxLjAgVHJhbnNpdGlvbmFs
Ly9FTiIgImh0dHA6Ly93d3cudzMub3JnL1RSL3hodG1sMS9EVEQveGh0bWwxLXRyYW5zaXRpb25h
bC5kdGQiPgo8aHRtbCB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMTk5OS94aHRtbCI+CjxoZWFk
I might have underestimated the amount of code needed to extract the email body from an *.eml file in Python.
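As a quick check, decoding the first base64 line of the dump above with the standard library suggests that the body is an HTML document rather than plain text. This is only a sketch on that single line, not on the full file:

```python
import base64

# First base64 line copied from the dump above.
first_line = (
    'PCFET0NUWVBFIGh0bWwgUFVCTElDICItLy9XM0MvL0RURCBYSFRNTCAxLjAgVHJhbnNpdGlvbmFs'
)

# The part headers declare charset=UTF-8, so decode the bytes as UTF-8.
decoded = base64.b64decode(first_line).decode('utf-8')
print(decoded)
```

The decoded text starts with an HTML doctype declaration, which matches the `Content-Type: text/html` header shown above.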
CodePudding user response:
I do not have access to your email, but I was able to extract the text from an email that I downloaded myself as a .eml file from Google.
import email
with open('email.eml') as email_file:
    email_message = email.message_from_file(email_file)
print(email_message.get_payload())
When working with files, consider using a context manager, as I did in my example: it ensures the file handle is closed as soon as the block exits, even if an error occurs.
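A minimal sketch of that behaviour, assuming a throwaway temporary file (the file and its contents are placeholders, not your email):

```python
import os
import tempfile

# Create a small placeholder file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix='.eml')
with os.fdopen(fd, 'w') as tmp:
    tmp.write('Subject: demo\n\nbody\n')

with open(path) as email_file:
    first_line = email_file.readline()

# Outside the with block the handle has already been closed.
print(email_file.closed)

os.remove(path)  # clean up the placeholder file
```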
I briefly read over https://docs.python.org/3/library/email.parser.html for additional information on how to achieve the intended goal.
CodePudding user response:
I realised the email is multipart, so I need to get to the specific part and decode it. Doing so returns a chunk of HTML. To strip the HTML markup and get plain text, I used html2text.
import email
import html2text
filepath = r'Project\Data\Your GrabPay Wallet Statement for 15 Feb 2022.eml'
with open(filepath) as email_file:
    email_message = email.message_from_file(email_file)
# walk() also yields the message itself, so this works for non-multipart emails too
for part in email_message.walk():
    # Skip multipart containers and non-text parts such as attachments
    if part.get_content_maintype() != 'text':
        continue
    # decode=True undoes the base64/quoted-printable encoding and returns bytes
    payload = part.get_payload(decode=True)
    # Decode the bytes properly instead of calling str() on them
    message = payload.decode(part.get_content_charset() or 'utf-8', errors='replace')
    plain_message = html2text.html2text(message)
    print(plain_message)
    print()
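For comparison, here is a rough sketch of the same idea using the newer `policy.default` API from my original attempt. I am assuming the reason `get_body` gave me nothing useful is that it was asked only for a plain-text part, so the sketch falls back to HTML. `extract_body` and the sample message are hypothetical names made up for illustration:

```python
from email import policy
from email.parser import BytesParser


def extract_body(raw_bytes):
    """Return (content_type, text) for the best body part, or (None, '')."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    # Prefer a plain-text part, but fall back to HTML when none exists.
    body = msg.get_body(preferencelist=('plain', 'html'))
    if body is None:
        return None, ''
    # get_content() undoes the transfer encoding and decodes the charset.
    return body.get_content_type(), body.get_content()


# Hypothetical HTML-only message whose base64 body decodes to "<p>Hello</p>".
sample = (
    b'Content-Type: text/html; charset=UTF-8\r\n'
    b'Content-Transfer-Encoding: base64\r\n'
    b'\r\n'
    b'PHA+SGVsbG88L3A+\r\n'
)
content_type, content = extract_body(sample)
print(content_type)
print(content)
```

For a real file, the bytes would come from reading the *.eml opened in 'rb' mode, and when the returned type is text/html the content can be passed to html2text.html2text as in the code above.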