Extracting Text from a Gmail .eml File Using Python

Good Morning,

I have downloaded an *.eml file from Gmail and want to extract the content of the email as text.

I used the following code:

import email
from email import policy
from email.parser import BytesParser

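# Raw string, so the backslashes in the Windows path are not read as escapes
filepath = r'Project\Data\Your GrabPay Wallet Statement for 15 Feb 2022.eml'

with open(filepath, 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)
# ('plain',) with a trailing comma is a tuple; ('plain') is just the string 'plain'
text = msg.get_body(preferencelist=('plain',)).get_content()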

I am unable to extract the content of the email. The length of text is 0.

When I attempted to open the *.eml using Word/Outlook, I could see the content.

When I use a normal file handler to open it:

fhandle = open(filepath)
print(fhandle)
print(fhandle.read())

I get

<_io.TextIOWrapper name='Project\Data\Your GrabPay Wallet Statement for 15 Feb 2022.eml' mode='r' encoding='cp1252'>

The contents look something like this:

Content-Transfer-Encoding: base64
Content-Type: text/html; charset=UTF-8

PCFET0NUWVBFIGh0bWwgUFVCTElDICItLy9XM0MvL0RURCBYSFRNTCAxLjAgVHJhbnNpdGlvbmFs
Ly9FTiIgImh0dHA6Ly93d3cudzMub3JnL1RSL3hodG1sMS9EVEQveGh0bWwxLXRyYW5zaXRpb25h
bC5kdGQiPgo8aHRtbCB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMTk5OS94aHRtbCI+CjxoZWFk
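
The Content-Transfer-Encoding header above says the body is base64, so what looks like noise is encoded HTML. As a quick sanity check, the first line of the body can be decoded by hand with the standard library. This is only a sketch: the variable names are illustrative, and it assumes the line was copied without any characters being lost.

```python
import base64

# First base64 line of the body, copied from the output above
first_line = "PCFET0NUWVBFIGh0bWwgUFVCTElDICItLy9XM0MvL0RURCBYSFRNTCAxLjAgVHJhbnNpdGlvbmFs"

# The Content-Type header declares charset=UTF-8, so decode the bytes as UTF-8
decoded = base64.b64decode(first_line).decode("utf-8")
print(decoded)
```

The result should be the start of an HTML doctype declaration, which suggests the message body is HTML rather than plain text.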

I might have underestimated the amount of code needed to extract the email body from an .eml file in Python.

CodePudding user response：

I do not have access to your email, but I was able to extract text from an email that I downloaded myself as a .eml file from Google.

import email

with open('email.eml') as email_file:
    email_message = email.message_from_file(email_file)

print(email_message.get_payload())

When working with files, consider using a context manager, as in the example above: it ensures the file handle is closed when it is no longer needed.

I briefly read over https://docs.python.org/3/library/email.parser.html for additional information on how to achieve the intended goal.
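
One caveat: for a multipart message, get_payload() returns a list of sub-messages rather than text, and for an encoded part it returns the still-encoded string. It can therefore help to list the parts before extracting anything. The following is a minimal sketch of that idea, not a tested recipe for your file: describe_parts is a hypothetical helper name, and the sample bytes are a made-up stand-in for a real .eml.

```python
from email import message_from_bytes, policy


def describe_parts(raw_bytes):
    """Return (content type, transfer encoding) for every part of the message."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    return [
        (part.get_content_type(), str(part.get("Content-Transfer-Encoding", "")))
        for part in msg.walk()
    ]


# Hypothetical stand-in for the bytes of a downloaded .eml file
sample = (
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"PHA+SGVsbG88L3A+\r\n"
    b"--b1--\r\n"
)

for content_type, encoding in describe_parts(sample):
    print(content_type, encoding)
```

For a real file, the bytes could be read with open(filepath, 'rb') and passed in the same way. If the listing shows only text/html parts, asking for a plain-text body will not find one.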

CodePudding user response：

I realised the email is multipart, so I need to get to the specific part and decode it. Doing so returns a chunk of HTML. To strip the HTML and get plain text, I used html2text.

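import email
import html2text

# Raw string, so the backslashes in the Windows path are not read as escapes
filepath = r'Project\Data\Your GrabPay Wallet Statement for 15 Feb 2022.eml'

# Read bytes so the email parser, not the platform default (cp1252), handles encodings
with open(filepath, 'rb') as email_file:
    email_message = email.message_from_binary_file(email_file)

# walk() also visits a non-multipart message once, so no is_multipart() check is needed
for part in email_message.walk():
    # Skip multipart containers and attachments; only text parts hold the body
    if part.get_content_maintype() != 'text':
        continue
    # decode=True undoes base64/quoted-printable and returns bytes, so decode
    # with the part's charset instead of calling str() on the bytes
    payload = part.get_payload(decode=True)
    message = payload.decode(part.get_content_charset() or 'utf-8', errors='replace')
    plain_message = html2text.html2text(message)
    print(plain_message)
    print()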
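
For anyone who prefers to stay close to the original attempt, get_body() can also be asked for an HTML part when no plain-text part exists; it returns None when nothing matches the preference list. Below is a rough sketch using only the standard library, with html.parser swapped in for html2text. TextOnly and body_text are hypothetical names, the sample bytes are made up, and the tag stripping is deliberately crude (it keeps text nodes and drops script/style contents), so treat it as a starting point rather than a finished solution.

```python
from email import message_from_bytes, policy
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, ignoring tags and <script>/<style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def body_text(raw_bytes):
    """Return the message body as plain text, preferring a text/plain part."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    # get_content() undoes the transfer encoding and applies the charset
    content = body.get_content()
    if body.get_content_subtype() == "html":
        parser = TextOnly()
        parser.feed(content)
        parser.close()
        return "\n".join(parser.chunks)
    return content


# Hypothetical stand-in for an HTML-only, base64-encoded message
sample = (
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"PHA+SGVsbG88L3A+\r\n"
)

print(body_text(sample))
```

With a real file, the bytes read via open(filepath, 'rb') can be passed to body_text in place of the sample. html2text will generally give nicer output (links, lists, tables) than this simple parser.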