I am trying to build an email scraper that searches certain emails for values and stores them in a CSV file. I have tried many things to solve this problem, without success so far.
import imaplib

# imap_url, user and password are assumed to be defined elsewhere.

# Function to get email content part i.e its body part
def get_body(msg):
    if msg.is_multipart():
        # Descend into the first sub-part of a multipart message
        return get_body(msg.get_payload(0))
    else:
        return msg.get_payload(decode=True).decode()

# Function to search for a key value pair
def search(key, value, con):
    result, data = con.search(None, key, '"{}"'.format(value))
    return data

# Function to get the list of emails under this label
def get_emails(result_bytes):
    print("get email")
    msgs = []  # all the email data are pushed inside an array
    for num in result_bytes[0].split():
        typ, data = con.fetch(num, '(RFC822)')
        msgs.append(data)
    return msgs

# this is done to make SSL connection with GMAIL
con = imaplib.IMAP4_SSL(imap_url)
con.login(user, password)
con.select('Inbox')

msg_ids = get_emails(search('SUBJECT', 'TESTTITELPYTHON', con))

for msg in msg_ids[::-1]:
    for sent in msg:
        if type(sent) is tuple:
            print(msg)
            # encoding set as utf-8
            content = sent[1], 'utf-8'
            data = str(content)

            # Handling errors related to UnicodeEncodeError
            try:
                indexstart = data.find("span")
                data2 = data[indexstart + 5: len(data)]
                indexend = data2.find("</div>")

                # printing the required content which we need
                # to extract from our email i.e our body
                waarde = data2[0: indexend]
                test_naam_1 = waarde.split("Naam: ", 1)[1]
                echte_naam = test_naam_1.split("Email: ", -1)[0]
                email_test = waarde.split("Email: ", 1)[1]
                echte_email = email_test.split("Tel nr.: ", -1)[0]
                tel_test = waarde.split("Tel nr.: ", 1)[1]
                echte_tel = tel_test.split("Onderwerp: ", -1)[0]
                subj_test = waarde.split("Onderwerp: ", 1)[1]
                echte_subj = subj_test.split("Bericht: ", -1)[0]
                print("---ADRESGEGEVENS---")
                print("---Naam: " + echte_naam + "---")
                print("---Email: " + echte_email + "---")
                print("---Tel nr.: " + echte_tel + "---")
                print("---Onderwerp: " + echte_subj + "---")
            except UnicodeEncodeError as e:
                pass
In my results I still get these ugly line breaks, which look like this in the output:
[(b'12638 (RFC822 {1973}', b'MIME-Version: 1.0\r\nDate: Mon, 25 Oct 2021 16:41:46 +0200\r\nMessage-ID: <CAJDn=xsVynQqp7BwYoGZB=v21-AAR5=xcMkQ8D2kXE7ZpYFNNQ@mail.example.com>\r\nSubject: TESTTITELPYTHON\r\nFrom: Patrick Merkx <[email protected]>\r\nTo: Patrick Merkx <[email protected]>\r\nContent-Type: multipart/alternative; boundary="00000000000042e6ae05cf2e5c7e"\r\n\r\n--00000000000042e6ae05cf2e5c7e\r\nContent-Type: text/plain; charset="UTF-8"\r\n\r\nContactformulier ingevuld door:\r\nNaam: Patrick Merkx\r\nEmail: [email protected]\r\nTel nr.: 0611381219\r\n\r\nOnderwerp: Nog een test\r\n\r\nBericht:\r\nBericht\r\n\r\n--00000000000042e6ae05cf2e5c7e\r\nContent-Type: text/html; charset="UTF-8"\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n<div dir=3D"ltr"><div><div dir=3D"ltr" class=3D"gmail_signature" data-smart=\r\nmail=3D"gmail_signature"><div dir=3D"ltr"><div><div dir=3D"ltr"><div><div d=\r\nir=3D"ltr"><div style=3D"font-stretch:normal;font-size:13.33px;line-height:=\r\n19.99px;background:none;border:0px rgb(34,34,34);width:600px;overflow:visib=\r\nle;min-height:0px;outline-width:0px"><span class=3D"gmail-il" style=3D"font=\r\n-size:small">Contactformulier</span><span style=3D"font-size:small">=C2=A0i=\r\nngevuld door:</span><br style=3D"font-size:small"><span style=3D"font-size:=\r\nsmall">Naam: Patrick Merkx</span><br style=3D"font-size:small"><span style=\r\n=3D"font-size:small">Email:=C2=A0</span><a href=3D"mailto:merkx.patrick@gma=\r\nil.com" target=3D"_blank" style=3D"font-size:small">[email protected]=\r\n</a><br style=3D"font-size:small"><span style=3D"font-size:small">Tel nr.: =\r\n0611381219</span><br style=3D"font-size:small"><br style=3D"font-size:small=\r\n"><span style=3D"font-size:small">Onderwerp: Nog een test</span><br style=\r\n=3D"font-size:small"><br style=3D"font-size:small"><span style=3D"font-size=\r\n:small">Bericht:</span><br style=3D"font-size:small"><span style=3D"font-si=\r\nze:small">Bericht</span><br></div></div></div></div></div></div></div></div=\r\n></div>\r\n\r\n--00000000000042e6ae05cf2e5c7e--'), b')']
class=3D"gmail-il" style=3D"font=\r\n-size:small">Contactformulier</span><span style=3D"font-size:small">=C2=A0i=\r\nngevuld door:</span><br style=3D"font-size:small"><span style=3D"font-size:=\r\nsmall">Naam: Patrick Merkx</span><br style=3D"font-size:small"><span style=\r\n=3D"font-size:small">Email:=C2=A0</span><a href=3D"mailto:merkx.patrick@gma=\r\nil.com" target=3D"_blank" style=3D"font-size:small">[email protected]=\r\n</a><br style=3D"font-size:small"><span style=3D"font-size:small">Tel nr.: =\r\n0611381219</span><br style=3D"font-size:small"><br style=3D"font-size:small=\r\n"><span style=3D"font-size:small">Onderwerp: Nog een test</span><br style=\r\n=3D"font-size:small"><br style=3D"font-size:small"><span style=3D"font-size=\r\n:small">Bericht:</span><br style=3D"font-size:small"><span style=3D"font-si=\r\nze:small">Bericht</span><br>
I have also tried stripping the body tag, decoding the content, and several other approaches, but nothing has worked so far. I can't find any way to get these line breaks removed.
What am I doing wrong?
Answer:
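You are looking at a MIME part with Content-Transfer-Encoding: quoted-printable. In that encoding, a = at the end of a line is a soft line break and sequences like =3D are escaped bytes, which is why the raw text looks broken up. The proper way to decode that is to traverse the MIME structure and interpret the parts as you go. But there is no need to do that explicitly; Python's email library already does exactly that for you.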
from email import message_from_bytes
from email.policy import default
...
msg_ids = get_emails(search('SUBJECT', 'TESTTITELPYTHON', con))
for fetched in msg_ids[::-1]:
    for sent in fetched:
        if type(sent) is tuple:
            # Parse the raw RFC822 bytes into an EmailMessage object
            msg = message_from_bytes(sent[1], policy=default)
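To make the encoding itself less mysterious, here is a small sketch that decodes a quoted-printable fragment by hand with the standard-library quopri module. The fragment is an assumption modeled on the HTML part in the question, not the real message, and in practice the email library performs this step for you.

```python
import quopri

# Hypothetical fragment shaped like the HTML part in the question:
# "=3D" is an escaped "=", "=C2=A0" is a UTF-8 non-breaking space, and
# "=" directly before a line break is a soft line break.
raw = (
    b'<span style=3D"font-size:small">Email:=C2=A0</span><span style=\r\n'
    b'=3D"font-size:small">Tel nr.: =\r\n'
    b'0611381219</span>'
)

# Undo the transfer encoding first, then decode the bytes with the
# charset declared for the part (UTF-8 in the question's message).
decoded = quopri.decodestring(raw).decode("utf-8")
print(decoded)
```

The soft line breaks disappear only after the transfer encoding is undone, which is why string operations on the raw bytes cannot remove them reliably.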
Unfortunately, without examples of the MIME structures in these messages, I can't tell you exactly how to process the resulting message. Probably you have something like a "primary" MIME body part; msg.get_body(preferencelist=('html', 'plain')) would pull that out, and get_content() on the result would extract the decoded content of that body part.
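As a rough sketch of how that could look end to end, the example below builds a stand-in message, parses it, pulls out the body, and writes the fields to CSV. The sample data, the FIELDS tuple, and the extract_fields helper are all hypothetical, and it prefers the plain-text part over HTML on the assumption that the labels are easier to match there.

```python
import csv
import io
import re
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default

# Build a stand-in for one fetched message (hypothetical sample data).
sample = EmailMessage()
sample["Subject"] = "TESTTITELPYTHON"
sample.set_content(
    "Contactformulier ingevuld door:\n"
    "Naam: Jan Jansen\n"
    "Email: jan@example.com\n"
    "Tel nr.: 0612345678\n"
    "\n"
    "Onderwerp: Nog een test\n"
    "\n"
    "Bericht:\nBericht\n"
)
sample.add_alternative(
    "<div>Naam: Jan Jansen<br>Email: jan@example.com</div>", subtype="html"
)
raw_bytes = bytes(sample)  # plays the role of sent[1] from con.fetch()

msg = message_from_bytes(raw_bytes, policy=default)
body = msg.get_body(preferencelist=("plain", "html"))
text = body.get_content()  # transfer encoding and charset already handled

FIELDS = ("Naam", "Email", "Tel nr.", "Onderwerp")  # labels from the question

def extract_fields(text):
    """Return a dict with the value that follows each label on its line."""
    row = {}
    for label in FIELDS:
        pattern = r"^" + re.escape(label) + r":[ \t]*(.*)$"
        match = re.search(pattern, text, re.MULTILINE)
        row[label] = match.group(1).strip() if match else ""
    return row

row = extract_fields(text)

# Write to an in-memory buffer; a real script would open a file instead.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
print(out.getvalue())
```

If only an HTML part is available, the same approach applies after converting the HTML to text with a proper HTML parser rather than string slicing.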
The policy=default keyword argument selects the email.message.EmailMessage object class, which was introduced in Python 3.6 (it was available provisionally from 3.4), over the legacy email.message.Message object from older versions.
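The difference can be seen by parsing the same bytes twice, once without a policy and once with policy=default. The raw message below is a made-up single-part example, not one of the messages from the question.

```python
from email import message_from_bytes
from email.message import EmailMessage, Message
from email.policy import default

# Hypothetical single-part message using quoted-printable.
raw = (
    b"Subject: TESTTITELPYTHON\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Naam:=C2=A0Jan Jansen\r\n"
)

legacy = message_from_bytes(raw)                  # legacy Message object
modern = message_from_bytes(raw, policy=default)  # EmailMessage object

print(type(legacy).__name__, type(modern).__name__)

# The legacy class has no get_content(); you undo the transfer encoding
# with get_payload(decode=True) and decode the charset yourself.
legacy_text = legacy.get_payload(decode=True).decode(
    legacy.get_content_charset() or "ascii"
)

# The modern class handles both steps in one call.
modern_text = modern.get_content()

print(repr(legacy_text))
print(repr(modern_text))
```

Both routes give the same text here; the modern API simply removes the manual charset handling and adds helpers such as get_body().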
In some more detail, trying to decode raw email bodies as UTF-8 is very wrong. A typical MIME message has several parts, each of which might have a different encoding, and many of which certainly do not use UTF-8 as their encoding. UTF-8 is becoming more prevalent, but then the actual UTF-8 will typically be behind a content transfer encoding which shields it from damage during transport over routes which may not be 8-bit clean.
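A small sketch of that point: the made-up message below has one Latin-1 part sent as 8bit and one UTF-8 part sent as quoted-printable. Decoding the whole thing as UTF-8 fails, while walking the parts lets the library apply each part's own transfer encoding and charset.

```python
from email import message_from_bytes
from email.policy import default

# Hypothetical two-part message with different charsets and encodings.
raw = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b'Content-Type: text/plain; charset="ISO-8859-1"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Naam: Ren\xe9\r\n"
    b"--b1\r\n"
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<span>Naam: Ren=C3=A9</span>\r\n"
    b"--b1--\r\n"
)

# Treating the raw bytes as UTF-8 breaks on the Latin-1 part.
try:
    raw.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False
print("naive UTF-8 decode worked:", naive_ok)

msg = message_from_bytes(raw, policy=default)
parts = []
for part in msg.walk():
    if part.is_multipart():
        continue  # containers have no content of their own
    parts.append((
        part.get_content_type(),
        part.get_content_charset(),
        str(part["Content-Transfer-Encoding"]),
        part.get_content().strip(),
    ))

for entry in parts:
    print(entry)
```

Each leaf part reports its own charset and transfer encoding, and get_content() returns correctly decoded text for both.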