How to get a pandas DataFrame from an output?

Time:10-22

I am using the following code to retrieve the contents of emails. From that output I can extract the details of each email.

for i in range(messages, messages-N, -1):
    # fetch the email message by ID
    res, msg = imap.fetch(str(i), "(RFC822)")
  
    for response in msg:
        if isinstance(response, tuple):
            # parse a bytes email into a message object
            msg = email.message_from_bytes(response[1])
            # decode the email subject
            subject, encoding = decode_header(msg["Subject"])[0]
            if isinstance(subject, bytes):
                # if it's a bytes, decode to str (encoding can be None, so fall back to UTF-8)
                subject = subject.decode(encoding or "utf-8")
            # decode email sender
            From, encoding = decode_header(msg.get("From"))[0]
            if isinstance(From, bytes):
                From = From.decode(encoding or "utf-8")
            # decode email date
            Date, encoding = decode_header(msg["Date"])[0]
            if isinstance(Date, bytes):
                # if it's a bytes, decode to str
                Date = Date.decode(encoding or "utf-8")
                
            print("Subject:", subject)
            print("From:", From)
            print("Date:", Date)
            # if the email message is multipart
            if msg.is_multipart():
                # iterate over email parts
                for part in msg.walk():
                    # extract content type of email
                    content_type = part.get_content_type()
                    content_disposition = str(part.get("Content-Disposition"))
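                    # reset for each part so a failed decode never reuses the previous body
                    body = None
                    try:
                        # get the email body
                        body = part.get_payload(decode=True).decode()
                    except (AttributeError, UnicodeDecodeError):
                        # container parts have no payload; some payloads are not valid UTF-8
                        pass
                    if content_type == "text/plain" and "attachment" not in content_disposition and body is not None: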
                        # print text/plain emails and skip attachments
                        print(body)
                    elif "attachment" in content_disposition:
                        # download attachment
                        print("Subject:","This Contains an Attachment")
                        
            else:
                # extract content type of email
                content_type = msg.get_content_type()
                # get the email body
                body = msg.get_payload(decode=True).decode()
                if content_type == "text/plain":
                    # print only text email parts
                    print(body)
            if content_type == "text/html":
                print("Content Type is HTML")
            print("="*100)

But I need to get these values,

            print("Subject:", subject)
            print("From:", From)
            print("Date:", Date)

into a DataFrame. How should I improve this code? I need the whole output, for every email, to end up in one DataFrame.

CodePudding user response:

If you can use a temporary file or some other kind of storage, you can write the results there and then load them to build the DataFrame you want.

If the number of emails is really small, you don't need to optimize anything and can simply concatenate each row onto a DataFrame. This is bad practice, though, and you should avoid it where possible. With a large number of emails it causes all kinds of problems, starting with speed: every concatenation copies the data accumulated so far, so the loop becomes extremely slow. Write the results to a CSV file or an SQL database and you'll thank yourself later.
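As a rough sketch of the "write to storage first" idea, the snippet below appends one row per email to a CSV file using only the standard library. The helper name append_row, the file name emails.csv and the sample values are made up for illustration; in the real loop the row would be built from subject, From and Date.

```python
import csv
import os
import tempfile

FIELDS = ["Subject", "From", "Date"]

def append_row(path, row):
    """Append one email's header fields to a CSV file, writing the header line once."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

# Hypothetical sample values standing in for subject, From and Date from the loop.
csv_path = os.path.join(tempfile.mkdtemp(), "emails.csv")
append_row(csv_path, {"Subject": "Hello", "From": "a@example.com",
                      "Date": "Mon, 21 Oct 2024 10:00:00 +0000"})
append_row(csv_path, {"Subject": "Re: Hello", "From": "b@example.com",
                      "Date": "Tue, 22 Oct 2024 09:30:00 +0000"})

# Read the file back to check what was stored.
with open(csv_path, newline="", encoding="utf-8") as fh:
    rows = list(csv.DictReader(fh))
```

Once the loop has finished, pandas.read_csv(csv_path) should load the whole file into a DataFrame in a single call, so no DataFrame has to be grown row by row.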

CodePudding user response:

I think I understand your problem: subject is a str, not a list. You need to collect the values in lists as you loop and then pass those lists to the pandas DataFrame constructor, as I suggested in my comment. Please check the rough prototype below:

import pandas as pd

subject_list = []
from_list = []
date_list = []
for i in range(...):
    # replace prints with list.append()

    # print("Subject:", subject)
    # print("From:", From)
    # print("Date:", Date)
    subject_list.append(subject)
    from_list.append(From)
    date_list.append(Date)

df = pd.DataFrame({"Subject": subject_list, "From": from_list, "Date": date_list})
print(df)
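A variation on the prototype above, offered only as a sketch: instead of three parallel lists, collect one dict per email. The helpers header_text and message_to_row and the sample message are hypothetical names and data, not part of the original code; the email calls themselves are from the standard library.

```python
import email
from email.header import decode_header, make_header

def header_text(msg, name):
    """Return a header as a decoded str, or "" if the header is missing."""
    raw = msg.get(name)
    if raw is None:
        return ""
    # make_header joins every decoded chunk, not only the first one
    return str(make_header(decode_header(raw)))

def message_to_row(raw_bytes):
    """Parse raw message bytes and return the three header fields as a dict."""
    msg = email.message_from_bytes(raw_bytes)
    return {name: header_text(msg, name) for name in ("Subject", "From", "Date")}

# Hypothetical raw message standing in for response[1] from imap.fetch().
raw = (
    b"Subject: =?utf-8?q?Caf=C3=A9_menu?=\r\n"
    b"From: Alice <alice@example.com>\r\n"
    b"Date: Tue, 22 Oct 2024 09:30:00 +0000\r\n"
    b"\r\n"
    b"Body text\r\n"
)

rows = []                      # one dict per email
rows.append(message_to_row(raw))
```

Passing the finished list to pd.DataFrame(rows) should give one row per email with the columns Subject, From and Date. Keeping each email's fields together in one dict also means the columns cannot drift out of step if one header is missing.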