How to read a specific text from email body and save it to excel-CodePudding

Greetings for the day!

i am working on a new project to extract specific data mainly 1) Ticker 2) price 3) Date from email. now email could be of two types a) data comes in email body b) data comes as an attachment. currently i am focusing on the former, so far what i have a done i have created a sub-folder and named as Email_Snapper and set some rules so that whatever email i receive from x subjectline with gets in there.

Now my job is to retrieve specific values as mentioned above and store it in xlsx/csv file in C: drive. i have written the below code but i am literally new to regex family not sure how can this be done, any help on this would be greatly appreciated. The attached screenshot represents a sample format, we get the same ticker, price and date in different formats from different clients, i think i might have to write different functions (script) for each email.

import pyodbc
import win32com.client
import re
from datetime import datetime

EMAIL_ACCOUNT = '[email protected]'
EMAIL_SUBJ_SEARCH_STRING = 'QIC EUA Index Valuation'
EMAIL_CONTNT = {'Ticker': [], 'TickerLevel': [], 'DATE': []}

out_app = win32com.client.gencache.EnsureDispatch("Outlook.Application")
out_namespace = out_app.GetNamespace("MAPI")

# root_folder = out_namespace.Folders.Item(1)
root_folder = out_namespace.GetDefaultFolder(6)
# out_iter_folder = root_folder.Folders['Inbox'].Folders['Email_snapper']
out_iter_folder = root_folder.Folders['Email_snapper']

char_length_of_search_substring = len(EMAIL_SUBJ_SEARCH_STRING)

# Count all items in the sub-folder
item_count = out_iter_folder.Items.Count
Flag = False
cnt = 0
if out_iter_folder.Items.Count > 0:
    for i in range(item_count, 0, -1):
        message = out_iter_folder.Items[i]

        # Find only mail items and report, note, meeting etc items
        if '_MailItem' in str(type(message)):
            if EMAIL_SUBJ_SEARCH_STRING in message.Subject:
                Flag = True
                EMAIL_CONTNT['TickerLevel'].append(
                    message.Body[message.Body.find(r'SRAL'):message.Body.find(r'SRAL Ticker Level')].strip())

CodePudding user response：

This can be easily done with the inbuilt 'csv' library. Export the excel file as csv, then open it in python. You can read the contents with [DictReader][2]. Each line is treated as a dict and you can easily access any value with its column name.

Here is an example:

import csv

with open('input_file.csv', 'r') as inputfile:
   input_reader = csv.DictReader(inputfile)

   for line in input_reader:
       print(line["SRAL Ticker Level"])

You can save the contents as csv file with [DictWriter][3].

CodePudding user response：

There are several aspects when dealing with the Outlook object model.

First of all, use the Find/FindNext or Restrict methods of the Items class to get items that correspond to your search criteria instead of iterating over all items in the folder. Read more about them in the following articles:

Second, the following line of code uses the Body property which returns a plain text string:

message.Body[message.Body.find(r'SRAL'):message.Body.find(r'SRAL Ticker Level')].strip())

Instead, I'd suggest dealing with the HTMLBody property for extracting values more precisely. The HTMLBody property returns or sets a string representing the HTML body of the specified item. The HTMLBody property should be an HTML syntax string. Setting the HTMLBody property will always update the Body property immediately.

Also you may consider using the the Word editor. The WordEditor property of the Inspector class returns an instance of the Word Document which represents the message body. See Chapter 17: Working with Item Bodies for more information.