I'm trying to extract emails, and I'm getting a TypeError

I'm attempting to pull email addresses out of 500 Word documents with re.findall and write them to Excel. This is the code I have so far:

import pandas as pd
from docx.api import Document
import os
import re

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
output_path = 'C:\\Users\\user1\\test2'
writer = pd.ExcelWriter('{}/docx_emails.xlsx'.format(output_path),engine='xlsxwriter')


worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)

data = []    
    
for wordDoc in worddocs_list:
    match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
    data.append(match)
   

df = pd.DataFrame(data)
df.to_excel(writer)
writer.save()

print(df)

and I'm getting an error showing:

TypeError                                 Traceback (most recent call last)
Input In [6], in <cell line: 19>()
     17 data = []    
     19 for wordDoc in worddocs_list:
---> 20     match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
     21     data.append(match)
     24 df = pd.DataFrame(data)

File ~\anaconda3\lib\re.py:241, in findall(pattern, string, flags)
    233 def findall(pattern, string, flags=0):
    234     """Return a list of all non-overlapping matches in the string.
    235 
    236     If one or more capturing groups are present in the pattern, return
   (...)
    239 
    240     Empty matches are included in the result."""
--> 241     return _compile(pattern, flags).findall(string)

TypeError: expected string or bytes-like object

What am I doing wrong here?

Many thanks.

Answer:

Your wordDoc variable doesn't contain a string; it contains a Document object, which is why re.findall raises the TypeError. You need to look at the python-docx documentation to see how to get the body of the Word document out of the object as a string.
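
To see the failure in isolation, here is a minimal sketch that needs no Word files. FakeDocument is a hypothetical stand-in, not part of python-docx; it only plays the role of "some object that is not a string". The pattern is the one from the question.

```python
import re

class FakeDocument:
    """Hypothetical stand-in for a python-docx Document (any non-string object)."""

pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'

try:
    # Same mistake as in the question: passing an object instead of its text.
    re.findall(pattern, FakeDocument())
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The exact wording of the message can vary a little between Python versions, but it should be the same "expected string or bytes-like object" complaint shown in the traceback.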

It looks like you first have to get the Paragraphs with wordDoc.paragraphs and then ask each one for its text, so maybe something like this?

documentText = '\n'.join([p.text for p in wordDoc.paragraphs])

And then use that as the string to match against:

match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', documentText)

If you're going to be using the same regular expression over and over, though, you should probably compile it into a Pattern object first instead of passing it as a string to findall every time:

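regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
data = []
for filename in os.listdir(path):
    wordDoc = Document(os.path.join(path, filename))
    # Join the text of every paragraph into a single string.
    documentText = '\n'.join(p.text for p in wordDoc.paragraphs)
    # findall returns a list of all email-like matches in this document.
    match = regex.findall(documentText)
    data.append(match)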
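
Putting the pieces together, the approach might be sketched as below. The names EMAIL_RE, extract_emails and emails_from_docx are made up for this example, and the sample strings are invented. The python-docx import sits inside emails_from_docx so the matching logic can be tried out without the library or any Word files.

```python
import re

# Compiled once and reused for every document.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def extract_emails(paragraph_texts):
    """Join paragraph strings and return every email-like match."""
    document_text = '\n'.join(paragraph_texts)
    return EMAIL_RE.findall(document_text)

def emails_from_docx(docx_path):
    """Hypothetical helper: read one .docx file and return its emails."""
    # Imported here so the rest of the sketch runs without python-docx.
    from docx import Document
    word_doc = Document(docx_path)
    return extract_emails(p.text for p in word_doc.paragraphs)

# Plain strings standing in for the .text of each paragraph.
sample = [
    "Contact alice@example.com or",
    "bob.smith@mail.example.org for details.",
]
found = extract_emails(sample)
print(found)  # ['alice@example.com', 'bob.smith@mail.example.org']
```

Two caveats: as far as I know, wordDoc.paragraphs only covers body paragraphs, so addresses inside tables would need separate handling, and the pattern will also pick up a trailing full stop when an address ends a sentence.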