I'm trying to extract email addresses from 500 Word documents with re.findall and write them to Excel. This is the code I have so far:
import pandas as pd
from docx.api import Document
import os
import re

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
output_path = 'C:\\Users\\user1\\test2'
writer = pd.ExcelWriter('{}/docx_emails.xlsx'.format(output_path), engine='xlsxwriter')

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    match = re.findall(r'[\w.-]+@[\w-]+\.[\w.-]+', wordDoc)
    data.append(match)

df = pd.DataFrame(data)
df.to_excel(writer)
writer.save()
print(df)
I'm getting this error:
TypeError                                 Traceback (most recent call last)
Input In [6], in <cell line: 19>()
     17 data = []
     19 for wordDoc in worddocs_list:
---> 20     match = re.findall(r'[\w.-]+@[\w-]+\.[\w.-]+',wordDoc)
     21     data.append(match)
     24 df = pd.DataFrame(data)

File ~\anaconda3\lib\re.py:241, in findall(pattern, string, flags)
    233 def findall(pattern, string, flags=0):
    234     """Return a list of all non-overlapping matches in the string.
    235
    236     If one or more capturing groups are present in the pattern, return
   (...)
    239
    240     Empty matches are included in the result."""
--> 241     return _compile(pattern, flags).findall(string)

TypeError: expected string or bytes-like object
What am I doing wrong here?
Many thanks.
Answer:
Your wordDoc variable doesn't contain a string; it contains a Document object, which is why re.findall raises "expected string or bytes-like object". You need to look at the docx.api documentation to see how to get the body of the Word document out of the object as a string.
It looks like you first have to get the paragraphs with wordDoc.paragraphs and then ask each one for its text, so maybe something like this?
documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
And then use that as the string to match against:
match = re.findall(r'[\w.-]+@[\w-]+\.[\w.-]+', documentText)
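To check the idea without opening a real file, here is a minimal sketch. The helper names document_text and find_emails are made up for illustration, and fake_doc is only a stand-in that mimics the .paragraphs / .text attributes of a Document; it is not part of python-docx.

```python
import re
from types import SimpleNamespace

# Same pattern as above, kept as a plain string for re.findall
EMAIL_PATTERN = r'[\w.-]+@[\w-]+\.[\w.-]+'

def document_text(doc):
    """Join the text of every paragraph in a Document-like object."""
    return '\n'.join(p.text for p in doc.paragraphs)

def find_emails(doc):
    """Run the pattern over the joined text, not over the object itself."""
    return re.findall(EMAIL_PATTERN, document_text(doc))

# Stand-in for a python-docx Document, for demonstration only
fake_doc = SimpleNamespace(paragraphs=[
    SimpleNamespace(text='Contact alice@example.com for details'),
    SimpleNamespace(text='No address in this paragraph'),
    SimpleNamespace(text='Or write to bob.smith@mail.example.org'),
])

print(find_emails(fake_doc))
# ['alice@example.com', 'bob.smith@mail.example.org']
```

Two caveats, as far as I can tell: this pattern will include a trailing period if an address ends a sentence, and paragraphs only covers body paragraphs, so text inside tables would need to be collected separately through wordDoc.tables.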
If you're going to be using the same regular expression over and over, though, you should probably compile it into a Pattern object first instead of passing it as a string to findall every time:
regex = re.compile(r'[\w.-]+@[\w-]+\.[\w.-]+')
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
    match = regex.findall(documentText)
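Putting the pieces together, a rough end-to-end sketch might look like the following. The function names (is_word_file, email_rows, export_emails) and the file/email column layout are my own assumptions, not something from the question. It also skips anything in the folder that isn't a .docx file, since os.listdir returns every entry, and writes the sheet with df.to_excel directly rather than writer.save(), which I believe newer pandas versions no longer provide.

```python
import os
import re

EMAIL_RE = re.compile(r'[\w.-]+@[\w-]+\.[\w.-]+')

def is_word_file(filename):
    """True for .docx files, skipping Word's '~$' lock files."""
    return filename.lower().endswith('.docx') and not filename.startswith('~$')

def email_rows(filename, text):
    """One (filename, email) row per unique address, in first-seen order."""
    unique = dict.fromkeys(EMAIL_RE.findall(text))
    return [(filename, email) for email in unique]

def export_emails(path, output_file):
    """Scan every .docx in path and write the addresses found to output_file."""
    # Third-party imports live here so the helpers above work without them.
    import pandas as pd
    from docx import Document

    rows = []
    for filename in sorted(os.listdir(path)):
        if not is_word_file(filename):
            continue
        doc = Document(os.path.join(path, filename))
        text = '\n'.join(p.text for p in doc.paragraphs)
        rows.extend(email_rows(filename, text))

    df = pd.DataFrame(rows, columns=['file', 'email'])
    # Writing .xlsx needs an Excel engine such as openpyxl or xlsxwriter installed
    df.to_excel(output_file, index=False)
    return df
```

With the paths from the question, the call would be something like export_emails(r'C:\Users\user1\test', r'C:\Users\user1\test2\docx_emails.xlsx'). One row per address keeps the sheet rectangular, whereas a list of lists gives a ragged table with as many columns as the document with the most matches.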