I'm trying to extract table data from Word files. I have to iterate through 500 Word files and extract a specific table from each one, but the table appears at a different position in each file. This is the code I have:
import pandas as pd
from docx.api import Document
import os
os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(path + "\\" + filename)
    worddocs_list.append(wordDoc)
data = []
for wordDoc in worddocs_list:
    table = wordDoc.tables[8]
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        row_data = (text)
        data.append(row_data)
df = pd.DataFrame(data)
print(df)
This goes through all the files fine, but raises an IndexError for some of the Word documents, because it only looks the table up by position (wordDoc.tables[8]) and those documents have no table at that index. Instead, I want to look for the table with certain column titles: CONTACT NAME POSITION LOCATION EMAIL TELEPHONE ASSET CLASS
Is there a way that I can modify the code shown to be able to find the tables I'm looking for?
Many thanks.
CodePudding user response:
Instead of changing the logic to look up tables by column names, you can catch the IndexError and ignore it. This lets the loop continue without failing when a document does not have that table. This is done using try and except.
import pandas as pd
from docx.api import Document
import os
os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)
data = []
for wordDoc in worddocs_list:
    try:
        table = wordDoc.tables[8]
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            # materialise the generator so each row is stored as plain values
            row_data = tuple(text)
            data.append(row_data)
    except IndexError:
        # this document has fewer than nine tables, so skip it
        continue
df = pd.DataFrame(data)
print(df)
Also, note that it is better to use os.path.join() when combining paths instead of concatenating the path strings.
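If you do want to match on the column titles, as the question asks, something along these lines might work. Treat it as a rough sketch rather than a tested solution for your files: it assumes the titles sit in the first row of the table, and the split of the titles into six columns is my guess, since the question lists them without separators. The names EXPECTED_HEADERS, normalize, find_table_by_headers and collect_rows are made up for this example.

```python
import os

# Assumed split of the column titles; adjust to match the real header row.
EXPECTED_HEADERS = ["CONTACT NAME", "POSITION", "LOCATION",
                    "EMAIL", "TELEPHONE", "ASSET CLASS"]


def normalize(text):
    """Collapse whitespace and upper-case, so 'Contact  name' equals 'CONTACT NAME'."""
    return " ".join(text.split()).upper()


def find_table_by_headers(tables, expected=EXPECTED_HEADERS):
    """Return the first table whose first row matches `expected`, or None."""
    wanted = [normalize(h) for h in expected]
    for table in tables:
        rows = list(table.rows)
        if not rows:
            continue
        header = [normalize(cell.text) for cell in rows[0].cells]
        if header == wanted:
            return table
    return None


def collect_rows(folder):
    """Gather the data rows (header excluded) of the matching table in each .docx file."""
    # python-docx is imported here so the helpers above work without it.
    from docx import Document

    data = []
    for filename in sorted(os.listdir(folder)):
        if not filename.lower().endswith(".docx"):
            continue  # skip anything that is not a Word document
        doc = Document(os.path.join(folder, filename))
        table = find_table_by_headers(doc.tables)
        if table is None:
            continue  # no table with these titles in this document
        for row in list(table.rows)[1:]:
            data.append([cell.text for cell in row.cells])
    return data
```

The list returned by collect_rows(path) can then be passed to pd.DataFrame(data, columns=EXPECTED_HEADERS), provided every row has the same six cells as the header. Documents without a matching table are skipped, so the position of the table in each file no longer matters.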