So I've used the below code to 1)grab all .docx files 2)convert them to csv files 3)create dataframes for each file. The issue is that once I run the code, I get the following error, "TypeError: Document() takes from 0 to 1 positional argument but 450 were given". I think this issue is coming from passing the list as an argument. Could someone point me out how to fix this problem?
import pandas as pd
import io
import csv
from docx import Document
files = []
for path, subdirs, files in os.walk(r"path"):
for name in files:
# For each file we find, we need to ensure it is a .docx file before adding
# it to our list
if os.path.splitext(os.path.join(path, name))[1] == ".docx":
document_list.append(os.path.join(path, name))
def read_docx_tables(*files, tab_id=None, **kwargs):
"""
parse table(s) from a Word Document (.docx) into Pandas DataFrame(s)
Parameters:
filename: file name of a Word Document
tab_id: parse a single table with the index: [tab_id] (counting from 0).
When [None] - return a list of DataFrames (parse all tables)
kwargs: arguments to pass to `pd.read_csv()` function
Return: a single DataFrame if tab_id != None or a list of DataFrames otherwise
"""
def read_docx_tab(tab, **kwargs):
vf = io.StringIO()
writer = csv.writer(vf)
for row in tab.rows:
writer.writerow(cell.text for cell in row.cells)
vf.seek(0)
return pd.read_csv(vf, **kwargs)
doc = Document(*files)
if tab_id is None:
return [read_docx_tab(tab, **kwargs) for tab in doc.tables]
else:
try:
return read_docx_tab(doc.tables[tab_id], **kwargs)
except IndexError:
print('Error: specified [tab_id]: {} does not exist.'.format(tab_id))
raise
CodePudding user response:
From help(Document)
:
Document(docx=None)
Return a |Document| object loaded from *docx*, where *docx* can be either a path to a ``.docx`` file (a string) or a file-like object. If *docx* is missing or ``None``, the built-in default document "template" is loaded.
So you need to process individual files:
for file in files:
doc = Document(file)