The following function is what I have come up with to iterate through multiple excel files to store the data into a single data frame. However, only the data from the final file is being stored in the final data frame. What should I do to get the data from all the files to be stored in the same df and then exported to a csv file ?
def excel_to_df(folder, start_row, end_row, start_col, end_col):
# loop through all excel files in the folder
for file in os.listdir(folder):
exact_file_path = folder "\\\\" file
print(exact_file_path)
# check if file is an excel file
if file.endswith('xlsx'):
# create workbook and its worksheets for each file
wb = openpyxl.load_workbook(exact_file_path)
ws = wb.worksheets
# create a list to store the dataframes
df_list = []
# iterate over the worksheets
for worksheet in ws:
# get the name of the worksheet
name = worksheet.title
# create an empty list to store the values
cell_values = []
# iterate over the rows and columns in the range
for row in worksheet.iter_rows(min_row = row_min, max_row = row_max,
min_col = col_min, max_col = col_max):
# append the cell values to the list
cell_values.append([cell.value for cell in row])
# create a dataframe from the cell values and the worksheet name
df = pd.DataFrame(cell_values, columns=range(start_col, end_col 1), index=[name]*len(cell_values))
# append the df to the list
df_list.append(df)
# concatenate the list of dataframes into a single dataframe
df = pd.concat(df_list)
# save the output to a csv file
df.to_csv('test.csv', index=True)
return df
CodePudding user response:
Your immediate problem is that you're creating df_list
inside the loop so that each time the loop starts over, it will overwrite whatever was already in it. Additionally, your return
is at the end of (and inside) the loop so it doesn't ever get to the second element. When it gets to the return
it gives you what it has and stops running. You just need to rearrange it, like this:
def excel_to_df(folder, start_row, end_row, start_col, end_col):
# create a list to store the dataframes
df_list = []
# loop through all excel files in the folder
for file in os.listdir(folder):
exact_file_path = folder "\\\\" file
print(exact_file_path)
# check if file is an excel file
if file.endswith('xlsx'):
# create workbook and its worksheets for each file
wb = openpyxl.load_workbook(exact_file_path)
ws = wb.worksheets
else:
# if the file doesn't end with xlsx then don't try to open it as though it is
next
# iterate over the worksheets
for worksheet in ws:
# get the name of the worksheet
name = worksheet.title
# create an empty list to store the values
cell_values = []
# iterate over the rows and columns in the range
for row in worksheet.iter_rows(min_row = row_min, max_row = row_max,
min_col = col_min, max_col = col_max):
# append the cell values to the list
cell_values.append([cell.value for cell in row])
# create a dataframe from the cell values and the worksheet name
df = pd.DataFrame(cell_values, columns=range(start_col, end_col 1), index=[name]*len(cell_values))
# append the df to the list
df_list.append(df)
# concatenate the list of dataframes into a single dataframe
df = pd.concat(df_list)
# save the output to a csv file
df.to_csv('test.csv', index=True)
return df
As an aside, is there a reason you're manually creating a DF instead of just using pd.read_excel
? If not, I'd recommend getting rid of your for row
loop and just use pd.read_excel(exact_file_path, sheet_name=worksheet.title)