I have excel files which each have 3 sheets and all same column headers.
I want to merge all so that I get one combined dataframe. This dataframe should have the sheetname,row number and xls filename listed next to the final dataframe.
i tried
import pandas as pd
dfe = pd.concat(pd.read_excel('abc1.xls', sheet_name=None), ignore_index=True)
this combined all sheets for one of the files but did not write the row numbers or sheetnames next to the data. How can I do the same?
Desired output-
Rownumber Sheetname Filename State Region Brand
1 tig abc1.xls CA S Go
2 con abc2.xls IA A Po
CodePudding user response:
Here's how to do it. I used two functions. The first function reads all sheets within a single Excel file, and adds the sheet name. The second function takes all of the excel files, and uses the first function to read all the sheets in all the files.
from pandas import pd
def read_sheets(filename):
result = []
sheets = pd.read_excel(filename, sheet_name=None)
for name, sheet in sheets.items():
sheet['Sheetname'] = name
sheet['Row'] = sheet.index
result.append(sheet)
return pd.concat(result, ignore_index=True)
def read_files(filenames):
result = []
for filename in filenames:
file = read_sheets(filename)
file['Filename'] = filename
result.append(file)
return pd.concat(result, ignore_index=True)
You can call this by providing a list of files to read:
files = ['multisheet.xls', 'multisheet2.xls']
read_files(files)
For the example I tried it on, it produces a dataframe like this:
A B A B Sheetname Row Filename
0 1 10 11 Sheet1 0 multisheet.xls
1 2 11 13 Sheet1 1 multisheet.xls
2 3 12 15 Sheet1 2 multisheet.xls
3 4 13 17 Sheet1 3 multisheet.xls
4 3 10 13 Sheet2 0 multisheet.xls
5 3 11 14 Sheet2 1 multisheet.xls
6 3 12 15 Sheet2 2 multisheet.xls
7 3 13 16 Sheet2 3 multisheet.xls
8 1 10 11 Sheet1 0 multisheet2.xls
9 2 11 13 Sheet1 1 multisheet2.xls
10 3 12 15 Sheet1 2 multisheet2.xls
11 4 13 17 Sheet1 3 multisheet2.xls
12 4 10 13 Sheet2 0 multisheet2.xls
13 3 11 14 Sheet2 1 multisheet2.xls
14 3 12 15 Sheet2 2 multisheet2.xls
15 3 13 16 Sheet2 3 multisheet2.xls
CodePudding user response:
Try this:
list_dfs = []
for f in ['abc1.xls', 'abc2.xls', 'abc3.xls']:
dd = pd.read_excel(f'{f}', sheet_name=None)
df = pd.concat(dd, keys=dd.keys()).stack().reset_index().assign(Filename=f'{f}')
list_dfs.append(df)
df_out = pd.concat(list_dfs, ignore_index=True)