Home > Software design >  How do I merge multiple xls files into one dataframe in python?
How do I merge multiple xls files into one dataframe in python?

Time:03-16

I have excel files which each have 3 sheets and all same column headers.

I want to merge all so that I get one combined dataframe. This dataframe should have the sheetname,row number and xls filename listed next to the final dataframe.

i tried

 import pandas as pd
 dfe = pd.concat(pd.read_excel('abc1.xls', sheet_name=None), ignore_index=True)

this combined all sheets for one of the files but did not write the row numbers or sheetnames next to the data. How can I do the same?

Desired output-

Rownumber Sheetname Filename         State Region  Brand   
 1           tig       abc1.xls      CA         S   Go     
 2           con       abc2.xls      IA         A   Po

CodePudding user response:

Here's how to do it. I used two functions. The first function reads all sheets within a single Excel file, and adds the sheet name. The second function takes all of the excel files, and uses the first function to read all the sheets in all the files.

from pandas import pd
def read_sheets(filename):
    result = []
    sheets = pd.read_excel(filename, sheet_name=None)
    for name, sheet in sheets.items():
        sheet['Sheetname'] = name
        sheet['Row'] = sheet.index
        result.append(sheet)
    return pd.concat(result, ignore_index=True)

def read_files(filenames):
    result = []
    for filename in filenames:
        file = read_sheets(filename)
        file['Filename'] = filename
        result.append(file)
    return pd.concat(result, ignore_index=True)

You can call this by providing a list of files to read:

files = ['multisheet.xls', 'multisheet2.xls']
read_files(files)

For the example I tried it on, it produces a dataframe like this:

    A   B  A B Sheetname  Row         Filename
0   1  10   11    Sheet1    0   multisheet.xls
1   2  11   13    Sheet1    1   multisheet.xls
2   3  12   15    Sheet1    2   multisheet.xls
3   4  13   17    Sheet1    3   multisheet.xls
4   3  10   13    Sheet2    0   multisheet.xls
5   3  11   14    Sheet2    1   multisheet.xls
6   3  12   15    Sheet2    2   multisheet.xls
7   3  13   16    Sheet2    3   multisheet.xls
8   1  10   11    Sheet1    0  multisheet2.xls
9   2  11   13    Sheet1    1  multisheet2.xls
10  3  12   15    Sheet1    2  multisheet2.xls
11  4  13   17    Sheet1    3  multisheet2.xls
12  4  10   13    Sheet2    0  multisheet2.xls
13  3  11   14    Sheet2    1  multisheet2.xls
14  3  12   15    Sheet2    2  multisheet2.xls
15  3  13   16    Sheet2    3  multisheet2.xls

CodePudding user response:

Try this:

list_dfs = []
for f in ['abc1.xls', 'abc2.xls', 'abc3.xls']:
    dd = pd.read_excel(f'{f}', sheet_name=None)
    df = pd.concat(dd, keys=dd.keys()).stack().reset_index().assign(Filename=f'{f}')
    list_dfs.append(df)
df_out = pd.concat(list_dfs, ignore_index=True)
  • Related