Python complex iterating through excel files to concatenate colnames that are not named equal-CodePudding

I have multiple xls files in a directory.

each file dataframe headers are different but data type is same.

1.xls
Location       StreetAddress
America Pvtld   800 st

2.xls           
Billtoaddress     Company
north main st     Europe & Co

3.xls           
Billed-to-Name   Billingaddress
America Pvtld    3302 snellmae rd

Each xls files have multiple sheets with same formating within them

essentially, companyname = location = Billed-to-Name i.e. all are same. ditto with address

My final objective is to merge all files and sheets into one where my df looks like

final.xls

Company         Address     RowNumber Sheetname Filename
America Pvtld   800 st       1         sheet1   1.xls
Europe & Co    north main st    10     sheet2   2.xls
America Pvtld  3302 snellmae rd  122   sheet1   3.xls

where rownumber represents the original row number from the sheet it was sourced from and sheetname is the name of the sheet within the xls files. How do i concat them? there are more column names in xls files but finalxls does not need all. the columns are not in same position as others as you can see in 2.xls

CodePudding user response：

This code does what you want but with csv files; for some reason I couldn't read xls files quickly so I just used csv. Assuming you can read xls files, this should do what you want.

files = ["1.csv", "2.csv", "3.csv"]
colnames = ["company", "address", "RowNumber", "Sheetname"]
df = pd.DataFrame(columns=colnames)  # create empty df
for file in files:
    df_file = pd.read_csv(file, skiprows=1, names=colnames)  # skip the column names
    df_file["RowNumber"] = df_file.index  # fill new data
    df_file["Sheetname"] = "sheet"   file.split(".")[0]
    df = df.append(df_file)

df.columns = colnames  # EDIT: UNNECESSARY

print(df)
         company           address RowNumber Sheetname
0  America Pvtld            800 st         0    sheet1
0    Europe & Co     north main st         0    sheet2
0  America Pvtld  3302 snellmae rd         0    sheet3

EDIT: To read files from the directory whatever their names are you can do the following:

colnames = ["company", "address", "RowNumber", "Sheetname"]
df = pd.DataFrame(columns=colnames)
for file in os.listdir(<directory of xls files>):
    if not file.endswith(".xls"):
        continue

and the rest is the same.

CodePudding user response：

You can try the following. Basically, we first create a mapping of column names so that could rename them as we collect each sheet. Then create ExcelFile object for each Excel document and read their sheet names and use parse to read each sheet. Then use assign to assign new columns and rename to modify column names. Finally concat to build a single DataFrame.

cols_mapping = {'Company':'Location', 'Billed-to-Name':'Location', 
                'StreetAddress':'Billtoaddress', 'Billingaddress':'Billtoaddress'}
out = []
for f in filenames:
    file = pd.ExcelFile(f)
    for sn in file.sheet_names:
        tmp = (file.parse(sheet_name=f"{sn}")
               .rename(columns=cols_mapping)
               .assign(Sheetname=sn, Filename=f.rsplit('\\')[-1]))
        out.append(tmp)
out = pd.concat(out).reset_index().rename(columns={'index':'RowNumber'})

Output:

   RowNumber        Company           Address  Sheetname  Filename
0          0  America Pvtld            800 st     sheet1    1.xlsx
1          0    Europe & Co     north main st     sheet2    2.xlsx
2          0  America Pvtld  3302 snellmae rd     sheet1    3.xlsx