I have multiple xls files in a directory.
each file dataframe headers are different but data type is same.
1.xls
Location StreetAddress
America Pvtld 800 st
2.xls
Billtoaddress Company
north main st Europe & Co
3.xls
Billed-to-Name Billingaddress
America Pvtld 3302 snellmae rd
Each xls files have multiple sheets with same formating within them
essentially, companyname = location = Billed-to-Name i.e. all are same. ditto with address
My final objective is to merge all files and sheets into one where my df looks like
final.xls
Company Address RowNumber Sheetname Filename
America Pvtld 800 st 1 sheet1 1.xls
Europe & Co north main st 10 sheet2 2.xls
America Pvtld 3302 snellmae rd 122 sheet1 3.xls
where rownumber represents the original row number from the sheet it was sourced from and sheetname is the name of the sheet within the xls files. How do i concat them? there are more column names in xls files but finalxls does not need all. the columns are not in same position as others as you can see in 2.xls
CodePudding user response:
This code does what you want but with csv files; for some reason I couldn't read xls files quickly so I just used csv. Assuming you can read xls files, this should do what you want.
files = ["1.csv", "2.csv", "3.csv"]
colnames = ["company", "address", "RowNumber", "Sheetname"]
df = pd.DataFrame(columns=colnames) # create empty df
for file in files:
df_file = pd.read_csv(file, skiprows=1, names=colnames) # skip the column names
df_file["RowNumber"] = df_file.index # fill new data
df_file["Sheetname"] = "sheet" file.split(".")[0]
df = df.append(df_file)
df.columns = colnames # EDIT: UNNECESSARY
print(df)
company address RowNumber Sheetname
0 America Pvtld 800 st 0 sheet1
0 Europe & Co north main st 0 sheet2
0 America Pvtld 3302 snellmae rd 0 sheet3
EDIT: To read files from the directory whatever their names are you can do the following:
colnames = ["company", "address", "RowNumber", "Sheetname"]
df = pd.DataFrame(columns=colnames)
for file in os.listdir(<directory of xls files>):
if not file.endswith(".xls"):
continue
and the rest is the same.
CodePudding user response:
You can try the following. Basically, we first create a mapping of column names so that could rename them as we collect each sheet. Then create ExcelFile object for each Excel document and read their sheet names and use parse
to read each sheet. Then use assign
to assign new columns and rename
to modify column names. Finally concat
to build a single DataFrame.
cols_mapping = {'Company':'Location', 'Billed-to-Name':'Location',
'StreetAddress':'Billtoaddress', 'Billingaddress':'Billtoaddress'}
out = []
for f in filenames:
file = pd.ExcelFile(f)
for sn in file.sheet_names:
tmp = (file.parse(sheet_name=f"{sn}")
.rename(columns=cols_mapping)
.assign(Sheetname=sn, Filename=f.rsplit('\\')[-1]))
out.append(tmp)
out = pd.concat(out).reset_index().rename(columns={'index':'RowNumber'})
Output:
RowNumber Company Address Sheetname Filename
0 0 America Pvtld 800 st sheet1 1.xlsx
1 0 Europe & Co north main st sheet2 2.xlsx
2 0 America Pvtld 3302 snellmae rd sheet1 3.xlsx