I have list of dataframes where each has different columns, and I want to assign unique column names to all and combine it but it is not working. Is there any quick way to do this in pandas?
my attempt
!pip install wget
import wget
import pandas as pd
url = 'https://github.com/adamFlyn/test_rl/blob/main/test_data.xlsx'
data= wget.download(url)
xls = pd.ExcelFile('~/test_data.xlsx')
names = xls.sheet_names[1:]
# iterate to find sheet name that matches
data_dict = pd.read_excel(xls, sheet_name = [name for name in xls.sheet_names if name in names])
dfs=[]
for key, val in data_dict.items():
val['state_abbr'] = key
dfs.append(val)
for df in dfs:
st=df.columns[0]
df['state']=st
df.reset_index()
for df in dfs:
lst=df.columns.tolist()
lst=['county','orientation','state_abbr','state']
df.columns=lst
final_df=pd.concat(dfs, axis=1, inplace=True)
but I am not able to rename coumns of each dataframe like this and have this error:
for df in dfs:
lst=df.columns.tolist()
lst=['county','orientation','state_abbr','state']
df.columns=lst
ValueError: Length mismatch: Expected axis has 5 elements, new values have 4 elements
how should I do this in pandas? any quick thoughts or trick to do this? thanks
CodePudding user response:
The error was coming from the data. Almost all DataFrames sheets had 3 columns but only "NC" had a redundant column that starts as "Unnamed", which is almost all NaN except for one row which has "`"
as value. If we remove that column from that sheet, the rest of the code works as expected.
You can assign new columns using assign
and change column names using set_axis
in a dict comprehension. Also, instead of a list comprehension to get the sheet names, you can use names
itself. Finally, simply concatenate all with concat
.
out = pd.concat([df.loc[:, ~df.columns.str.startswith('Unnamed')]
.set_axis(['county','orientation'], axis=1)
.assign(state=df.columns[0], state_abbr=k)
for k, df in pd.read_excel(xls, sheet_name = names).items()])
Output:
county orientation state state_abbr
0 Aleutians East Plaintiff Alaska AK
1 Aleutians West Plaintiff Alaska AK
2 Anchorage Neutral Alaska AK
3 Bethel Plaintiff Alaska AK
4 Bristol Bay Plaintiff Alaska AK
.. ... ... ... ...
18 Sweetwater Neutral Wyoming WY
19 Teton Neutral Wyoming WY
20 Uinta Defense Wyoming WY
21 Washakie Defense Wyoming WY
22 Weston Defense Wyoming WY
[3117 rows x 4 columns]
CodePudding user response:
Here, when you run reset_index()
this doesn't actually do anything inplace.
for df in dfs:
st=df.columns[0]
df['state']=st
df.reset_index() # <----- returns df with reset index, but is not assigned.
CodePudding user response:
you should move df.columns=lst to out of loop
for df in dfs:
lst=df.columns.tolist()
lst=['county','orientation','state_abbr','state']
df.columns=lst