Home > Blockchain >  How to rename columns of list of dataframes in pandas?
How to rename columns of list of dataframes in pandas?

Time:04-29

I have list of dataframes where each has different columns, and I want to assign unique column names to all and combine it but it is not working. Is there any quick way to do this in pandas?

my attempt

!pip install wget

import wget
import pandas as pd

url = 'https://github.com/adamFlyn/test_rl/blob/main/test_data.xlsx'
data= wget.download(url)

xls = pd.ExcelFile('~/test_data.xlsx')
names = xls.sheet_names[1:]
# iterate to find sheet name that matches 
data_dict = pd.read_excel(xls, sheet_name = [name for name in xls.sheet_names if name in names])

dfs=[]
for key, val in data_dict.items():
    val['state_abbr'] = key
    dfs.append(val)

for df in dfs:
    st=df.columns[0]
    df['state']=st
    df.reset_index()


for df in dfs:
    lst=df.columns.tolist()
    lst=['county','orientation','state_abbr','state']
    df.columns=lst

final_df=pd.concat(dfs, axis=1, inplace=True) 

but I am not able to rename coumns of each dataframe like this and have this error:

for df in dfs:
    lst=df.columns.tolist()
    lst=['county','orientation','state_abbr','state']
    df.columns=lst

ValueError: Length mismatch: Expected axis has 5 elements, new values have 4 elements

how should I do this in pandas? any quick thoughts or trick to do this? thanks

CodePudding user response:

The error was coming from the data. Almost all DataFrames sheets had 3 columns but only "NC" had a redundant column that starts as "Unnamed", which is almost all NaN except for one row which has "`" as value. If we remove that column from that sheet, the rest of the code works as expected.

You can assign new columns using assign and change column names using set_axis in a dict comprehension. Also, instead of a list comprehension to get the sheet names, you can use names itself. Finally, simply concatenate all with concat.

out = pd.concat([df.loc[:, ~df.columns.str.startswith('Unnamed')]
                 .set_axis(['county','orientation'], axis=1)
                 .assign(state=df.columns[0], state_abbr=k)
                 for k, df in pd.read_excel(xls, sheet_name = names).items()])

Output:

            county orientation     state state_abbr
0   Aleutians East  Plaintiff     Alaska         AK
1   Aleutians West  Plaintiff     Alaska         AK
2        Anchorage     Neutral    Alaska         AK
3           Bethel  Plaintiff     Alaska         AK
4      Bristol Bay  Plaintiff     Alaska         AK
..             ...         ...       ...        ...
18      Sweetwater     Neutral  Wyoming          WY
19           Teton     Neutral  Wyoming          WY
20           Uinta    Defense   Wyoming          WY
21        Washakie    Defense   Wyoming          WY
22          Weston     Defense  Wyoming          WY

[3117 rows x 4 columns]

CodePudding user response:

Here, when you run reset_index() this doesn't actually do anything inplace.

for df in dfs:
    st=df.columns[0]
    df['state']=st
    df.reset_index() # <----- returns df with reset index, but is not assigned.

CodePudding user response:

you should move df.columns=lst to out of loop

    for df in dfs:
        lst=df.columns.tolist()
        lst=['county','orientation','state_abbr','state']
    df.columns=lst
  • Related