Apply forward to multiple columns in dataframe with the same name-CodePudding

I have a dataframe that has columns with a repeated name. I have this code that does a forward-fill for blank values. However, it only does it on the first column found with that name.

Example dataframe:

import pandas as pd

data_dict = {'name1': [1.0, '', 2.0, '', 5.0], 'name2': [10.0, '', 14.0, 18.0, ''], 'name3': ['some string2', 'some string3', 'some string4', 'some string5', 'some string6'], 'name4': ['description2', 'description3', 'description4', 'description5', 'description6'], 'name2.1': [36.0, '', '', 44.0, ''], 'name6': ['more text2', 'more text3', 'more text4', 'more text5', 'more text6']}

df = pd.DataFrame.from_dict(data_dict)
df

Dataframe:

name1   name2   name3          name4           name2    name6
1       10      some string2    description2    36      more text2
                some string3    description3            more text3
2       14      some string4    description4            more text4
        18      some string5    description5    44      more text5
5               some string6    description6            more text6

Here is my code:

backfill_header_list = [
    'Name1',
    'Name2'
]    
for i in backfill_header_list:
    df.loc[:, i] = df.loc[:, i].fillna(method='ffill')

For this example, Name2 is the repeated column name.

Desired output:

name1   name2   name3          name4           name2    name6
1       10      some string2    description2    36      more text2
1       10      some string3    description3    36      more text3
2       14      some string4    description4    36      more text4
2       18      some string5    description5    44      more text5
5       18      some string6    description6    44      more text6

Is there an efficient way to have pandas iterate through all columns that match that name?

CodePudding user response：

If you already have the list of names to be matched, just referencing will be enough.

data = {'a':[1,np.nan,2,np.nan,5],'b':[10,np.nan,14,18,np.nan],'c':['somestrings']*5,'d':['descriptions']*5,'e':[36,np.nan,np.nan,44,np.nan], 'f':['more texts']*5}
df= pd.DataFrame(data)
df.columns = ['name1','name2','name3','name4','name2','name6']

df.loc[:,backfill_header_list].fillna(method='ffill')
 
   name1  name2  name2
0    1.0   10.0   36.0
1    1.0   10.0   36.0
2    2.0   14.0   36.0
3    2.0   18.0   44.0
4    5.0   18.0   44.0

However, if you want to assign it back to your original dataframe, it can't be done due to the same multiple column names.

df.loc[:,backfill_header_list] = df.loc[:,backfill_header_list].fillna(method='ffill')

ValueError: Setting with non-unique columns is not allowed.

So, what you can do is using df.apply.

df[backfill_header_list] = df[backfill_header_list].apply(lambda x : x.fillna(method='ffill'))

   name1  name2        name3         name4  name2       name6
0    1.0   10.0  somestrings  descriptions   36.0  more texts
1    1.0   10.0  somestrings  descriptions   36.0  more texts
2    2.0   14.0  somestrings  descriptions   36.0  more texts
3    2.0   18.0  somestrings  descriptions   44.0  more texts
4    5.0   18.0  somestrings  descriptions   44.0  more texts

However, it's better not to have the same column names in your dataframe because it will certainly cause you many problems later on.

CodePudding user response：

In case anyone else comes across this problem, here's what I came up with.

I essentially created a function to compare the headers to my backfill_header_list and create a list of indexes to loop through in my backfill for loop.

headers = list(df.columns)

backfill_header_list = [
    'Name1',
    'Name2'
backfill_index_list = []
    for i, col in enumerate(headers):
        if col in backfill_header_list:
            backfill_index_list.append(i)

for i in backfill_index_list:
    df.iloc[:, i] = df.iloc[:, i].fillna(method='ffill')

CodePudding user response：

I think you can use this code below it simialr to @Kevin answer

df.loc[:,df.columns].fillna(method='ffill')

if you want to modify last column of you duplicated column you can use

df.loc[:,~df.T.duplicated(keep='last')]..fillna(method='ffill')

but as all comments
Duplicate column names are a problem if you plan to transfer your data set to another statistical language. They're also a problem because it will cause unanticipated and sometimes difficult to debug problems in Python.