I have a dataframe that has columns with a repeated name. I have this code that does a forward-fill for blank values. However, it only does it on the first column found with that name.
Example dataframe:
import pandas as pd
data_dict = {'name1': [1.0, '', 2.0, '', 5.0], 'name2': [10.0, '', 14.0, 18.0, ''], 'name3': ['some string2', 'some string3', 'some string4', 'some string5', 'some string6'], 'name4': ['description2', 'description3', 'description4', 'description5', 'description6'], 'name2.1': [36.0, '', '', 44.0, ''], 'name6': ['more text2', 'more text3', 'more text4', 'more text5', 'more text6']}
df = pd.DataFrame.from_dict(data_dict)
df
Dataframe:
name1 name2 name3 name4 name2 name6
1 10 some string2 description2 36 more text2
some string3 description3 more text3
2 14 some string4 description4 more text4
18 some string5 description5 44 more text5
5 some string6 description6 more text6
Here is my code:
backfill_header_list = [
'Name1',
'Name2'
]
for i in backfill_header_list:
df.loc[:, i] = df.loc[:, i].fillna(method='ffill')
For this example, Name2
is the repeated column name.
Desired output:
name1 name2 name3 name4 name2 name6
1 10 some string2 description2 36 more text2
1 10 some string3 description3 36 more text3
2 14 some string4 description4 36 more text4
2 18 some string5 description5 44 more text5
5 18 some string6 description6 44 more text6
Is there an efficient way to have pandas iterate through all columns that match that name?
CodePudding user response:
If you already have the list of names to be matched, just referencing will be enough.
data = {'a':[1,np.nan,2,np.nan,5],'b':[10,np.nan,14,18,np.nan],'c':['somestrings']*5,'d':['descriptions']*5,'e':[36,np.nan,np.nan,44,np.nan], 'f':['more texts']*5}
df= pd.DataFrame(data)
df.columns = ['name1','name2','name3','name4','name2','name6']
df.loc[:,backfill_header_list].fillna(method='ffill')
name1 name2 name2
0 1.0 10.0 36.0
1 1.0 10.0 36.0
2 2.0 14.0 36.0
3 2.0 18.0 44.0
4 5.0 18.0 44.0
However, if you want to assign it back to your original dataframe, it can't be done due to the same multiple column names.
df.loc[:,backfill_header_list] = df.loc[:,backfill_header_list].fillna(method='ffill')
ValueError: Setting with non-unique columns is not allowed.
So, what you can do is using df.apply
.
df[backfill_header_list] = df[backfill_header_list].apply(lambda x : x.fillna(method='ffill'))
name1 name2 name3 name4 name2 name6
0 1.0 10.0 somestrings descriptions 36.0 more texts
1 1.0 10.0 somestrings descriptions 36.0 more texts
2 2.0 14.0 somestrings descriptions 36.0 more texts
3 2.0 18.0 somestrings descriptions 44.0 more texts
4 5.0 18.0 somestrings descriptions 44.0 more texts
However, it's better not to have the same column names in your dataframe because it will certainly cause you many problems later on.
CodePudding user response:
In case anyone else comes across this problem, here's what I came up with.
I essentially created a function to compare the headers to my backfill_header_list
and create a list of indexes to loop through in my backfill for loop.
headers = list(df.columns)
backfill_header_list = [
'Name1',
'Name2'
backfill_index_list = []
for i, col in enumerate(headers):
if col in backfill_header_list:
backfill_index_list.append(i)
for i in backfill_index_list:
df.iloc[:, i] = df.iloc[:, i].fillna(method='ffill')
CodePudding user response:
I think you can use this code below it simialr to @Kevin answer
df.loc[:,df.columns].fillna(method='ffill')
if you want to modify last column of you duplicated column you can use
df.loc[:,~df.T.duplicated(keep='last')]..fillna(method='ffill')
but as all comments
Duplicate column names are a problem if you plan to transfer your data set to another statistical language. They're also a problem because it will cause unanticipated and sometimes difficult to debug problems in Python.