How to de-duplicate values within a column while groupby another column in Python?

Time:09-06

I am new to Python and do not know how to combine groupby and duplicated to solve my problem. Within each Year group, I need to de-duplicate the Negative values in column Result: keep only the first Negative value and set the rest to NaN (see the tables below).

Here is my DataFrame:

import pandas as pd

df = pd.DataFrame({
    "Year": ["2004", "2004", "2004", "2005", "2005", "2005", "2005", "2003", "2003", "2003", "2003"],
    "Result": ["NaN", "Negative", "Negative", "Negative", "NaN", "Negative", "NaN", "Negative", "NaN", "Negative", "NaN"],
})

I used this code, which doesn't work:

df['Result'] = df.groupby(['Year'])['Result'].duplicated()

The original table looks like this:

Year Result
2004 NaN
2004 Negative
2004 Negative
2005 Negative
2005 NaN
2005 Negative
2005 NaN
2003 Negative
2003 NaN
2003 Negative
2003 NaN

I want to de-duplicate the 'Negative' values in the 'Result' column within each 'Year' and update the 'Result' column so that it looks like this:

Year Result
2004 NaN
2004 Negative
2004 NaN
2005 Negative
2005 NaN
2005 NaN
2005 NaN
2003 Negative
2003 NaN
2003 NaN
2003 NaN

CodePudding user response:

Use DataFrame.duplicated with Series.mask:

df['Result'] = df['Result'].mask(df.duplicated(['Year','Result']))
print(df)
    Year    Result
0   2004       NaN
1   2004  Negative
2   2004       NaN
3   2005  Negative
4   2005       NaN
5   2005       NaN
6   2005       NaN
7   2003  Negative
8   2003       NaN
9   2003       NaN
10  2003       NaN
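
For anyone who wants to run this end to end, here is a minimal sketch under two assumptions that are not in the question: missing entries are stored as real np.nan rather than the string "NaN", and only rows labelled "Negative" should be blanked. The names repeat, masked and grouped are only illustrative. The second variant shows one way the groupby idea from the question could be made to work.

```python
import numpy as np
import pandas as pd

# Same rows as in the question, but with real missing values (np.nan)
# instead of the string "NaN"; this substitution is an assumption.
df = pd.DataFrame({
    "Year": ["2004", "2004", "2004", "2005", "2005", "2005", "2005",
             "2003", "2003", "2003", "2003"],
    "Result": [np.nan, "Negative", "Negative", "Negative", np.nan, "Negative",
               np.nan, "Negative", np.nan, "Negative", np.nan],
})

# Variant 1: flag repeated (Year, Result) pairs, but only blank out rows
# whose Result is "Negative"; any other label would be left untouched.
repeat = df.duplicated(["Year", "Result"]) & df["Result"].eq("Negative")
masked = df["Result"].mask(repeat)

# Variant 2: the same idea written with groupby. transform applies the
# lambda to each Year group and aligns the result with the original index.
# Note that this blanks every repeated value in a group, not only "Negative".
grouped = df.groupby("Year")["Result"].transform(
    lambda s: s.mask(s.duplicated())
)

print(df.assign(Result=masked))
```

On this data both variants keep "Negative" only at the first occurrence in each year. They would differ if a group contained another repeated label, because variant 2 would blank those repeats as well.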