I am new to Python and do not know how to combine groupby and duplicated functions to solve my problem. I need to de-duplicate Negative values in column Result and keep only the first-appearing Negative value and set the rest to NaN while grouping by column Year (see the table below).
Here is my DataFrame:
import pandas as pd
df = pd.DataFrame( { "Year" : ["2004", "2004", "2004", "2005", "2005","2005", "2005", "2003","2003", "2003", "2003"] , "Result" : ["NaN", "Negative", "Negative", "Negative", "NaN", "Negative", "NaN","Negative", "NaN", "Negative", "NaN"] } )
I tried this code, which doesn't work:
df['Result'] = df.groupby(['Year'])['Result'].duplicated()
The original table looks like this:
Year | Result |
---|---|
2004 | NaN |
2004 | Negative |
2004 | Negative |
2005 | Negative |
2005 | NaN |
2005 | Negative |
2005 | NaN |
2003 | Negative |
2003 | NaN |
2003 | Negative |
2003 | NaN |
But I want to de-duplicate 'Negative' values in the 'Result' column, grouped by 'Year', and update the 'Result' column, so it looks like below:
Year | Result |
---|---|
2004 | NaN |
2004 | Negative |
2004 | NaN |
2005 | Negative |
2005 | NaN |
2005 | NaN |
2005 | NaN |
2003 | Negative |
2003 | NaN |
2003 | NaN |
2003 | NaN |
Answer:
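Use DataFrame.duplicated with Series.mask. No groupby is needed: duplicated(['Year','Result']) flags every row whose Year/Result pair has already appeared earlier, and mask replaces the flagged values with NaN: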
df['Result'] = df['Result'].mask(df.duplicated(['Year','Result']))
print (df)
    Year    Result
0   2004       NaN
1   2004  Negative
2   2004       NaN
3   2005  Negative
4   2005       NaN
5   2005       NaN
6   2005       NaN
7   2003  Negative
8   2003       NaN
9   2003       NaN
10  2003       NaN
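For anyone who wants to try this end to end, here is a minimal self-contained sketch. It makes two assumptions that are not in the question: missing values are real np.nan rather than the string "NaN", and the names raw, flat and grouped are just illustrative. It also includes a groupby-based variant, since the question asked about groupby; as far as I can tell, both give the same result on this data.

```python
import numpy as np
import pandas as pd

# Sample data from the question; np.nan instead of the string "NaN" is an assumption
raw = pd.DataFrame({
    "Year": ["2004", "2004", "2004", "2005", "2005", "2005",
             "2005", "2003", "2003", "2003", "2003"],
    "Result": [np.nan, "Negative", "Negative", "Negative", np.nan, "Negative",
               np.nan, "Negative", np.nan, "Negative", np.nan],
})

# Variant 1: flag rows whose (Year, Result) pair was already seen, then blank them out
is_repeat = raw.duplicated(["Year", "Result"])
flat = raw["Result"].mask(is_repeat)

# Variant 2: the same idea per group, using Series.duplicated inside each Year
grouped = raw.groupby("Year")["Result"].transform(lambda s: s.mask(s.duplicated()))

df = raw.assign(Result=flat)
print(df)
```

Variant 1 is usually the simpler choice because it avoids calling a Python lambda once per group. Variant 2 may be easier to adapt if the per-group logic becomes more involved.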