Drop duplicates in pandas but keep duplicates between years

I have a dataset like this:

ID     Year    Day
 1     2001    150
 2     2001    140
 3     2001    120
 3     2002    160
 3     2002    160
 3     2017     75
 3     2017     75
 4     2017     80

I would like to drop the duplicates within each year, but keep those where the year differs. The end result would be this:

 1     2001    150
 2     2001    140
 3     2001    120
 3     2002    160
 3     2017     75
 4     2017     80

I tried to do something like this in Python with my pandas DataFrame:

import pandas as pd

data = pd.read_csv('data.csv')
data = data.drop_duplicates(subset=['ID'], keep='first')

But this also deletes duplicates between years, which I would like to keep.

How do I keep the duplicates between years?

CodePudding user response：

Add 'Year' to your subset, so that a row only counts as a duplicate when both ID and Year match:

data.drop_duplicates(subset = ['ID','Year'], keep = 'first')

    ID  Year    Day
0   1   2001    150
1   2   2001    140
2   3   2001    120
3   3   2002    160
5   3   2017    75
7   4   2017    80
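
To reproduce this without the CSV file, here is a minimal sketch that builds the sample data from the question inline. The variable name `deduped` is only illustrative, and the DataFrame constructor stands in for `read_csv('data.csv')`, whose contents are assumed to match the table in the question.

```python
import pandas as pd

# Sample data copied from the question, built inline instead of read from data.csv.
data = pd.DataFrame({
    "ID":   [1, 2, 3, 3, 3, 3, 3, 4],
    "Year": [2001, 2001, 2001, 2002, 2002, 2017, 2017, 2017],
    "Day":  [150, 140, 120, 160, 160, 75, 75, 80],
})

# A row is treated as a duplicate only when both ID and Year match an earlier row.
deduped = data.drop_duplicates(subset=["ID", "Year"], keep="first")
print(deduped)
```

Note that drop_duplicates returns a new DataFrame and keeps the original index labels (0, 1, 2, 3, 5, 7 here). Chain .reset_index(drop=True) if you want them renumbered.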

CodePudding user response：

Another possible solution:

df.groupby('Year').apply(lambda g: g.drop_duplicates()).reset_index(drop=True)

Output:

   ID  Year  Day
0   1  2001  150
1   2  2001  140
2   3  2001  120
3   3  2002  160
4   3  2017   75
5   4  2017   80
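
One caveat when comparing the two answers: drop_duplicates() without a subset compares every column, including Day, so the groupby version only removes rows that are identical in full. On the question's data both approaches give the same rows, but they can differ when the same ID appears twice in a year with different Day values. The sketch below uses hypothetical data (not from the question) to show the difference. It calls drop_duplicates() directly rather than through groupby, since Year is itself a column and the rows kept are the same. As far as I know, recent pandas versions also emit a deprecation warning when apply operates on the grouping column.

```python
import pandas as pd

# Hypothetical data: ID 3 appears twice in 2002 with different Day values.
df = pd.DataFrame({
    "ID":   [3, 3, 3],
    "Year": [2002, 2002, 2017],
    "Day":  [160, 165, 75],
})

# Comparing all columns: both 2002 rows are kept because Day differs.
all_columns = df.drop_duplicates().reset_index(drop=True)

# Comparing only ID and Year: the second 2002 row is dropped.
id_and_year = df.drop_duplicates(subset=["ID", "Year"]).reset_index(drop=True)

print(len(all_columns), len(id_and_year))
```

Which one is right depends on whether two measurements for the same ID and Year should count as duplicates even when Day differs.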