I have a dataset like this:
ID Year Day
1 2001 150
2 2001 140
3 2001 120
3 2002 160
3 2002 160
3 2017 75
3 2017 75
4 2017 80
I would like to drop the duplicates within each year, but keep those where the year differs. The end result would be this:
1 2001 150
2 2001 140
3 2001 120
3 2002 160
3 2017 75
4 2017 80
I tried to do something like this in python with my pandas dataframe:
import pandas as pd
data = pd.read_csv('data.csv')
data = data.drop_duplicates(subset=['ID'], keep='first')
But this also deletes rows that share an ID across different years, and I would like to keep those.
How do I drop duplicates only within the same year?
CodePudding user response:
Add 'Year' to your subset, so that a row only counts as a duplicate when both ID and Year match:
data.drop_duplicates(subset = ['ID','Year'], keep = 'first')
ID Year Day
0 1 2001 150
1 2 2001 140
2 3 2001 120
3 3 2002 160
5 3 2017 75
7 4 2017 80
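To try this without the CSV file, here is a minimal sketch that rebuilds the question's sample data inline. The inline DataFrame and the name `deduped` are assumptions for illustration; the final `reset_index` is optional and only renumbers the rows.

```python
import pandas as pd

# Sample data from the question, built inline rather than read from data.csv.
data = pd.DataFrame({
    "ID":   [1, 2, 3, 3, 3, 3, 3, 4],
    "Year": [2001, 2001, 2001, 2002, 2002, 2017, 2017, 2017],
    "Day":  [150, 140, 120, 160, 160, 75, 75, 80],
})

# A row counts as a duplicate only if both ID and Year match an earlier row.
deduped = data.drop_duplicates(subset=["ID", "Year"], keep="first")

# Optional: renumber the index so it runs 0..n-1 after rows are dropped.
deduped = deduped.reset_index(drop=True)
print(deduped)
```

Note that `drop_duplicates` returns a new DataFrame, so the result has to be assigned back (as above) if you want to keep it.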
CodePudding user response:
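Another possible solution, which groups by year and drops rows that are identical in every column within each group (df is your dataframe):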
df.groupby('Year').apply(lambda g: g.drop_duplicates()).reset_index(drop=True)
Output:
ID Year Day
0 1 2001 150
1 2 2001 140
2 3 2001 120
3 3 2002 160
4 3 2017 75
5 4 2017 80
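One thing worth checking is whether the two approaches agree on your real data. Comparing whole rows and comparing only ID and Year give the same result on the sample above, but they can differ when the same ID appears twice in a year with different Day values. The sketch below uses made-up data (an assumption, not from the question) to show the difference. Since Year is itself a column, a plain `df.drop_duplicates()` should keep the same rows as the per-year whole-row comparison, so it is used here in place of the groupby.

```python
import pandas as pd

# Hypothetical data: ID 3 appears twice in 2002 with different Day values.
df = pd.DataFrame({
    "ID":   [3, 3, 3],
    "Year": [2002, 2002, 2017],
    "Day":  [160, 161, 75],
})

# Whole-row comparison: the two 2002 rows differ in Day, so both are kept.
by_row = df.drop_duplicates()

# Key-based comparison: only ID and Year are compared, so one 2002 row is dropped.
by_key = df.drop_duplicates(subset=["ID", "Year"], keep="first")

print(len(by_row), len(by_key))  # prints: 3 2
```

If you want at most one row per ID per year regardless of Day, the `subset` version is the one to use.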