I have a dataframe shown below
pid date label age
0 0001 2001-09-24 False 34
1 0001 2006-03-02 True 39
2 0001 2006-03-02 True 39
3 0002 2003-02-07 True 23
4 0002 2004-08-02 True 24
5 0002 2004-08-02 False 24
6 0003 2001-05-25 False 25
7 0003 2001-05-25 False 25
Which I want to aggregate based on both pid
and date
, where if multiple rows have the same pid
and data
then age
stays the same and label
= True if any or all of these rows have True label:
pid date label age
0 0001 2001-09-24 False 34
1 0001 2006-03-02 True 39
3 0002 2003-02-07 True 23
4 0002 2004-08-02 True 24
5 0003 2001-05-25 False 25
I have tried groupby
but all I could do is with one column - pid
and no success with two columns.
CodePudding user response:
In your case sort_values
drop_duplicates
out = df.sort_values('label').drop_duplicates(['pid','date','age'],keep='last').sort_index()
Out[240]:
pid date label age
0 1 2001-09-24 False 34
2 1 2006-03-02 True 39
3 2 2003-02-07 True 23
4 2 2004-08-02 True 24
7 3 2001-05-25 False 25