In the following example, I have been trying to drop the duplicate that is == false. I mean when id and year match with more than 1 row (subset =[id, year])
df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'],
'Year': [2000, 2000, 2003, 2004, 2004, 2002, 2001, 2003], 'Boolean':['false', 'true', 'true', 'true', 'false', 'true', 'true', 'true']})
print(df)
# Output
id Year Boolean
0 1 2000 false
1 1 2000 true
2 1 2003 true
3 2 2004 true
4 2 2004 false
5 3 2002 true
6 4 2001 true
7 4 2003 true
# Attempted code
df2 = df.loc[df['Year'].eq('false').groupby([df.id]).idxmax()]
print(df2)
id Year Boolean
0 1 2000 false
3 2 2004 true
5 3 2002 true
6 4 2001 true
This code is incorrect as I am trying to keep the Boolean == false observations when there is a duplicate. It is also dropping other observations that are not duplicates. Not sure if it is easier to use the duplicate function for this.
CodePudding user response:
"I have been trying to drop the duplicate that is == false. I mean when id and year match with more than 1 row (subset =[id, year])" -> so you need to group by both ID and Year (and use the correct column as boolean source):
df.loc[df['Boolean'].eq('false').groupby([df['id'], df['Year']]).idxmax()]
output:
id Year Boolean
0 1 2000 false
2 1 2003 true
4 2 2004 false
5 3 2002 true
6 4 2001 true
7 4 2003 true
booleans
Note that it would be much better to use real booleans (faster, less memory, shorter syntax…)
# convert to booleans
df['Boolean'] = df['Boolean'].eq('true')
df.loc[df.groupby(['id', 'Year'])['Boolean'].idxmin()]
id Year Boolean
0 1 2000 False
2 1 2003 True
4 2 2004 False
5 3 2002 True
6 4 2001 True
7 4 2003 True