Drop the duplicate where boolean column = False-CodePudding

In the following example, I have been trying to drop the duplicate that is == false. I mean when id and year match with more than 1 row (subset =[id, year])

df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'],
                    'Year': [2000, 2000, 2003, 2004, 2004, 2002, 2001, 2003], 'Boolean':['false', 'true', 'true', 'true', 'false', 'true', 'true', 'true']})
print(df)

# Output

  id  Year Boolean
0  1  2000   false
1  1  2000    true
2  1  2003    true
3  2  2004    true
4  2  2004   false
5  3  2002    true
6  4  2001    true
7  4  2003    true

# Attempted code

df2 = df.loc[df['Year'].eq('false').groupby([df.id]).idxmax()]
print(df2)

  id  Year Boolean
0  1  2000   false
3  2  2004    true
5  3  2002    true
6  4  2001    true

This code is incorrect as I am trying to keep the Boolean == false observations when there is a duplicate. It is also dropping other observations that are not duplicates. Not sure if it is easier to use the duplicate function for this.

CodePudding user response：

"I have been trying to drop the duplicate that is == false. I mean when id and year match with more than 1 row (subset =[id, year])" -> so you need to group by both ID and Year (and use the correct column as boolean source):

df.loc[df['Boolean'].eq('false').groupby([df['id'], df['Year']]).idxmax()]

output:

  id  Year Boolean
0  1  2000   false
2  1  2003    true
4  2  2004   false
5  3  2002    true
6  4  2001    true
7  4  2003    true

booleans

Note that it would be much better to use real booleans (faster, less memory, shorter syntax…)

# convert to booleans
df['Boolean'] = df['Boolean'].eq('true')

df.loc[df.groupby(['id', 'Year'])['Boolean'].idxmin()]

  id  Year  Boolean
0  1  2000    False
2  1  2003     True
4  2  2004    False
5  3  2002     True
6  4  2001     True
7  4  2003     True