Home > Software design >  Drop the duplicate where boolean column = False
Drop the duplicate where boolean column = False

Time:05-27

In the following example, I have been trying to drop the duplicate that is == false. I mean when id and year match with more than 1 row (subset =[id, year])

df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'],
                    'Year': [2000, 2000, 2003, 2004, 2004, 2002, 2001, 2003], 'Boolean':['false', 'true', 'true', 'true', 'false', 'true', 'true', 'true']})
print(df)

# Output

  id  Year Boolean
0  1  2000   false
1  1  2000    true
2  1  2003    true
3  2  2004    true
4  2  2004   false
5  3  2002    true
6  4  2001    true
7  4  2003    true
# Attempted code

df2 = df.loc[df['Year'].eq('false').groupby([df.id]).idxmax()]
print(df2)

  id  Year Boolean
0  1  2000   false
3  2  2004    true
5  3  2002    true
6  4  2001    true

This code is incorrect as I am trying to keep the Boolean == false observations when there is a duplicate. It is also dropping other observations that are not duplicates. Not sure if it is easier to use the duplicate function for this.

CodePudding user response:

"I have been trying to drop the duplicate that is == false. I mean when id and year match with more than 1 row (subset =[id, year])" -> so you need to group by both ID and Year (and use the correct column as boolean source):

df.loc[df['Boolean'].eq('false').groupby([df['id'], df['Year']]).idxmax()]

output:

  id  Year Boolean
0  1  2000   false
2  1  2003    true
4  2  2004   false
5  3  2002    true
6  4  2001    true
7  4  2003    true

booleans

Note that it would be much better to use real booleans (faster, less memory, shorter syntax…)

# convert to booleans
df['Boolean'] = df['Boolean'].eq('true')

df.loc[df.groupby(['id', 'Year'])['Boolean'].idxmin()]

  id  Year  Boolean
0  1  2000    False
2  1  2003     True
4  2  2004    False
5  3  2002     True
6  4  2001     True
7  4  2003     True
  • Related