Home > Mobile >  Remove duplicates and keep row that certain column is Yes in a pandas dataframe
Remove duplicates and keep row that certain column is Yes in a pandas dataframe

Time:09-09

I have a dataframe with duplicated values on column "ID", like this one:

ID  Name    Street  Birth       Job     Primary?
1   Fake1   Street1 2000-01-01  Job1    Yes
2   Fake2   Street2 2000-01-02  Job2    No
3   Fake3   Street3 2000-01-03  Job3    Yes
1   Fake1   Street1 2000-01-01  Job4    No
2   Fake2   Street2 2000-01-02  Job5    Yes
4   Fake4   Street4 2000-01-03  Job6    Yes
1   Fake1   Street1 2000-01-01  Job7    No

I need a way to remove duplicates (by "ID") but keep the ones that the column Primary is "Yes" (all unique values have "Yes" in that column and duplicated values have one record as "Yes" and all others as "No") resulting in this dataframe:

ID  Name    Street  Birth       Job     Primary?
1   Fake1   Street1 2000-01-01  Job1    Yes
3   Fake3   Street3 2000-01-03  Job3    Yes
2   Fake2   Street2 2000-01-02  Job5    Yes
4   Fake4   Street4 2000-01-03  Job6    Yes

What is the best way to do it?

Thanks!

CodePudding user response:

Use DataFrame.sort_values - Yes rows are in end of DataFrame, so possible use DataFrame.drop_duplicates with keep='last' - this solution should return Primary?=No if exist some ID without Primary?=Yes values:

df = df.sort_values('Primary?').drop_duplicates('ID', keep='last')
print (df)
   ID   Name   Street       Birth   Job Primary?
0   1  Fake1  Street1  2000-01-01  Job1      Yes
2   3  Fake3  Street3  2000-01-03  Job3      Yes
4   2  Fake2  Street2  2000-01-02  Job5      Yes
5   4  Fake4  Street4  2000-01-03  Job6      Yes

CodePudding user response:

  1. Filter the Primary column from dataframe for 'yes':
df = df[df['Primary?']=='yes']
  1. Then, drop duplicates from the filtered dataframe.

df = df.drop_duplicates(subset= ['ID'])

  • Related