I have a dataframe with duplicated values on column "ID", like this one:
ID Name Street Birth Job Primary?
1 Fake1 Street1 2000-01-01 Job1 Yes
2 Fake2 Street2 2000-01-02 Job2 No
3 Fake3 Street3 2000-01-03 Job3 Yes
1 Fake1 Street1 2000-01-01 Job4 No
2 Fake2 Street2 2000-01-02 Job5 Yes
4 Fake4 Street4 2000-01-03 Job6 Yes
1 Fake1 Street1 2000-01-01 Job7 No
I need a way to remove duplicates (by "ID") but keep the ones that the column Primary is "Yes" (all unique values have "Yes" in that column and duplicated values have one record as "Yes" and all others as "No") resulting in this dataframe:
ID Name Street Birth Job Primary?
1 Fake1 Street1 2000-01-01 Job1 Yes
3 Fake3 Street3 2000-01-03 Job3 Yes
2 Fake2 Street2 2000-01-02 Job5 Yes
4 Fake4 Street4 2000-01-03 Job6 Yes
What is the best way to do it?
Thanks!
CodePudding user response:
Use DataFrame.sort_values
- Yes
rows are in end of DataFrame, so possible use DataFrame.drop_duplicates
with keep='last'
- this solution should return Primary?=No
if exist some ID
without Primary?=Yes
values:
df = df.sort_values('Primary?').drop_duplicates('ID', keep='last')
print (df)
ID Name Street Birth Job Primary?
0 1 Fake1 Street1 2000-01-01 Job1 Yes
2 3 Fake3 Street3 2000-01-03 Job3 Yes
4 2 Fake2 Street2 2000-01-02 Job5 Yes
5 4 Fake4 Street4 2000-01-03 Job6 Yes
CodePudding user response:
- Filter the Primary column from dataframe for 'yes':
df = df[df['Primary?']=='yes']
- Then, drop duplicates from the filtered dataframe.
df = df.drop_duplicates(subset= ['ID'])