I know there are lot of questions about removing duplicates from pandas dataframe but this is bit different.
I am trying to remove duplicates from the dataframe but not getting the actual output as in the below given result dataframe. Actually the data in table is too long. For understanding purpose I have given the dummy data here in the table.
Condition:-
I need to remove duplicates and get the rows that contain max value from diast column.
Is there a good way to get result dataframe using given df.
Any help would be appreciated. Thanks :)
DF:-
age | syst | diast | a | b | c | d |
---|---|---|---|---|---|---|
29 | 90 | 57 | MO | MO | 0 | MO |
29 | 90 | 58 | MO | MO | 0 | MO |
29 | 90 | 59 | MO | MO | 0 | MO |
29 | 90 | 60 | MO | MO | 0 | MO |
29 | 90 | 61 | 0 | 0 | 0 | 0 |
29 | 90 | 62 | 0 | 0 | 0 | 0 |
29 | 90 | 63 | 0 | 0 | 0 | 0 |
29 | 90 | 64 | 0 | 0 | 0 | 0 |
29 | 90 | 65 | MO | MO | 0 | MO |
29 | 90 | 66 | MO | MO | 0 | MO |
29 | 90 | 67 | MO | MO | 0 | MO |
29 | 90 | 68 | MO | MO | 0 | MO |
Result:-
age | syst | diast | a | b | c | d |
---|---|---|---|---|---|---|
29 | 90 | 60 | MO | MO | 0 | MO |
29 | 90 | 64 | 0 | 0 | 0 | 0 |
29 | 90 | 68 | MO | MO | 0 | MO |
CodePudding user response:
can you try this:
df['id']=df.groupby(['age', 'syst', 'a', 'b', 'c', 'd']).ngroup()
df['id2']=df['id'].shift(-1)
df2=df.drop_duplicates(subset=['age', 'syst', 'a', 'b', 'c', 'd','id','id2'],keep=False).drop(['id','id2'],axis=1)
print(df2)
'''
age syst diast a b c d
3 29 90 60 MO MO 0 MO
7 29 90 64 0 0 0 0
11 29 90 68 MO MO 0 MO
'''
CodePudding user response:
# create a flag to separate out the group based on column 'a'
# a is the only column that distinguishes the group
df['flag'] = np.nan
df['flag']=df['flag'].mask(df['a'].ne(df['a'].shift()), 1).cumsum().ffill()
# sort, drop duplicates, keep flag as one of the column
# finally drop the flag column
(df.sort_values(['age','syst','diast'])
.drop_duplicates(subset=['age','syst', 'a','b','c','d','flag'], keep='last')
.drop(columns='flag'))
age syst diast a b c d
3 29 90 60 MO MO 0 MO
7 29 90 64 0 0 0 0
11 29 90 68 MO MO 0 MO