Home > Software engineering >  Remove duplicate rows from pandas dataframe using specific condition
Remove duplicate rows from pandas dataframe using specific condition

Time:10-20

I know there are lot of questions about removing duplicates from pandas dataframe but this is bit different.

I am trying to remove duplicates from the dataframe but not getting the actual output as in the below given result dataframe. Actually the data in table is too long. For understanding purpose I have given the dummy data here in the table.

Condition:-

I need to remove duplicates and get the rows that contain max value from diast column.

Is there a good way to get result dataframe using given df.

Any help would be appreciated. Thanks :)

DF:-

age syst diast a b c d
29 90 57 MO MO 0 MO
29 90 58 MO MO 0 MO
29 90 59 MO MO 0 MO
29 90 60 MO MO 0 MO
29 90 61 0 0 0 0
29 90 62 0 0 0 0
29 90 63 0 0 0 0
29 90 64 0 0 0 0
29 90 65 MO MO 0 MO
29 90 66 MO MO 0 MO
29 90 67 MO MO 0 MO
29 90 68 MO MO 0 MO

Result:-

age syst diast a b c d
29 90 60 MO MO 0 MO
29 90 64 0 0 0 0
29 90 68 MO MO 0 MO

CodePudding user response:

can you try this:

df['id']=df.groupby(['age', 'syst', 'a', 'b', 'c', 'd']).ngroup()
df['id2']=df['id'].shift(-1)

df2=df.drop_duplicates(subset=['age', 'syst', 'a', 'b', 'c', 'd','id','id2'],keep=False).drop(['id','id2'],axis=1)
print(df2)
'''
    age  syst  diast   a   b  c   d
3    29    90     60  MO  MO  0  MO
7    29    90     64   0   0  0   0
11   29    90     68  MO  MO  0  MO
'''

CodePudding user response:

# create a flag to separate out the group based on column 'a'
# a is the only column that distinguishes the group
df['flag'] = np.nan
df['flag']=df['flag'].mask(df['a'].ne(df['a'].shift()), 1).cumsum().ffill()

# sort, drop duplicates, keep flag as one of the column
# finally drop the flag column
(df.sort_values(['age','syst','diast'])
 .drop_duplicates(subset=['age','syst', 'a','b','c','d','flag'], keep='last')
 .drop(columns='flag'))
    age     syst    diast   a   b   c   d
3    29     90         60   MO  MO  0   MO
7    29     90         64   0   0   0   0
11   29     90         68   MO  MO  0   MO
  • Related