Pandas Group by / deduplicate with condition

Time:06-29

I have a DataFrame with duplicates. I'd like to remove the duplicates using groupby and a condition.

import pandas as pd
import numpy as np

df = pd.DataFrame({
               'id': [0, 1, 2, 3, 4, 5],
               'nm': ['A','A','A','B','B','B'],
               'Rev': ['$10','$20','$30','$40','$50','$60'],
               'Exp': ['$2','$4','$6','$8','$10','$12'],
               'Dt': ['2019-03-01', '2020-09-30', np.nan, '2021-09-30', '2022-04-01', ' ']
             })

After deduplication, I'd like to keep, for each group in nm, the row with the most recent date. Note that a date may be missing, either as a blank string ' ' or as np.nan.
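To illustrate the missing-date cases mentioned above, here is a minimal sketch of how they could be normalized before any comparison. It assumes that parsing with pd.to_datetime and errors='coerce' is acceptable; the names raw and parsed are illustrative only and do not appear in the question.

```python
import numpy as np
import pandas as pd

# A small sample with the three kinds of values described above:
# a valid date, np.nan, and a blank string.
raw = pd.Series(['2019-03-01', np.nan, ' '])

# errors='coerce' turns anything that cannot be parsed into NaT
# instead of raising, so both missing cases end up as the same value.
parsed = pd.to_datetime(raw, errors='coerce')

# Boolean mask of entries that have no usable date.
missing = parsed.isna()
print(missing.tolist())
```

Once both cases are NaT, they can be treated uniformly, for example by sorting them last or dropping them.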

Expected Output:

id nm Rev Exp Dt  
1  A  $20 $4  2020-09-30
4  B  $50 $10 2022-04-01

CodePudding user response:

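First convert the Dt column to datetime objects, then use sort_values followed by drop_duplicates:

# Parse the dates; the blank string and np.nan both become NaT
df['Dt'] = pd.to_datetime(df['Dt'], errors='coerce')
# Latest date first (NaT sorts last), then keep the first row per nm
out = df.sort_values('Dt', ascending=False).drop_duplicates('nm')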
out
Out[231]: 
   id nm  Rev  Exp         Dt
4   4  B  $50  $10 2022-04-01
1   1  A  $20   $4 2020-09-30
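Since the question mentions groupby, the same result can also be sketched with groupby and idxmax. This is only an alternative under stated assumptions, not the answer's method: rows without a parseable date are dropped first, so a group with no valid date at all would not appear in the result. The names valid and latest are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4, 5],
    'nm': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Rev': ['$10', '$20', '$30', '$40', '$50', '$60'],
    'Exp': ['$2', '$4', '$6', '$8', '$10', '$12'],
    'Dt': ['2019-03-01', '2020-09-30', np.nan, '2021-09-30', '2022-04-01', ' '],
})

# Parse the dates; the blank string and np.nan both become NaT.
df['Dt'] = pd.to_datetime(df['Dt'], errors='coerce')

# Drop rows without a usable date so idxmax never sees an all-NaT group.
valid = df.dropna(subset=['Dt'])

# For each nm, idxmax returns the index label of the row with the latest date;
# .loc then selects exactly those rows.
latest = valid.loc[valid.groupby('nm')['Dt'].idxmax()]
print(latest)
```

Unlike the sort-based version, the rows come back ordered by group key (A, then B) rather than by date.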

CodePudding user response:

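Try the following. It sorts the raw Dt strings directly, which works here because dates written as YYYY-MM-DD sort chronologically as text; for any other date format, convert the column with pd.to_datetime first.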

df = df.sort_values(by="Dt", ascending=False).drop_duplicates('nm').sort_values('Dt').reset_index(drop=True)

Output

    id  nm  Rev Exp    Dt
0   1   A   $20 $4  2020-09-30
1   4   B   $50 $10 2022-04-01