I have a DataFrame with duplicates. I'd like to remove the duplicates using groupby and a condition.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': [0, 1, 2, 3, 4, 5],
'nm': ['A','A','A','B','B','B'],
'Rev': ['$10','$20','$30','$40','$50','$60'],
'Exp': ['$2','$4','$6','$8','$10','$12'],
'Dt': ['2019-03-01', '2020-09-30', np.nan, '2021-09-30', '2022-04-01', ' ']
})
Upon deduplication, I'd like to retain the row with the most recent date. So, for each group nm, retain the row with the most recent date. Note that dates may be a blank ' ' (empty string) or np.nan.
Expected Output:
id nm Rev Exp Dt
1 A $20 $4 2020-09-30
4 B $50 $10 2022-04-01
CodePudding user response:
We need to first convert the Dt column to datetime objects, then use sort_values with drop_duplicates:
df['Dt'] = pd.to_datetime(df['Dt'], errors='coerce')  # blank ' ' and np.nan both become NaT
out = df.sort_values('Dt', ascending=False).drop_duplicates('nm')  # keep newest row per nm
out
Out[231]:
id nm Rev Exp Dt
4 4 B $50 $10 2022-04-01
1 1 A $20 $4 2020-09-30
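Putting the question's setup together with this answer, a complete runnable sketch looks like the following. The key point is that errors='coerce' turns both np.nan and the blank string ' ' into NaT, and NaT sorts last by default, so those rows can never win over a real date.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4, 5],
    'nm': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Rev': ['$10', '$20', '$30', '$40', '$50', '$60'],
    'Exp': ['$2', '$4', '$6', '$8', '$10', '$12'],
    'Dt': ['2019-03-01', '2020-09-30', np.nan, '2021-09-30', '2022-04-01', ' '],
})

# Unparseable values (np.nan, ' ') are coerced to NaT instead of raising.
df['Dt'] = pd.to_datetime(df['Dt'], errors='coerce')

# Descending sort puts each group's most recent date first (NaT rows last);
# drop_duplicates keeps the first occurrence of each 'nm'.
out = df.sort_values('Dt', ascending=False).drop_duplicates('nm')
print(out)
```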
CodePudding user response:
Try the following (assuming Dt has already been converted to datetime as in the answer above):
df = (df.sort_values(by="Dt", ascending=False)
        .drop_duplicates('nm')
        .sort_values('Dt')
        .reset_index(drop=True))
Output
id nm Rev Exp Dt
0 1 A $20 $4 2020-09-30
1 4 B $50 $10 2022-04-01
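An alternative sketch that avoids the sort entirely is groupby with idxmax, which returns the row label of each group's most recent date while skipping NaT. This assumes every nm group contains at least one parseable date; an all-NaT group would make idxmax fail.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4, 5],
    'nm': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Rev': ['$10', '$20', '$30', '$40', '$50', '$60'],
    'Exp': ['$2', '$4', '$6', '$8', '$10', '$12'],
    'Dt': ['2019-03-01', '2020-09-30', np.nan, '2021-09-30', '2022-04-01', ' '],
})
df['Dt'] = pd.to_datetime(df['Dt'], errors='coerce')

# For each 'nm' group, idxmax gives the index label of the maximum Dt
# (NaT values are skipped); .loc then selects those rows.
out = df.loc[df.groupby('nm')['Dt'].idxmax()]
print(out)
```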