Group by a column and compare dates: Pandas-CodePudding

I have the following data frame.

ID  Date1        Date2
1   7-12-2021    20-11-2021
1   10-11-2021   01-12-2021
2   22-10-2021   03-12-2021
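
For anyone who wants to reproduce this, the frame can be built roughly as follows. This is only a sketch: it assumes the dates are stored as plain day-month-year strings, which the post does not state explicitly.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: the date columns are strings, not yet datetimes.
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Date1": ["7-12-2021", "10-11-2021", "22-10-2021"],
    "Date2": ["20-11-2021", "01-12-2021", "03-12-2021"],
})

print(df)
print(df.dtypes)
```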

My idea is: for duplicated values in the ID column, compare the two dates and keep the row only if Date1 is earlier than Date2. If the ID value is unique, no comparison is needed and the row is kept as it is.

I would like to get the following output.

ID  Date1        Date2
1   10-11-2021   01-12-2021
2   22-10-2021   03-12-2021

I tried the following, but it did not work.

df = df.groupby(['ID'])[(df['Date1']) < (df['Date2'])]

Can anyone help me with this?

CodePudding user response:

I would start by making sure your date columns are of datetime type, then check for duplicated values in the ID column and drop a duplicated row if its Date2 precedes Date1:

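import numpy as np
import pandas as pd

# Convert to datetime; the sample dates are day-month-year
# (20-11-2021 can only be 20 November), so state the format explicitly
df['Date1'] = pd.to_datetime(df['Date1'], format='%d-%m-%Y')
df['Date2'] = pd.to_datetime(df['Date2'], format='%d-%m-%Y')

# Mark what you need to drop: duplicated ID and Date2 before Date1
df.loc[df.ID.duplicated(keep=False), 'ind'] = 'dup'
df['ind'] = np.where(df['ind'].eq('dup') & (df['Date2'] < df['Date1']), 'Drop', 'Keep')

>>> print(df.loc[df['ind'].eq('Keep')].drop('ind', axis=1))

   ID      Date1      Date2
1   1 2021-11-10 2021-12-01
2   2 2021-10-22 2021-12-03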
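
The same filter can also be written as a single boolean mask, without the helper column. This is only a sketch under stated assumptions: the frame is rebuilt from the sample data in the question, the dates are parsed with an explicit day-month-year format (so 10-11-2021 is read as 10 November), and a duplicated row is dropped when Date2 precedes Date1. The names is_dup and date2_first are introduced here for illustration.

```python
import pandas as pd

# Sample data from the question (assumed to be day-month-year strings)
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Date1": ["7-12-2021", "10-11-2021", "22-10-2021"],
    "Date2": ["20-11-2021", "01-12-2021", "03-12-2021"],
})

# Parse with an explicit format so pandas does not have to guess
for col in ("Date1", "Date2"):
    df[col] = pd.to_datetime(df[col], format="%d-%m-%Y")

is_dup = df["ID"].duplicated(keep=False)   # True for IDs that occur more than once
date2_first = df["Date2"] < df["Date1"]    # True where Date2 precedes Date1

# Drop only the rows that are both duplicated and have Date2 before Date1
out = df[~(is_dup & date2_first)]
print(out)
```

Because nothing is written back to df, there is no temporary column to clean up afterwards.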

CodePudding user response:

You can create a dummy variable Keep to compare dates, create a mask for duplicate values and use boolean indexing:

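import pandas as pd

# The sample dates are day-month-year, so parse them with an explicit format
df['Date1'] = pd.to_datetime(df['Date1'], format='%d-%m-%Y')
df['Date2'] = pd.to_datetime(df['Date2'], format='%d-%m-%Y')

# Keep is True where Date1 is earlier than Date2
df['Keep'] = df['Date1'] < df['Date2']
# mask is True for rows whose ID occurs more than once
mask = df['ID'].duplicated(keep=False)
# Duplicated IDs must pass the date check; unique IDs are always kept
mask = (mask & df['Keep']) | ~mask
out = df[mask].drop('Keep', axis=1)

Output:

   ID      Date1      Date2
1   1 2021-11-10 2021-12-01
2   2 2021-10-22 2021-12-03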
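
Since the question asks about grouping, a groupby-based variant is also possible. The following is a sketch, not a definitive solution: it rebuilds the sample frame from the question and assumes the dates are day-month-year strings. Under that reading, the expected output corresponds to keeping duplicated rows where Date1 is earlier than Date2. The name group_size is introduced here for illustration.

```python
import pandas as pd

# Sample data from the question (assumed to be day-month-year strings)
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Date1": ["7-12-2021", "10-11-2021", "22-10-2021"],
    "Date2": ["20-11-2021", "01-12-2021", "03-12-2021"],
})
df["Date1"] = pd.to_datetime(df["Date1"], format="%d-%m-%Y")
df["Date2"] = pd.to_datetime(df["Date2"], format="%d-%m-%Y")

# Number of rows sharing each ID, broadcast back to every row
group_size = df.groupby("ID")["ID"].transform("size")

# Unique IDs are kept as they are; duplicated IDs must have Date1 before Date2
out = df[(group_size == 1) | (df["Date1"] < df["Date2"])]
print(out)
```

Here transform("size") returns a Series aligned with the original rows, so it can be combined directly with the date comparison in one boolean expression.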