I have the following data frame.
ID Date1 Date2
1 7-12-2021 20-11-2021
1 10-11-2021 01-12-2021
2 22-10-2021 03-12-2021
When an ID value is duplicated, I want to compare the two dates and keep the row if Date2 is earlier than Date1. If the ID value is unique, no comparison is needed and the row should be kept as it is.
I would like to get the following output.
ID Date1 Date2
1 10-11-2021 01-12-2021
2 22-10-2021 03-12-2021
I tried the following, but it did not work.
df = df.groupby(['ID'])[(df['Date1']) < (df['Date2'])]
Can anyone help me with this?
CodePudding user response:
I would start by making sure the Date columns are of datetime type. Then mark the rows whose ID is duplicated and whose Date2 is later than Date1, and drop them:
import numpy as np
import pandas as pd

# Convert to datetime
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
# Mark what you need to drop
df.loc[df.ID.duplicated(keep=False),'ind'] = 'dup'
df['ind'] = np.where((df.ind.eq('dup')) & (df['Date2'] > df['Date1']),'Drop','Keep')
>>> print(df.loc[df['ind'].eq('Keep')].drop('ind',axis=1))
   ID      Date1      Date2
1   1 2021-10-11 2021-01-12
2   2 2021-10-22 2021-03-12
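One thing worth checking when converting these columns is which part of the string is read as the day. The small sketch below only illustrates that point: it parses one of the question's strings under two explicit formats. The variable names are made up for the example.

```python
import pandas as pd

# One value taken from the question's Date1 column
s = pd.Series(["10-11-2021"])

# Read as day-month-year
day_first = pd.to_datetime(s, format="%d-%m-%Y")
# Read as month-day-year
month_first = pd.to_datetime(s, format="%m-%d-%Y")

print(day_first[0].date())    # 2021-11-10
print(month_first[0].date())  # 2021-10-11
```

Passing an explicit format removes the guesswork, so the later comparison between Date1 and Date2 is made on the dates you actually mean.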
CodePudding user response:
You can create a dummy variable Keep to compare the dates, create a mask for duplicated ID values, and use boolean indexing:
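import pandas as pd

df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
# Flag rows where Date2 is earlier than Date1
df['Keep'] = df['Date1'] > df['Date2']
# True for every row whose ID occurs more than once
mask = df['ID'].duplicated(keep=False)
# Duplicated IDs must pass the date check; unique IDs are always kept
mask = (mask & df['Keep']) | ~mask
out = df[mask].drop('Keep', axis=1)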
Output:
   ID      Date1      Date2
1   1 2021-10-11 2021-01-12
2   2 2021-10-22 2021-03-12
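For reference, here is a minimal self-contained sketch of the same idea that rebuilds the sample frame and leaves the original columns untouched. It rests on two assumptions that are not confirmed by the question: the strings are day-month-year (values such as 20-11-2021 and 22-10-2021 cannot be read any other way), and duplicated IDs are kept when Date1 is earlier than Date2. Read as day-month-year, that is the comparison that reproduces the question's expected output, and it is the one used in the question's own attempt. Flip the comparison if the opposite rule is intended.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Date1": ["7-12-2021", "10-11-2021", "22-10-2021"],
    "Date2": ["20-11-2021", "01-12-2021", "03-12-2021"],
})

# Assumption: the strings are day-month-year
date1 = pd.to_datetime(df["Date1"], format="%d-%m-%Y")
date2 = pd.to_datetime(df["Date2"], format="%d-%m-%Y")

# True for every row whose ID occurs more than once
is_duplicated = df["ID"].duplicated(keep=False)
# Assumption: a duplicated row is kept when Date1 is earlier than Date2
passes_check = date1 < date2

# Unique IDs are always kept; duplicated IDs must pass the date check
out = df[~is_duplicated | passes_check]
print(out)
```

Because the parsed dates live in separate Series, `out` keeps the original string columns and no helper column has to be dropped afterwards.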