I want to keep the latest rows with the same ID and also the rows that match certain column values. Sample Input:
ID Timestamp Survey Outcome
12 11/26/2021 INCOMPLETE Survey
95 11/26/2021 INCOMPLETE Survey
95 11/27/2021 COMPLETE Survey
95 11/28/2021 RANG-But did not connect
12 11/29/2021 COMPLETE Survey
24 11/26/2021 RANG-But did not connect
24 11/27/2021 INCOMPLETE Survey
95 11/28/2021 RANG-But did not connect
24 11/28/2021 INCOMPLETE Survey
Here ID 12 has two values, so I'll keep the latest(11/29/2021) row. But for ID 95, once the survey is complete it can't have any other options like rang-but did not connect. So I want to keep the latest timestamps data and also keep those rows where once the data is complete survey but the latest data shows incomplete survey or did not connect (all data after seeing COMPLETE SURVEY).
So my sample output will be:
ID Timestamp Survey Outcome
95 11/27/2021 COMPLETE Survey
95 11/28/2021 RANG-But did not connect
12 11/29/2021 COMPLETE Survey
95 11/28/2021 RANG-But did not connect
24 11/28/2021 INCOMPLETE Survey```
CodePudding user response:
You can use:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df.sort_values(by=['ID', 'Timestamp']).reset_index(drop=True, inplace=True)
df = df.groupby('ID').apply(lambda x: x.loc[x[x['Survey Outcome'] == 'COMPLETE Survey'].index[0]: ] if
x['Survey Outcome'].isin(['COMPLETE Survey']).any() else x.loc[x['Timestamp'].idxmax():]).reset_index(drop=True)
print(df)
OUTPUT
ID Timestamp Survey Outcome
0 12 2021-11-29 COMPLETE Survey
1 24 2021-11-28 INCOMPLETE Survey
2 95 2021-11-27 COMPLETE Survey
3 95 2021-11-28 RANG-But did not connect
4 95 2021-11-28 RANG-But did not connect
CodePudding user response:
Use DataFrame.sort_values
by ID
and Timestamp
first, then use GroupBy.cummax
for all values after COMPLETE Survey
and add last ID
not matched by isin
with DataFrame.drop_duplicates
:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.sort_values(['ID','Timestamp'])
m = df['Survey Outcome'].eq('COMPLETE Survey')
df1 = df[m.groupby(df['ID']).cummax()]
df2 = df.drop_duplicates('ID', keep='last')
df = df1.append(df2[~df2['ID'].isin(df1['ID'])]).sort_index()
print (df)
ID Timestamp Survey Outcome
2 95 2021-11-27 COMPLETE Survey
3 95 2021-11-28 RANG-But did not connect
4 12 2021-11-29 COMPLETE Survey
7 95 2021-11-28 RANG-But did not connect
8 24 2021-11-28 INCOMPLETE Survey