Home > Back-end >  Pandas keep the latest rows for the same ID with some conditional column values
Pandas keep the latest rows for the same ID with some conditional column values

Time:12-14

I want to keep the latest rows with the same ID and also the rows that match certain column values. Sample Input:

ID          Timestamp       Survey Outcome
12          11/26/2021      INCOMPLETE Survey
95          11/26/2021      INCOMPLETE Survey
95          11/27/2021      COMPLETE Survey
95          11/28/2021      RANG-But did not connect
12          11/29/2021      COMPLETE Survey
24          11/26/2021      RANG-But did not connect
24          11/27/2021      INCOMPLETE Survey
95          11/28/2021      RANG-But did not connect
24          11/28/2021      INCOMPLETE Survey

Here ID 12 has two values, so I'll keep the latest(11/29/2021) row. But for ID 95, once the survey is complete it can't have any other options like rang-but did not connect. So I want to keep the latest timestamps data and also keep those rows where once the data is complete survey but the latest data shows incomplete survey or did not connect (all data after seeing COMPLETE SURVEY).

So my sample output will be:

ID          Timestamp       Survey Outcome
95          11/27/2021      COMPLETE Survey
95          11/28/2021      RANG-But did not connect
12          11/29/2021      COMPLETE Survey
95          11/28/2021      RANG-But did not connect
24          11/28/2021      INCOMPLETE Survey```


CodePudding user response:

You can use:

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df.sort_values(by=['ID', 'Timestamp']).reset_index(drop=True, inplace=True)
df = df.groupby('ID').apply(lambda x: x.loc[x[x['Survey Outcome'] == 'COMPLETE Survey'].index[0]: ] if
                            x['Survey Outcome'].isin(['COMPLETE Survey']).any() else x.loc[x['Timestamp'].idxmax():]).reset_index(drop=True)
print(df)

OUTPUT

   ID  Timestamp            Survey Outcome
0  12 2021-11-29           COMPLETE Survey
1  24 2021-11-28         INCOMPLETE Survey
2  95 2021-11-27           COMPLETE Survey
3  95 2021-11-28  RANG-But did not connect
4  95 2021-11-28  RANG-But did not connect

CodePudding user response:

Use DataFrame.sort_values by ID and Timestamp first, then use GroupBy.cummax for all values after COMPLETE Survey and add last ID not matched by isin with DataFrame.drop_duplicates:

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.sort_values(['ID','Timestamp'])

m = df['Survey Outcome'].eq('COMPLETE Survey')

df1 = df[m.groupby(df['ID']).cummax()]
df2 = df.drop_duplicates('ID', keep='last')

df = df1.append(df2[~df2['ID'].isin(df1['ID'])]).sort_index()

print (df)
   ID  Timestamp            Survey Outcome
2  95 2021-11-27           COMPLETE Survey
3  95 2021-11-28  RANG-But did not connect
4  12 2021-11-29           COMPLETE Survey
7  95 2021-11-28  RANG-But did not connect
8  24 2021-11-28         INCOMPLETE Survey
  • Related