Existing dataframe:
UserId Activity_date Status
A 27-01-2022 initiated
A 27-01-2022 In_process
A 27-01-2022 Success
A 28-01-2022 initiated
A 28-01-2022 Completed
A 30-01-2022 initiated
A 30-01-2022 failed
B 26-01-2022 initiated
B 26-01-2022 Completed
B 26-01-2022 Success
B 27-01-2022 initiated
Need to drop the entries where neither Success nor Completed appears for the same Activity_date. For example, for UserId A the Success status appears on 27-01-2022 and the Completed status appears on 28-01-2022, so the new dataframe should retain all rows for the Activity_dates 27-01-2022 and 28-01-2022.
Expected dataframe:
UserId Activity_date Status
A 27-01-2022 initiated
A 27-01-2022 In_process
A 27-01-2022 Success
A 28-01-2022 initiated
A 28-01-2022 Completed
B 26-01-2022 initiated
B 26-01-2022 Completed
B 26-01-2022 Success
I approached it by grouping on UserId and Activity_date and checking each group for a 'Success' or 'Completed' status, but got stuck on the dropping part.
CodePudding user response:
try:
df
UserId Activity_date Status
0 A 27-01-2022 initiated
1 A 27-01-2022 In_process
2 A 27-01-2022 Success
3 A 28-01-2022 initiated
4 A 28-01-2022 Completed
5 A 30-01-2022 initiated
6 A 30-01-2022 failed
7 B 26-01-2022 initiated
8 B 26-01-2022 Completed
9 B 26-01-2022 Success
10 B 27-01-2022 initiated
df1 = df.groupby(['UserId','Activity_date'])['Status'].apply(list).reset_index()
df1
UserId Activity_date Status
0 A 27-01-2022 [initiated, In_process, Success]
1 A 28-01-2022 [initiated, Completed]
2 A 30-01-2022 [initiated, failed]
3 B 26-01-2022 [initiated, Completed, Success]
4 B 27-01-2022 [initiated]
df1.loc[df1['Status'].map(lambda x: ("Success" in x) or ('Completed' in x))].explode('Status')
  UserId Activity_date     Status
0      A    27-01-2022  initiated
0      A    27-01-2022 In_process
0      A    27-01-2022    Success
1      A    28-01-2022  initiated
1      A    28-01-2022  Completed
3      B    26-01-2022  initiated
3      B    26-01-2022  Completed
3      B    26-01-2022    Success
Note the condition must be or, not and: a date qualifies as soon as either status appears, otherwise dates with only one of the two (e.g. A's 27-01-2022, which has Success but no Completed) would be dropped.
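Put together as a self-contained sketch (rebuilding the sample frame from the question; column and status names are taken from there):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "UserId": ["A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "Activity_date": ["27-01-2022", "27-01-2022", "27-01-2022",
                      "28-01-2022", "28-01-2022",
                      "30-01-2022", "30-01-2022",
                      "26-01-2022", "26-01-2022", "26-01-2022",
                      "27-01-2022"],
    "Status": ["initiated", "In_process", "Success",
               "initiated", "Completed",
               "initiated", "failed",
               "initiated", "Completed", "Success",
               "initiated"],
})

# Collect each day's statuses into a list per (UserId, Activity_date)
df1 = df.groupby(["UserId", "Activity_date"])["Status"].apply(list).reset_index()

# Keep only groups whose list contains Success or Completed,
# then explode the lists back to one row per status
out = (df1.loc[df1["Status"].map(lambda x: "Success" in x or "Completed" in x)]
          .explode("Status")
          .reset_index(drop=True))
print(out)
```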
CodePudding user response:
df being your dataframe, I suggest this code:
df1 = df[df.Status.isin(['Success','Completed'])][['UserId','Activity_date']].drop_duplicates()
df1.merge(df, how='left',on=['UserId','Activity_date'])
df1:
   UserId Activity_date
2       A    27-01-2022
4       A    28-01-2022
8       B    26-01-2022
CodePudding user response:
Here's a solution using df.pivot:
out = df.pivot(index=['UserId','Activity_date'],
columns='Status', values='Status')
out = df.set_index(['UserId','Activity_date'])\
.loc[out[['Completed','Success']].any(axis=1)].reset_index(drop=False)
print(out)
UserId Activity_date Status
0 A 27-01-2022 initiated
1 A 27-01-2022 In_process
2 A 27-01-2022 Success
3 A 28-01-2022 initiated
4 A 28-01-2022 Completed
5 B 26-01-2022 initiated
6 B 26-01-2022 Completed
7 B 26-01-2022 Success
Explanation:
- The pivot gives us one column per Status value for each (UserId, Activity_date) pair, holding either NaN or that status value (e.g. Success).
- Next, we use df.any on the ['Completed','Success'] columns to check whether at least one of them holds a non-NaN value. The result is a pd.Series of booleans.
- Finally, we use this series to select the matching rows from the original df, with UserId and Activity_date temporarily set as the index.
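End to end, the pivot-based version might look like this (rebuilding the sample frame from the question; `.notna()` is added here to make the NaN check explicit rather than relying on `any`'s NaN handling):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "UserId": list("AAAAAAABBBB"),
    "Activity_date": ["27-01-2022"] * 3 + ["28-01-2022"] * 2
                     + ["30-01-2022"] * 2 + ["26-01-2022"] * 3 + ["27-01-2022"],
    "Status": ["initiated", "In_process", "Success", "initiated", "Completed",
               "initiated", "failed", "initiated", "Completed", "Success",
               "initiated"],
})

# One column per Status value; NaN where that status is absent for the pair
wide = df.pivot(index=["UserId", "Activity_date"],
                columns="Status", values="Status")

# Boolean mask per (UserId, Activity_date): has a Completed or Success row?
mask = wide[["Completed", "Success"]].notna().any(axis=1)

# Align the mask against the original rows and filter
out = (df.set_index(["UserId", "Activity_date"])
         .loc[mask]
         .reset_index(drop=False))
print(out)
```

Note that df.pivot requires each (UserId, Activity_date, Status) combination to be unique, which holds for the sample data; with duplicate status rows per day you would need pivot_table or one of the groupby approaches above instead.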