filtering the DataFrame based on condition with Date Column

Time:09-21

Existing Dataframe :

UserId     Activity_date     Status  
A            27-01-2022      initiated        
A            27-01-2022      In_process
A            27-01-2022      Success    
A            28-01-2022      initiated  
A            28-01-2022      Completed
A            30-01-2022      initiated
A            30-01-2022      failed
B            26-01-2022      initiated
B            26-01-2022      Completed
B            26-01-2022      Success
B            27-01-2022      initiated

Need to drop the entries where neither Success nor Completed appears for the same Activity_date. For example, for UserId A, the Success status appears on 27-01-2022 and the Completed status appears on 28-01-2022, so the new dataframe should keep all the rows for the Activity dates 27-01-2022 and 28-01-2022.

Expected Dataframe :

UserId     Activity_date     Status  
A            27-01-2022      initiated        
A            27-01-2022      In_process
A            27-01-2022      Success    
A            28-01-2022      initiated  
A            28-01-2022      Completed
B            26-01-2022      initiated
B            26-01-2022      Completed
B            26-01-2022      Success

I approached it by grouping by UserId and Activity_date and checking each group for a 'Success' or 'Completed' status, but I am stuck on the dropping part.
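For reference, the group-level check described above can be finished with `groupby` + `transform`, which broadcasts the per-group test back onto every row so it can be used as a boolean mask (a sketch, with the sample data from the question inlined so it runs standalone):

```python
import pandas as pd

df = pd.DataFrame({
    'UserId': list('AAAAAAABBBB'),
    'Activity_date': ['27-01-2022', '27-01-2022', '27-01-2022',
                      '28-01-2022', '28-01-2022',
                      '30-01-2022', '30-01-2022',
                      '26-01-2022', '26-01-2022', '26-01-2022',
                      '27-01-2022'],
    'Status': ['initiated', 'In_process', 'Success',
               'initiated', 'Completed',
               'initiated', 'failed',
               'initiated', 'Completed', 'Success',
               'initiated'],
})

# True for every row whose (UserId, Activity_date) group contains
# at least one 'Success' or 'Completed' status
keep = (df['Status'].isin(['Success', 'Completed'])
          .groupby([df['UserId'], df['Activity_date']])
          .transform('any'))

out = df[keep].reset_index(drop=True)
print(out)
```

This keeps the eight rows of the expected dataframe and drops the 30-01-2022 group for A and the 27-01-2022 group for B.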

CodePudding user response:

try:

df
    UserId  Activity_date   Status
0   A       27-01-2022  initiated
1   A       27-01-2022  In_process
2   A       27-01-2022  Success
3   A       28-01-2022  initiated
4   A       28-01-2022  Completed
5   A       30-01-2022  initiated
6   A       30-01-2022  failed
7   B       26-01-2022  initiated
8   B       26-01-2022  Completed
9   B       26-01-2022  Success
10  B       27-01-2022  initiated

df1 = df.groupby(['UserId','Activity_date'])['Status'].apply(list).reset_index()

df1
    UserId  Activity_date   Status
0   A       27-01-2022      [initiated, In_process, Success]
1   A       28-01-2022      [initiated, Completed]
2   A       30-01-2022      [initiated, failed]
3   B       26-01-2022      [initiated, Completed, Success]
4   B       27-01-2022      [initiated]

df1.loc[df1['Status'].map(lambda x: ("Success" in x) or ('Completed' in x))].explode('Status')
    UserId  Activity_date   Status
0   A       27-01-2022      initiated
0   A       27-01-2022      In_process
0   A       27-01-2022      Success
1   A       28-01-2022      initiated
1   A       28-01-2022      Completed
3   B       26-01-2022      initiated
3   B       26-01-2022      Completed
3   B       26-01-2022      Success
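The intermediate list-building step can also be skipped entirely with `groupby(...).filter(...)`, which keeps whole groups that satisfy a predicate (a standalone sketch with the sample data inlined):

```python
import pandas as pd

df = pd.DataFrame({
    'UserId': list('AAAAAAABBBB'),
    'Activity_date': ['27-01-2022', '27-01-2022', '27-01-2022',
                      '28-01-2022', '28-01-2022',
                      '30-01-2022', '30-01-2022',
                      '26-01-2022', '26-01-2022', '26-01-2022',
                      '27-01-2022'],
    'Status': ['initiated', 'In_process', 'Success',
               'initiated', 'Completed',
               'initiated', 'failed',
               'initiated', 'Completed', 'Success',
               'initiated'],
})

# filter() drops every (UserId, Activity_date) group whose Status
# column contains neither 'Success' nor 'Completed'
out = (df.groupby(['UserId', 'Activity_date'])
         .filter(lambda g: g['Status'].isin(['Success', 'Completed']).any()))
print(out)
```

Unlike the `apply(list)` + `explode` route, `filter` preserves the original row order and index.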

CodePudding user response:

With df being your dataframe, I suggest this code:

df1 = df[df.Status.isin(['Success','Completed'])][['UserId','Activity_date']].drop_duplicates()
df1.merge(df, how='left',on=['UserId','Activity_date'])

df1:

        UserId  Activity_date
2   A   2022-01-27
4   A   2022-01-28
8   B   2022-01-26
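The same semi-join can be written without a merge: collect the qualifying (UserId, Activity_date) pairs and test membership against them with `Index.isin` (a standalone sketch with the sample data inlined; the date strings are kept as-is rather than parsed):

```python
import pandas as pd

df = pd.DataFrame({
    'UserId': list('AAAAAAABBBB'),
    'Activity_date': ['27-01-2022', '27-01-2022', '27-01-2022',
                      '28-01-2022', '28-01-2022',
                      '30-01-2022', '30-01-2022',
                      '26-01-2022', '26-01-2022', '26-01-2022',
                      '27-01-2022'],
    'Status': ['initiated', 'In_process', 'Success',
               'initiated', 'Completed',
               'initiated', 'failed',
               'initiated', 'Completed', 'Success',
               'initiated'],
})

# (UserId, Activity_date) pairs that contain a Success or Completed row
df1 = (df[df.Status.isin(['Success', 'Completed'])]
         [['UserId', 'Activity_date']]
         .drop_duplicates())
keys = list(df1.itertuples(index=False, name=None))

# keep only the rows whose key pair appears in keys
mask = df.set_index(['UserId', 'Activity_date']).index.isin(keys)
out = df[mask]
print(out)
```

This avoids materialising the merged frame and keeps the original index.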

CodePudding user response:

Here's a solution using df.pivot.

out = df.pivot(index=['UserId','Activity_date'], 
               columns='Status', values='Status')

out = df.set_index(['UserId','Activity_date'])\
    .loc[out[['Completed','Success']].any(axis=1)].reset_index(drop=False)

print(out)

  UserId Activity_date      Status
0      A    27-01-2022   initiated
1      A    27-01-2022  In_process
2      A    27-01-2022     Success
3      A    28-01-2022   initiated
4      A    28-01-2022   Completed
5      B    26-01-2022   initiated
6      B    26-01-2022   Completed
7      B    26-01-2022     Success

Explanation:

  • The pivot gets us a column for each Status value per (UserId, Activity_date) pair, holding either NaN or the status string itself (e.g. Success).
  • Next, we use .any(axis=1) on the ['Completed','Success'] columns to check whether at least one of them contains a non-NaN value. The result is a pd.Series of booleans.
  • Finally, we use this series to select the correct rows from the original df, with UserId, Activity_date temporarily set as the index.
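A note on the `any` step: it relies on the non-NaN cells being truthy strings. Making that explicit with `.notna()` reads more clearly and is robust if the values column ever changes (a standalone variant of the pivot approach, with the sample data inlined):

```python
import pandas as pd

df = pd.DataFrame({
    'UserId': list('AAAAAAABBBB'),
    'Activity_date': ['27-01-2022', '27-01-2022', '27-01-2022',
                      '28-01-2022', '28-01-2022',
                      '30-01-2022', '30-01-2022',
                      '26-01-2022', '26-01-2022', '26-01-2022',
                      '27-01-2022'],
    'Status': ['initiated', 'In_process', 'Success',
               'initiated', 'Completed',
               'initiated', 'failed',
               'initiated', 'Completed', 'Success',
               'initiated'],
})

# one column per Status value, NaN where that status is absent for the pair
wide = df.pivot(index=['UserId', 'Activity_date'],
                columns='Status', values='Status')

# explicit: at least one of Completed/Success is filled for the pair
mask = wide[['Completed', 'Success']].notna().any(axis=1)

out = (df.set_index(['UserId', 'Activity_date'])
         .loc[mask]
         .reset_index())
print(out)
```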