My idea is to make a combination of userid
and shopid
where their status is cancelled
, and filter the original data frame using this combination, but is it possible to do that?
df_final1[['userid', 'shopid', 'status',
'created_time', 'cancel_time']].head()
userid shopid status created_time \
1254 248390148 631980847 Paid 2019-01-10 01:08:00
2220 362408494 720477845 Paid 2019-01-10 01:19:00
4843 665442742 724056488 Paid 2019-01-10 01:25:00
378 144816680 550081468 Paid 2019-01-10 02:33:00
1526 278521124 240583029 Paid then Cancelled 2019-01-10 02:37:00
cancel_time
1254 NaT
2220 NaT
4843 NaT
378 NaT
1526 2019-10-02 13:26:00
df_final1[['userid', 'shopid', 'status',
'created_time', 'cancel_time']].loc[df_final1['status']
.str.contains('cancelled', case=False)].head()
userid shopid status created_time \
1526 278521124 240583029 Paid then Cancelled 2019-01-10 02:37:00
1918 326681402 429387935 Unpaid and Cancelled 2019-01-10 05:05:00
6233 818601412 439306870 Paid then Cancelled 2019-01-10 05:45:00
3485 502346098 843003698 Unpaid and Cancelled 2019-01-10 08:04:00
2682 414657587 998402920 Unpaid and Cancelled 2019-01-10 08:20:00
cancel_time
1526 2019-10-02 13:26:00
1918 2019-10-02 17:09:00
6233 2019-10-03 23:32:00
3485 2019-10-01 20:29:00
2682 2019-10-02 12:59:00
Okay so after some tries i got this, so now the problem is how should i filter the users who have repurchased from the same shop after cancelled?
print(repurchase_true[['userid', 'shopid', 'status']])
userid shopid status
154 555157016 469831329 Paid
715 690156368 954422341 Paid
791 650971451 624341114 Paid
2036 195170292 173105930 Paid then Cancelled
2040 195170292 173105930 Paid then Cancelled
2777 650971451 624341114 Unpaid and Cancelled
3587 652903759 315658623 Paid
3749 650971451 624341114 Paid then Cancelled
3961 106380451 383755396 Paid
4065 114564157 788847405 Paid then Cancelled
4068 650971451 579205102 Paid
4117 652903759 771950408 Unpaid and Cancelled
4345 106380451 383755396 Paid then Cancelled
4384 106380451 383755396 Paid
4608 690156368 954422341 Paid
4750 555157016 469831329 Unpaid and Cancelled
4887 108425329 626078476 Paid then Cancelled
4923 108425329 626078476 Paid
5099 195170292 173105930 Paid then Cancelled
5198 481323800 154396809 Unpaid and Cancelled
5227 555157016 843003698 Paid
5288 481323800 154396809 Paid
5577 195170292 843114676 Paid
5675 195170292 843114676 Paid
6190 652903759 315658623 Unpaid and Cancelled
6239 481323800 154396809 Paid
6395 555157016 843003698 Paid then Cancelled
6634 690156368 954422341 Unpaid and Cancelled
6934 114564157 788847405 Paid then Cancelled
CodePudding user response:
Try this:
Creating a dataframe with status as Cancelled (as done by you):
df_cancel = df[['userid', 'shopid', 'status',
'created_time', 'cancel_time']].loc[df['status']
.str.contains('cancelled', case=False)]
Similarly, creating a df with status as not cancelled (here I used cancel_time
to get this, you may use anything that works with you):
df_active = df[['userid', 'shopid', 'status',
'created_time', 'cancel_time']].loc[df['cancel_time'] == 'NaT']
Check if everything is good using df_cancel.head()
and df_active.head()
Renaming the columns, so that it would be easy to filter:
df_cancel.rename(columns = {'status':'c_status', 'created_time':'c_created_time',
'cancel_time':'c_cancel_time'}, inplace = True)
df_active.rename(columns = {'status':'a_status', 'created_time':'a_created_time',
'cancel_time':'a_cancel_time'}, inplace = True)
Merging both the dataframes on userid
and shopid
:
merged = pd.merge(df_active, df_cancel, how="inner", left_on=['userid', 'shopid'], right_on=['userid', 'shopid']);
Filtering the merged dataframe to find purchase date after the cancel_date
which is the desired result:
df_final = merged[['userid', 'shopid', 'a_status',
'a_created_time', 'a_cancel_time']].loc[merged['a_created_time'] > merged['c_cancel_time']]
df_final.head()
Note: You may add more columns to the df_final
or alter any conditions used above based on your requirement.