How do I get only the rows that come after the 'click' in the column 'action_type' for each client_id the toy data.
df = pd.DataFrame({
'user_client_id': [1,1, 1, 1, 1,1, 1,1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
'timestamp':['2021-12-18 09:15:59', '2021-12-18 10:33:49', '2021-12-18 10:34:08',
'2021-12-18 10:34:09', '2021-12-18 10:57:02','2021-12-18 10:57:33','2021-12-18 10:58:01','2021-12-18 10:58:02','2021-12-18 10:58:17',
'2021-12-18 10:58:29','2021-12-18 10:58:31','2021-12-18 10:58:34', '2021-12-18 10:58:34','2021-12-18 10:58:47', '2021-12-18 10:59:12',
'2021-12-18 10:59:28','2021-12-18 10:59:35','2021-12-18 10:59:38','2021-12-18 11:05:13', '2021-12-18 11:05:58','2021-12-18 11:06:08','2021-12-18 11:06:10','2021-12-18 11:06:12','2021-12-18 11:07:42',
'2021-12-18 11:10:07','2021-12-18 11:10:23', '2021-12-18 11:10:53', '2021-12-18 11:10:58', '2021-12-18 11:13:04', '2021-12-18 11:13:06',
'2021-12-18 14:56:32','2021-12-18 17:16:40'],
'action_type ': ['to_cart','to_cart','to_cart','to_cart','click', 'to_cart', 'to_cart', 'increment', 'remove', 'to_cart', 'increment', 'click', 'to_cart', 'increment', 'to_cart', 'to_cart', 'remove', 'to_cart', 'increment', 'to_cart', 'to_cart', 'click', 'increment',
'to_cart', 'to_cart', 'to_cart', 'click', 'increment', 'to_cart', 'increment', 'to_cart', 'increment'] })
For the client with id 1 everything that comes before the click at 2021-12-18 10:57:02 should be filtered for the client with id 2 everything that comes before the click at 2021-12-18 11:06:10 should be filtered
I've tried this way, but it only works for the client 1, but doesn't work for the client 2
df.iloc[df.loc[df['action_type']=='click'].index[0]:,:]
CodePudding user response:
You can use a mask with groupby
and cummax
. This will set all values per group to True after the first "click'
m = (df['action_type'].eq('click')
.groupby(df['user_client_id'])
.cummax()
)
df[m]
Output:
user_client_id timestamp action_type
4 1 2021-12-18 10:57:02 click
5 1 2021-12-18 10:57:33 to_cart
6 1 2021-12-18 10:58:01 to_cart
7 1 2021-12-18 10:58:02 increment
8 1 2021-12-18 10:58:17 remove
9 1 2021-12-18 10:58:29 to_cart
10 1 2021-12-18 10:58:31 increment
11 1 2021-12-18 10:58:34 click
12 1 2021-12-18 10:58:34 to_cart
13 1 2021-12-18 10:58:47 increment
14 1 2021-12-18 10:59:12 to_cart
21 2 2021-12-18 11:06:10 click
22 2 2021-12-18 11:06:12 increment
23 2 2021-12-18 11:07:42 to_cart
24 2 2021-12-18 11:10:07 to_cart
25 2 2021-12-18 11:10:23 to_cart
26 2 2021-12-18 11:10:53 click
27 2 2021-12-18 11:10:58 increment
28 2 2021-12-18 11:13:04 to_cart
29 2 2021-12-18 11:13:06 increment
30 2 2021-12-18 14:56:32 to_cart
31 2 2021-12-18 17:16:40 increment
CodePudding user response:
Anytime that you say "per client", it's a good sign that you need a groupby
. As for filtering out the rows before the first click, you can count the cumulative number of clicks then get only rows where clicks > 0:
def filter(group):
click = group['action_type'].eq('click').cumsum()
return group[click > 0]
df.groupby('user_client_id').apply(filter).reset_index(level=0, drop=True)
CodePudding user response:
Use boolean mask
:
m = df.groupby('user_client_id')['action_type'] \
.apply(lambda x: x.eq('click').cumsum().astype(bool))
out = df[m]
Output:
>>> out
user_client_id timestamp action_type
4 1 2021-12-18 10:57:02 click
5 1 2021-12-18 10:57:33 to_cart
6 1 2021-12-18 10:58:01 to_cart
7 1 2021-12-18 10:58:02 increment
8 1 2021-12-18 10:58:17 remove
9 1 2021-12-18 10:58:29 to_cart
10 1 2021-12-18 10:58:31 increment
11 1 2021-12-18 10:58:34 click
12 1 2021-12-18 10:58:34 to_cart
13 1 2021-12-18 10:58:47 increment
14 1 2021-12-18 10:59:12 to_cart
21 2 2021-12-18 11:06:10 click
22 2 2021-12-18 11:06:12 increment
23 2 2021-12-18 11:07:42 to_cart
24 2 2021-12-18 11:10:07 to_cart
25 2 2021-12-18 11:10:23 to_cart
26 2 2021-12-18 11:10:53 click
27 2 2021-12-18 11:10:58 increment
28 2 2021-12-18 11:13:04 to_cart
29 2 2021-12-18 11:13:06 increment
30 2 2021-12-18 14:56:32 to_cart
31 2 2021-12-18 17:16:40 increment
Boolean mask:
>>> pd.concat([df, m], axis=1)
user_client_id timestamp action_type action_type
0 1 2021-12-18 09:15:59 to_cart False
1 1 2021-12-18 10:33:49 to_cart False
2 1 2021-12-18 10:34:08 to_cart False
3 1 2021-12-18 10:34:09 to_cart False
4 1 2021-12-18 10:57:02 click True
5 1 2021-12-18 10:57:33 to_cart True
6 1 2021-12-18 10:58:01 to_cart True
7 1 2021-12-18 10:58:02 increment True
8 1 2021-12-18 10:58:17 remove True
9 1 2021-12-18 10:58:29 to_cart True
10 1 2021-12-18 10:58:31 increment True
11 1 2021-12-18 10:58:34 click True
12 1 2021-12-18 10:58:34 to_cart True
13 1 2021-12-18 10:58:47 increment True
14 1 2021-12-18 10:59:12 to_cart True
15 2 2021-12-18 10:59:28 to_cart False
16 2 2021-12-18 10:59:35 remove False
17 2 2021-12-18 10:59:38 to_cart False
18 2 2021-12-18 11:05:13 increment False
19 2 2021-12-18 11:05:58 to_cart False
20 2 2021-12-18 11:06:08 to_cart False
21 2 2021-12-18 11:06:10 click True
22 2 2021-12-18 11:06:12 increment True
23 2 2021-12-18 11:07:42 to_cart True
24 2 2021-12-18 11:10:07 to_cart True
25 2 2021-12-18 11:10:23 to_cart True
26 2 2021-12-18 11:10:53 click True
27 2 2021-12-18 11:10:58 increment True
28 2 2021-12-18 11:13:04 to_cart True
29 2 2021-12-18 11:13:06 increment True
30 2 2021-12-18 14:56:32 to_cart True
31 2 2021-12-18 17:16:40 increment True