Home > Software design >  how do I filter rows that come before the row that contains certain value for each group in datafram
how do I filter rows that come before the row that contains certain value for each group in datafram

Time:12-27

How do I get only the rows that come after the 'click' in the column 'action_type' for each client_id the toy data.

df = pd.DataFrame({
  'user_client_id': [1,1, 1, 1, 1,1, 1,1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
   'timestamp':['2021-12-18 09:15:59', '2021-12-18 10:33:49', '2021-12-18 10:34:08',
'2021-12-18 10:34:09', '2021-12-18 10:57:02','2021-12-18 10:57:33','2021-12-18 10:58:01','2021-12-18 10:58:02','2021-12-18 10:58:17',
'2021-12-18 10:58:29','2021-12-18 10:58:31','2021-12-18 10:58:34', '2021-12-18 10:58:34','2021-12-18 10:58:47', '2021-12-18 10:59:12',
'2021-12-18 10:59:28','2021-12-18 10:59:35','2021-12-18 10:59:38','2021-12-18 11:05:13', '2021-12-18 11:05:58','2021-12-18 11:06:08','2021-12-18 11:06:10','2021-12-18 11:06:12','2021-12-18 11:07:42',
 '2021-12-18 11:10:07','2021-12-18 11:10:23', '2021-12-18 11:10:53', '2021-12-18 11:10:58', '2021-12-18 11:13:04', '2021-12-18 11:13:06',
'2021-12-18 14:56:32','2021-12-18 17:16:40'],
'action_type ': ['to_cart','to_cart','to_cart','to_cart','click', 'to_cart', 'to_cart', 'increment', 'remove', 'to_cart', 'increment', 'click', 'to_cart', 'increment', 'to_cart', 'to_cart', 'remove', 'to_cart', 'increment', 'to_cart', 'to_cart', 'click', 'increment',
 'to_cart', 'to_cart', 'to_cart', 'click', 'increment', 'to_cart', 'increment', 'to_cart', 'increment'] })

For the client with id 1 everything that comes before the click at 2021-12-18 10:57:02 should be filtered for the client with id 2 everything that comes before the click at 2021-12-18 11:06:10 should be filtered

I've tried this way, but it only works for the client 1, but doesn't work for the client 2

df.iloc[df.loc[df['action_type']=='click'].index[0]:,:]

CodePudding user response:

You can use a mask with groupby and cummax. This will set all values per group to True after the first "click'

m = (df['action_type'].eq('click')
       .groupby(df['user_client_id'])
       .cummax()
     )

df[m]

Output:

    user_client_id            timestamp action_type
4                1  2021-12-18 10:57:02       click
5                1  2021-12-18 10:57:33     to_cart
6                1  2021-12-18 10:58:01     to_cart
7                1  2021-12-18 10:58:02   increment
8                1  2021-12-18 10:58:17      remove
9                1  2021-12-18 10:58:29     to_cart
10               1  2021-12-18 10:58:31   increment
11               1  2021-12-18 10:58:34       click
12               1  2021-12-18 10:58:34     to_cart
13               1  2021-12-18 10:58:47   increment
14               1  2021-12-18 10:59:12     to_cart
21               2  2021-12-18 11:06:10       click
22               2  2021-12-18 11:06:12   increment
23               2  2021-12-18 11:07:42     to_cart
24               2  2021-12-18 11:10:07     to_cart
25               2  2021-12-18 11:10:23     to_cart
26               2  2021-12-18 11:10:53       click
27               2  2021-12-18 11:10:58   increment
28               2  2021-12-18 11:13:04     to_cart
29               2  2021-12-18 11:13:06   increment
30               2  2021-12-18 14:56:32     to_cart
31               2  2021-12-18 17:16:40   increment

CodePudding user response:

Anytime that you say "per client", it's a good sign that you need a groupby. As for filtering out the rows before the first click, you can count the cumulative number of clicks then get only rows where clicks > 0:

def filter(group):
    click = group['action_type'].eq('click').cumsum()
    return group[click > 0]

df.groupby('user_client_id').apply(filter).reset_index(level=0, drop=True)

CodePudding user response:

Use boolean mask:

m = df.groupby('user_client_id')['action_type'] \
      .apply(lambda x: x.eq('click').cumsum().astype(bool))

out = df[m]

Output:

>>> out
    user_client_id            timestamp action_type
4                1  2021-12-18 10:57:02       click
5                1  2021-12-18 10:57:33     to_cart
6                1  2021-12-18 10:58:01     to_cart
7                1  2021-12-18 10:58:02   increment
8                1  2021-12-18 10:58:17      remove
9                1  2021-12-18 10:58:29     to_cart
10               1  2021-12-18 10:58:31   increment
11               1  2021-12-18 10:58:34       click
12               1  2021-12-18 10:58:34     to_cart
13               1  2021-12-18 10:58:47   increment
14               1  2021-12-18 10:59:12     to_cart
21               2  2021-12-18 11:06:10       click
22               2  2021-12-18 11:06:12   increment
23               2  2021-12-18 11:07:42     to_cart
24               2  2021-12-18 11:10:07     to_cart
25               2  2021-12-18 11:10:23     to_cart
26               2  2021-12-18 11:10:53       click
27               2  2021-12-18 11:10:58   increment
28               2  2021-12-18 11:13:04     to_cart
29               2  2021-12-18 11:13:06   increment
30               2  2021-12-18 14:56:32     to_cart
31               2  2021-12-18 17:16:40   increment

Boolean mask:

>>> pd.concat([df, m], axis=1)
    user_client_id            timestamp  action_type  action_type
0                1  2021-12-18 09:15:59      to_cart        False
1                1  2021-12-18 10:33:49      to_cart        False
2                1  2021-12-18 10:34:08      to_cart        False
3                1  2021-12-18 10:34:09      to_cart        False
4                1  2021-12-18 10:57:02        click         True
5                1  2021-12-18 10:57:33      to_cart         True
6                1  2021-12-18 10:58:01      to_cart         True
7                1  2021-12-18 10:58:02    increment         True
8                1  2021-12-18 10:58:17       remove         True
9                1  2021-12-18 10:58:29      to_cart         True
10               1  2021-12-18 10:58:31    increment         True
11               1  2021-12-18 10:58:34        click         True
12               1  2021-12-18 10:58:34      to_cart         True
13               1  2021-12-18 10:58:47    increment         True
14               1  2021-12-18 10:59:12      to_cart         True
15               2  2021-12-18 10:59:28      to_cart        False
16               2  2021-12-18 10:59:35       remove        False
17               2  2021-12-18 10:59:38      to_cart        False
18               2  2021-12-18 11:05:13    increment        False
19               2  2021-12-18 11:05:58      to_cart        False
20               2  2021-12-18 11:06:08      to_cart        False
21               2  2021-12-18 11:06:10        click         True
22               2  2021-12-18 11:06:12    increment         True
23               2  2021-12-18 11:07:42      to_cart         True
24               2  2021-12-18 11:10:07      to_cart         True
25               2  2021-12-18 11:10:23      to_cart         True
26               2  2021-12-18 11:10:53        click         True
27               2  2021-12-18 11:10:58    increment         True
28               2  2021-12-18 11:13:04      to_cart         True
29               2  2021-12-18 11:13:06    increment         True
30               2  2021-12-18 14:56:32      to_cart         True
31               2  2021-12-18 17:16:40    increment         True

  • Related