Python Groupby Separate Filter for Each Group-CodePudding

Say I have the following data frame:

import pandas as pd

d = {'id':  [1, 2, 3, 3, 3, 2, 2, 1, 2, 3, 2, 3],
 'date': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
 'product': ['a', 'a', 'b', 'a', 'b', 'a', 'b', 'c', 'b', 'c', 'c', 'c']}

df = pd.DataFrame(d)

I want to keep all data for each ID the day of and after they bought product 'b' and get rid of all data before they bought product 'b'. ID 1 would have no data because they did not purchase the product, ID 2 would have data for the 3rd and 4th day, and ID 3 would have data for days 1-4.

I know that I could groupby id and then filter rows from individual groups but I can't figure out how to make the filter dynamic based on the group. I've tried looping through the groups but it's slow (right now I have 19,000 IDs but it'll only grow as I continue the project).

Any help would be greatly appreciated. Thank you!

CodePudding user response：

You can select the products "b" with eq and set the successive rows to True per group using groupby cummax. Then slice the dataframe

df[df['product'].eq('b').groupby(df['id']).cummax()]

output:

    id  date product
2    3     1       b
3    3     2       a
4    3     2       b
6    2     3       b
8    2     3       b
9    3     3       c
10   2     4       c
11   3     4       c

NB. this assumes the dataframe is ordered by date. If not use sort_values(by='date') (or by=['group', 'date'])

CodePudding user response：

Here's another solution: groupby custom function

def get_data(x):
    idx = np.where(x['product']=='b')[0]
    if idx.size:
        return x[idx[0]:]

out = df.groupby('id').apply(get_data).droplevel(0).sort_index()

Output:

    id  date product
2    3     1       b
3    3     2       a
4    3     2       b
6    2     3       b
8    2     3       b
9    3     3       c
10   2     4       c
11   3     4       c