Home > OS >  Python Groupby Separate Filter for Each Group
Python Groupby Separate Filter for Each Group

Time:12-23

Say I have the following data frame:

import pandas as pd

d = {'id':  [1, 2, 3, 3, 3, 2, 2, 1, 2, 3, 2, 3],
 'date': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
 'product': ['a', 'a', 'b', 'a', 'b', 'a', 'b', 'c', 'b', 'c', 'c', 'c']}

df = pd.DataFrame(d)

I want to keep all data for each ID the day of and after they bought product 'b' and get rid of all data before they bought product 'b'. ID 1 would have no data because they did not purchase the product, ID 2 would have data for the 3rd and 4th day, and ID 3 would have data for days 1-4.

I know that I could groupby id and then filter rows from individual groups but I can't figure out how to make the filter dynamic based on the group. I've tried looping through the groups but it's slow (right now I have 19,000 IDs but it'll only grow as I continue the project).

Any help would be greatly appreciated. Thank you!

CodePudding user response:

You can select the products "b" with eq and set the successive rows to True per group using groupby cummax. Then slice the dataframe

df[df['product'].eq('b').groupby(df['id']).cummax()]

output:

    id  date product
2    3     1       b
3    3     2       a
4    3     2       b
6    2     3       b
8    2     3       b
9    3     3       c
10   2     4       c
11   3     4       c

NB. this assumes the dataframe is ordered by date. If not use sort_values(by='date') (or by=['group', 'date'])

CodePudding user response:

Here's another solution: groupby custom function

def get_data(x):
    idx = np.where(x['product']=='b')[0]
    if idx.size:
        return x[idx[0]:]

out = df.groupby('id').apply(get_data).droplevel(0).sort_index()

Output:

    id  date product
2    3     1       b
3    3     2       a
4    3     2       b
6    2     3       b
8    2     3       b
9    3     3       c
10   2     4       c
11   3     4       c
  • Related