Say I have the following data frame:
import pandas as pd
d = {'id': [1, 2, 3, 3, 3, 2, 2, 1, 2, 3, 2, 3],
'date': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
'product': ['a', 'a', 'b', 'a', 'b', 'a', 'b', 'c', 'b', 'c', 'c', 'c']}
df = pd.DataFrame(d)
I want to keep all data for each ID the day of and after they bought product 'b' and get rid of all data before they bought product 'b'. ID 1 would have no data because they did not purchase the product, ID 2 would have data for the 3rd and 4th day, and ID 3 would have data for days 1-4.
I know that I could groupby id and then filter rows from individual groups but I can't figure out how to make the filter dynamic based on the group. I've tried looping through the groups but it's slow (right now I have 19,000 IDs but it'll only grow as I continue the project).
Any help would be greatly appreciated. Thank you!
CodePudding user response:
You can select the products "b" with eq
and set the successive rows to True per group using groupby
cummax
. Then slice the dataframe
df[df['product'].eq('b').groupby(df['id']).cummax()]
output:
id date product
2 3 1 b
3 3 2 a
4 3 2 b
6 2 3 b
8 2 3 b
9 3 3 c
10 2 4 c
11 3 4 c
NB. this assumes the dataframe is ordered by date. If not use sort_values(by='date')
(or by=['group', 'date']
)
CodePudding user response:
Here's another solution: groupby
custom function
def get_data(x):
idx = np.where(x['product']=='b')[0]
if idx.size:
return x[idx[0]:]
out = df.groupby('id').apply(get_data).droplevel(0).sort_index()
Output:
id date product
2 3 1 b
3 3 2 a
4 3 2 b
6 2 3 b
8 2 3 b
9 3 3 c
10 2 4 c
11 3 4 c