I have a df sorted by person and time. The index is not duplicated, nor is it continuous from 0. I check the difference in time against a threshold depending on row above
person time_bought product
42 abby 2:21 fruit
12 abby 2:55 fruit
10 abby 10:35 other
3 barry 12:00 fruit
...
thresh = {'fruit': pd.Timedelta('10min'), 'other': pd.Timedelta('2min')}
# map custom threshold based on previous row product
ref = df.groupby('person')['product'].shift().map(thresh)
I don't understand why m1 does not retain df's index: I get a sorted index from the lowest index value onwards.
# compare each delta to the custom threshold.
m1 = df.loc[df.product=="fruit", 'time_bought'].groupby(df['person']).diff().gt(ref)
3 False
4 False
If I remove .gt(ref)
, I only see the filtered rows and original index is retained.
df.loc[df.product=="fruit", 'time_bought'].groupby(df['person']).diff()
42 NaT
12 0 Days 00:34:00
...
A sorted index messes up my next line:m1.cumsum()
CodePudding user response:
Use:
df['time_bought'] = pd.to_timedelta('00:' df['time_bought'])
Idea is not filter rows, but set NaT
to unmatched rows:
print (df['time_bought'].where(df['product']=="fruit", None))
42 0 days 00:02:21
12 0 days 00:02:55
10 NaT
3 0 days 00:12:00
Name: time_bought, dtype: timedelta64[ns]
So original order is not changed:
m1 = (df['time_bought'].where(df['product']=="fruit", None)
.groupby(df['person']).diff().gt(ref))
print (m1)
42 False
12 False
10 False
3 False
dtype: bool
CodePudding user response:
Introductory remarks:
Unreliable assumption about the previous row. Are you relying on time, despite the indexes in reverse order?
df.product
is not column as you hope, but a method of DataFrame object.I guess converting to Datetime for a time column:
df['time_bought'] = pd.to_datetime(df['time_bought'], format='%H:%M')
In general you want to do per-person analysis, so do not repeat grouping in your code, but operate on a single group a time.
GroupBy iterator is just a sequence of subtables for an every person. Do your operations on a "personal" table. To begin with, practice on a subsample...
_df = df[df['person']=='abby']
ref = _df['product'].shift().map(thresh).fillna(thresh['other'])
ref
42 0 days 00:02:00
12 0 days 00:10:00
10 0 days 00:10:00
m1 = _df.loc[_df['product']=="fruit", 'time_bought'].diff().gt(ref)
m1.sum()
1
Is it all right?
So the concise solution is:
def count_purchases(_df, product='other'):
ref = _df['product'].shift().map(thresh).fillna(thresh['other'])
m1 = _df.loc[_df['product']==product, 'time_bought'].diff().gt(ref)
return m1.sum()
df.groupby('person').apply(count_purchases, 'fruit')
person
abby 1
barry 0
dtype: int64