Pandas index is sorting on its own-CodePudding

I have a df sorted by person and time. The index is not duplicated, nor is it continuous from 0. I check the difference in time against a threshold depending on row above

   person   time_bought  product    
42 abby     2:21         fruit        
12 abby     2:55         fruit      
10 abby     10:35        other
3  barry    12:00        fruit      
...

thresh = {'fruit': pd.Timedelta('10min'), 'other': pd.Timedelta('2min')}
# map custom threshold based on previous row product
ref = df.groupby('person')['product'].shift().map(thresh)

I don't understand why m1 does not retain df's index: I get a sorted index from the lowest index value onwards.

# compare each delta to the custom threshold. 
m1 = df.loc[df.product=="fruit", 'time_bought'].groupby(df['person']).diff().gt(ref)

3 False
4 False

If I remove .gt(ref), I only see the filtered rows and original index is retained.

df.loc[df.product=="fruit", 'time_bought'].groupby(df['person']).diff()

42 NaT
12 0 Days 00:34:00
...

A sorted index messes up my next line:m1.cumsum()

CodePudding user response：

Use:

df['time_bought'] = pd.to_timedelta('00:'   df['time_bought'])

Idea is not filter rows, but set NaT to unmatched rows:

print (df['time_bought'].where(df['product']=="fruit", None))
42   0 days 00:02:21
12   0 days 00:02:55
10               NaT
3    0 days 00:12:00
Name: time_bought, dtype: timedelta64[ns]

So original order is not changed:

m1 = (df['time_bought'].where(df['product']=="fruit", None)
                       .groupby(df['person']).diff().gt(ref))
print (m1)
42    False
12    False
10    False
3     False
dtype: bool

CodePudding user response：

Introductory remarks:

Unreliable assumption about the previous row. Are you relying on time, despite the indexes in reverse order?
df.product is not column as you hope, but a method of DataFrame object.
I guess converting to Datetime for a time column:

df['time_bought'] = pd.to_datetime(df['time_bought'], format='%H:%M')

In general you want to do per-person analysis, so do not repeat grouping in your code, but operate on a single group a time.

GroupBy iterator is just a sequence of subtables for an every person. Do your operations on a "personal" table. To begin with, practice on a subsample...

_df = df[df['person']=='abby']

ref = _df['product'].shift().map(thresh).fillna(thresh['other'])
ref

42   0 days 00:02:00
12   0 days 00:10:00
10   0 days 00:10:00

m1 = _df.loc[_df['product']=="fruit", 'time_bought'].diff().gt(ref)
m1.sum()

1

Is it all right?

So the concise solution is:

def count_purchases(_df, product='other'):
    ref = _df['product'].shift().map(thresh).fillna(thresh['other'])
    m1 = _df.loc[_df['product']==product, 'time_bought'].diff().gt(ref)
    return m1.sum()

df.groupby('person').apply(count_purchases, 'fruit')

person
abby     1
barry    0
dtype: int64