Calculate previous row within distinct groups

For every person and each change in fruit, how can I create a boolean column that shows whether, for that person and group, the purchased column has been filled in at some point in the rows above?

df
   person  fruit    purchased  time     has_purchased_already_filled_in_for_the_group
0  amy     apple     stall     10:00    False (this is start of new group so nothing above)
1  amy     apple     counter   10:01    True (because stall been filled at 10:00)
2  amy     apple     store     10:01    True  (because stall and counter been filled above)
3  amy     banana    online    10:02    False (this is start of new group so nothing above)
4  amy     banana              10:03    True (because online filled)
5  amy     apple               10:04    True  (this is start of new group so nothing above)        
6  amy     apple   inperson    10:05    False (because the 10.04 apple purchase is not filled in)
7  ben  ...

I'm struggling to tell Python that the first and last apple groups are distinct because bananas were bought in between.

CodePudding user response：

Is there a problem with using a plain loop for this? The version below still needs an extra check for when the person changes. You could also wrap the logic in a function and use it with df.apply or np.vectorize.

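import pandas as pd

dest_col = 'has_purchased'  # output column name
df[dest_col] = False
gr = None      # fruit of the run currently being scanned
seen = False   # True once 'purchased' was filled earlier in this run
for ind, val in df['fruit'].items():
    if val != gr:
        # fruit changed: a new run starts with nothing above it
        gr, seen = val, False
    df.at[ind, dest_col] = seen
    # remember this row for the rows below (assumes blanks are stored as NaN)
    seen = seen or pd.notna(df.at[ind, 'purchased'])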

CodePudding user response：

You could identify changed groups (person or fruit) and assign a group number to those and then do a groupby on that.

Identify changed groups and number them

chg_grp = ((df['person'] != df['person'].shift()) | (df['fruit'] != df['fruit'].shift())).cumsum()
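
To see what this numbering looks like, here is a minimal sketch that rebuilds amy's rows from the question by hand and prints the run labels. The DataFrame construction is an assumption, in particular that blank purchases are stored as NaN.

```python
import numpy as np
import pandas as pd

# Sample rows for amy, rebuilt by hand; blanks are assumed to be NaN
df = pd.DataFrame({
    'person': ['amy'] * 7,
    'fruit': ['apple', 'apple', 'apple', 'banana', 'banana', 'apple', 'apple'],
    'purchased': ['stall', 'counter', 'store', 'online', np.nan, np.nan, 'inperson'],
    'time': ['10:00', '10:01', '10:01', '10:02', '10:03', '10:04', '10:05'],
})

# A row starts a new run when person or fruit differs from the row above
changed = (df['person'] != df['person'].shift()) | (df['fruit'] != df['fruit'].shift())

# The cumulative sum of the change flags gives each run its own number
chg_grp = changed.cumsum()
print(chg_grp.tolist())  # [1, 1, 1, 2, 2, 3, 3]
```

The two apple runs get different numbers (1 and 3) because the banana rows in between trigger a change flag on both sides.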

Groupby those groups and set has_purchased

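def b(x):
    # True where this row or some row above it (within the group) is filled
    filled_so_far = ~x.ffill().isna()
    # shift down one row so each row only looks at the rows above it;
    # the first row of every group therefore becomes False
    rc = filled_so_far.shift(fill_value=False)
    return rc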

dft = df.assign(has_purchased=df.groupby(chg_grp)['purchased'].transform(b))
print(dft)

Result

  person   fruit purchased   time  has_p  has_purchased
0    amy   apple     stall  10:00  False          False
1    amy   apple   counter  10:01   True           True
2    amy   apple     store  10:01   True           True
3    amy  banana    online  10:02  False          False
4    amy  banana       NaN  10:03   True           True
5    amy   apple       NaN  10:04   True          False
6    amy   apple  inperson  10:05  False          False

Note that has_p is the expected column from your original dataframe example. At 10:04, has_p differs from has_purchased. I am assuming the True value at 10:04 was a mistake, given your problem definition.
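
For comparison, the same idea can probably be written without a custom function by counting the filled rows strictly above each row within its run. This is only a sketch under assumptions: the two ben rows are invented to show the person boundary, blanks are assumed to be NaN, and the names run_id and prior are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: amy's rows from the question plus two invented rows for ben
df = pd.DataFrame({
    'person': ['amy'] * 7 + ['ben'] * 2,
    'fruit': ['apple', 'apple', 'apple', 'banana', 'banana',
              'apple', 'apple', 'apple', 'apple'],
    'purchased': ['stall', 'counter', 'store', 'online', np.nan,
                  np.nan, 'inperson', np.nan, 'stall'],
})

# Number each consecutive run of identical (person, fruit) pairs
keys = df[['person', 'fruit']]
run_id = keys.ne(keys.shift()).any(axis=1).cumsum()

# 1 where a purchase is filled in, 0 where it is blank
filled = df['purchased'].notna().astype(int)

# Running count of filled rows per run, minus the current row itself,
# gives the number of filled rows strictly above within the same run
prior = filled.groupby(run_id).cumsum() - filled
df['has_purchased'] = prior > 0

print(df['has_purchased'].tolist())
# [False, True, True, False, True, False, False, False, False]
```

The first ben row starts a new run even though the fruit is still apple, so it does not see amy's earlier purchases.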