For every person and each change in fruit, how could I create a boolean column that says whether, for that person and group, the purchased column has already been filled in on some row above?
df
   person  fruit   purchased  time   has_purchased_already_filled_in_for_the_group
0  amy     apple   stall      10:00  False (this is the start of a new group, so nothing above)
1  amy     apple   counter    10:01  True (because stall was filled at 10:00)
2  amy     apple   store      10:01  True (because stall and counter were filled above)
3  amy     banana  online     10:02  False (this is the start of a new group, so nothing above)
4  amy     banana             10:03  True (because online was filled)
5  amy     apple              10:04  True (this is the start of a new group, so nothing above)
6  amy     apple   inperson   10:05  False (because the 10:04 apple purchase is not filled in)
7  ben     ...
I'm struggling to tell Python that the first and last apple groups are distinct because bananas were bought in between.
CodePudding user response:
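Is there an issue with using plain iteration for this? The loop below still needs an addition for when the person changes. You could also wrap the logic in a function and use df.apply or np.vectorize.
dest_col = 'has_purchased'  # output column name (assumed; not defined in the original snippet)
df[dest_col] = False        # create the column up front
gr = None                   # fruit of the current group
# Note: this only tracks changes in fruit. It marks every later row of a
# fruit run as True and does not check person or whether purchased is filled.
for ind, val in df['fruit'].items():
    if val != gr:
        # first row, or the fruit changed: start a new group
        gr = val
        df.at[ind, dest_col] = False
    else:
        # same fruit as the row above
        df.at[ind, dest_col] = True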
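A minimal sketch of how the loop could be extended, assuming the sample rows from the question, that empty purchases are stored as NaN, and that has_purchased is just an illustrative column name. It restarts the group whenever person or fruit changes, and only reports True once an earlier row in the same run has a filled purchased value:

```python
import numpy as np
import pandas as pd

# Sample rows reconstructed from the question (empty purchases assumed to be NaN).
df = pd.DataFrame({
    "person": ["amy"] * 7,
    "fruit": ["apple", "apple", "apple", "banana", "banana", "apple", "apple"],
    "purchased": ["stall", "counter", "store", "online", np.nan, np.nan, "inperson"],
    "time": ["10:00", "10:01", "10:01", "10:02", "10:03", "10:04", "10:05"],
})

flags = []
prev_key = None        # (person, fruit) of the previous row
seen_purchase = False  # has any earlier row in the current run been filled?
for person, fruit, purchased in zip(df["person"], df["fruit"], df["purchased"]):
    key = (person, fruit)
    if key != prev_key:
        # A change in person or fruit starts a new run, so reset the state.
        prev_key = key
        seen_purchase = False
    flags.append(seen_purchase)  # the flag reflects only the rows above this one
    seen_purchase = seen_purchase or bool(pd.notna(purchased))

df["has_purchased"] = flags  # illustrative column name
print(df)
```

Because the flag is recorded before the current row is inspected, the first row of every run is always False.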
CodePudding user response:
You could identify where the group changes (person or fruit), assign a group number to each run, and then do a groupby on that number.
Identify changed groups and number them
chg_grp = ((df['person'] != df['person'].shift()) | (df['fruit'] != df['fruit'].shift())).cumsum()
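To make the numbering concrete, here is a small check on the question's sample rows (reconstructed here for amy only, which is an assumption about the full data). Each row where person or fruit differs from the row above bumps the counter, so the second apple run gets its own number:

```python
import pandas as pd

# Only the two columns needed to form the groups, taken from the question's sample.
df = pd.DataFrame({
    "person": ["amy"] * 7,
    "fruit": ["apple", "apple", "apple", "banana", "banana", "apple", "apple"],
})

# True on every row that starts a new run; the first row compares against NaN, so it is True.
changed = (df["person"] != df["person"].shift()) | (df["fruit"] != df["fruit"].shift())
chg_grp = changed.cumsum()
print(chg_grp.tolist())  # [1, 1, 1, 2, 2, 3, 3]
```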
Groupby those groups and set has_purchased
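def b(x):
    # True where some earlier row in the group has a non-missing value
    rc = ~x.ffill().isna().shift(fill_value=False)
    # the first row of a group has nothing above it
    rc.iloc[0] = False
    return rc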
dft = df.assign(has_purchased=df.groupby(chg_grp)['purchased'].transform(b))
print(dft)
Result
  person   fruit purchased   time  has_p  has_purchased
0    amy   apple     stall  10:00  False          False
1    amy   apple   counter  10:01   True           True
2    amy   apple     store  10:01   True           True
3    amy  banana    online  10:02  False          False
4    amy  banana       NaN  10:03   True           True
5    amy   apple       NaN  10:04   True          False
6    amy   apple  inperson  10:05  False          False
Note that has_p was from your original dataframe example. At 10:04, has_p is different from has_purchased. I am assuming that a True value for 10:04 was in error given your problem definition.
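As a possible variation, the same flags can be computed without a custom function by counting filled rows per run. This is only a sketch under the assumptions that empty purchases are NaN (blank strings would need converting first) and that the sample rows below match the question:

```python
import numpy as np
import pandas as pd

# Sample rows reconstructed from the question (empty purchases assumed to be NaN).
df = pd.DataFrame({
    "person": ["amy"] * 7,
    "fruit": ["apple", "apple", "apple", "banana", "banana", "apple", "apple"],
    "purchased": ["stall", "counter", "store", "online", np.nan, np.nan, "inperson"],
})

# Number each consecutive run of identical (person, fruit).
chg_grp = ((df["person"] != df["person"].shift()) | (df["fruit"] != df["fruit"].shift())).cumsum()

# 1 where purchased is filled, 0 otherwise.
filled = df["purchased"].notna().astype(int)

# Running count of filled rows within each run, minus the current row,
# gives the number of filled rows strictly above.
filled_above = filled.groupby(chg_grp).cumsum() - filled
df["has_purchased"] = filled_above > 0
print(df)
```

Subtracting the current row's own value is what keeps the first row of each run False even when that row is itself filled.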