I have this demo df:
import pandas as pd

info = {'customer': ['Jason', 'Jason', 'Jason', 'Jason',
                     'Molly', 'Molly', 'Molly', 'Molly'],
        'Good': ['Cookie', 'Cookie', 'Cookie', 'Cookie', 'Ice Cream',
                 'Ice Cream', 'Ice Cream', 'Ice Cream'],
        'Date': ['2021-12-14', '2022-01-04', '2022-01-11', '2022-01-18',
                 '2022-01-12', '2022-01-15', '2022-01-19', '2022-01-30'],
        'Flavor': ['Chocolate', 'Vanilla', 'Vanilla', 'Strawberry',
                   'Chocolate', 'Vanilla', 'Caramel', 'Caramel']}
df = pd.DataFrame(data=info)
df
gives:
customer Good Date Flavor
0 Jason Cookie 2021-12-14 Chocolate
1 Jason Cookie 2022-01-04 Vanilla
2 Jason Cookie 2022-01-11 Vanilla
3 Jason Cookie 2022-01-18 Strawberry
4 Molly Ice Cream 2022-01-12 Chocolate
5 Molly Ice Cream 2022-01-15 Vanilla
6 Molly Ice Cream 2022-01-19 Caramel
7 Molly Ice Cream 2022-01-30 Caramel
I am trying to track the change in flavor per customer per Good in two new columns, From and To. I did the grouping part:
df.sort_values(['Date']).groupby(['customer','Good','Date'])['Flavor'].sum()
I got:
customer Good Date
Jason Cookie 2021-12-14 Chocolate
2022-01-04 Vanilla
2022-01-11 Vanilla
2022-01-18 Strawberry
Molly Ice Cream 2022-01-12 Chocolate
2022-01-15 Vanilla
2022-01-19 Caramel
2022-01-30 Caramel
Name: Flavor, dtype: object
The first row per group is the entry point; for each subsequent row in the group I want to compare against the previous flavor. If it differs, record the change in the new From and To columns; if the value is the same, leave them empty.
I have tried several approaches but have not found a clean way to do this.
Expected output (after reset_index()):
customer Good Date Flavor From To
0 Jason Cookie 2021-12-14 Chocolate
1 Jason Cookie 2022-01-04 Vanilla Chocolate Vanilla
2 Jason Cookie 2022-01-11 Vanilla
3 Jason Cookie 2022-01-18 Strawberry Vanilla Strawberry
4 Molly Ice Cream 2022-01-12 Chocolate
5 Molly Ice Cream 2022-01-15 Vanilla Chocolate Vanilla
6 Molly Ice Cream 2022-01-19 Caramel Vanilla Caramel
7 Molly Ice Cream 2022-01-30 Caramel
CodePudding user response:
# previous flavor within each (customer, Good) group, in date order
s = df.assign(
    From=df.sort_values(by='Date').groupby(['customer', 'Good'])['Flavor'].shift(1),
    To=df['Flavor']
).dropna()
# keep only the rows where the flavor actually changed, then join back
out = df.join(s[s['From'] != s['To']].iloc[:, -2:]).fillna('')
out
customer Good Date Flavor From To
0 Jason Cookie 2021-12-14 Chocolate
1 Jason Cookie 2022-01-04 Vanilla Chocolate Vanilla
2 Jason Cookie 2022-01-11 Vanilla
3 Jason Cookie 2022-01-18 Strawberry Vanilla Strawberry
4 Molly Ice Cream 2022-01-12 Chocolate
5 Molly Ice Cream 2022-01-15 Vanilla Chocolate Vanilla
6 Molly Ice Cream 2022-01-19 Caramel Vanilla Caramel
7 Molly Ice Cream 2022-01-30 Caramel
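For comparison, the same idea can be written a bit more directly with `groupby(...).shift()` and `where`, blanking both columns wherever nothing changed. This is a sketch on the demo data; it assumes each customer/Good pair has at most one row per date:

```python
import pandas as pd

info = {'customer': ['Jason'] * 4 + ['Molly'] * 4,
        'Good': ['Cookie'] * 4 + ['Ice Cream'] * 4,
        'Date': ['2021-12-14', '2022-01-04', '2022-01-11', '2022-01-18',
                 '2022-01-12', '2022-01-15', '2022-01-19', '2022-01-30'],
        'Flavor': ['Chocolate', 'Vanilla', 'Vanilla', 'Strawberry',
                   'Chocolate', 'Vanilla', 'Caramel', 'Caramel']}
df = pd.DataFrame(info)

# previous flavor within each (customer, Good) group, in date order
prev = df.sort_values('Date').groupby(['customer', 'Good'])['Flavor'].shift()

# True only where there is a previous flavor and it differs from the current one
changed = prev.notna() & prev.ne(df['Flavor'])

out = df.assign(From=prev.where(changed, ''),
                To=df['Flavor'].where(changed, ''))
```

The `where(changed, '')` calls replace every non-change row with an empty string in one step, so no dropna/join round-trip is needed.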
CodePudding user response:
Building on the sum you've created (named `g`), we can `groupby` the first two levels of the index and `shift` it, then `join` it back to `g`. After `rename`-ing the column, `mask` the "To" and "From" columns depending on whether there was any change or the previous value is NaN. Finally, select the columns and `reset_index`:
g = df.sort_values(['Date']).groupby(['customer', 'Good', 'Date'])['Flavor'].sum()

# previous flavor within each (customer, Good) group
joined = (g.to_frame().assign(To=g)
           .join(g.groupby(level=[0, 1]).shift().to_frame(),
                 lsuffix='', rsuffix='_')
           .rename(columns={'Flavor_': 'From'}))

# blank out From/To where there was no previous value or no change
joined.update(joined[['To', 'From']].mask(
    joined['From'].isna() | joined['From'].eq(joined['To']), ''))

out = joined[['Flavor', 'From', 'To']].reset_index()
Output:
customer Good Date Flavor From To
0 Jason Cookie 2021-12-14 Chocolate
1 Jason Cookie 2022-01-04 Vanilla Chocolate Vanilla
2 Jason Cookie 2022-01-11 Vanilla
3 Jason Cookie 2022-01-18 Strawberry Vanilla Strawberry
4 Molly Ice Cream 2022-01-12 Chocolate
5 Molly Ice Cream 2022-01-15 Vanilla Chocolate Vanilla
6 Molly Ice Cream 2022-01-19 Caramel Vanilla Caramel
7 Molly Ice Cream 2022-01-30 Caramel
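The join/rename round-trip can also be avoided by assigning the shifted series onto `g`'s frame directly, since both share the same MultiIndex. A sketch on the same demo data, assuming dates are unique per customer/Good pair so that `sum` just passes each single flavor through:

```python
import pandas as pd

info = {'customer': ['Jason'] * 4 + ['Molly'] * 4,
        'Good': ['Cookie'] * 4 + ['Ice Cream'] * 4,
        'Date': ['2021-12-14', '2022-01-04', '2022-01-11', '2022-01-18',
                 '2022-01-12', '2022-01-15', '2022-01-19', '2022-01-30'],
        'Flavor': ['Chocolate', 'Vanilla', 'Vanilla', 'Strawberry',
                   'Chocolate', 'Vanilla', 'Caramel', 'Caramel']}
df = pd.DataFrame(info)

g = df.sort_values(['Date']).groupby(['customer', 'Good', 'Date'])['Flavor'].sum()

out = g.to_frame()
out['From'] = g.groupby(level=[0, 1]).shift()  # previous flavor per (customer, Good)
out['To'] = out['Flavor']

# blank both columns where there is no previous value or no change
no_change = out['From'].isna() | out['From'].eq(out['To'])
out.loc[no_change, ['From', 'To']] = ''
out = out.reset_index()
```

Index alignment does the work here: the shifted series lands on the matching (customer, Good, Date) rows without any explicit join.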