I have this demo df:
import pandas as pd

info = {'customer': ['Jason', 'Jason', 'Jason', 'Jason',
                     'Molly', 'Molly', 'Molly', 'Molly'],
        'Good': ['Cookie', 'Cookie', 'Cookie', 'Cookie', 'Ice Cream',
                 'Ice Cream', 'Ice Cream', 'Ice Cream'],
        'Date': ['2021-12-14', '2022-01-04', '2022-01-11', '2022-01-18',
                 '2022-01-12', '2022-01-15', '2022-01-19', '2022-01-30'],
        'Flavor': ['Chocolate', 'Vanilla', 'Vanilla', 'Strawberry',
                   'Chocolate', 'Vanilla', 'Caramel', 'Caramel']}
df = pd.DataFrame(data=info)
df
gives:
customer Good Date Flavor
0 Jason Cookie 2021-12-14 Chocolate
1 Jason Cookie 2022-01-04 Vanilla
2 Jason Cookie 2022-01-11 Vanilla
3 Jason Cookie 2022-01-18 Strawberry
4 Molly Ice Cream 2022-01-12 Chocolate
5 Molly Ice Cream 2022-01-15 Vanilla
6 Molly Ice Cream 2022-01-19 Caramel
7 Molly Ice Cream 2022-01-30 Caramel
I am trying to track the change in flavor per customer per Good in two new columns, From and To. I did the grouping part:
df.sort_values(['Date']).groupby(['customer','Good','Date'])['Flavor'].sum()
I got:
customer Good Date
Jason Cookie 2021-12-14 Chocolate
2022-01-04 Vanilla
2022-01-11 Vanilla
2022-01-18 Strawberry
Molly Ice Cream 2022-01-12 Chocolate
2022-01-15 Vanilla
2022-01-19 Caramel
2022-01-30 Caramel
Name: Flavor, dtype: object
The first row per group is the entry point; for each subsequent row in the group I want to compare against the previous flavor. If it differs, record the change in the new From and To columns; if the value is the same, leave them empty.
I have tried several approaches but have not found a clean way to do this.
Expected output (after reset_index()):
customer Good Date Flavor From To
0 Jason Cookie 2021-12-14 Chocolate
1 Jason Cookie 2022-01-04 Vanilla Chocolate Vanilla
2 Jason Cookie 2022-01-11 Vanilla
3 Jason Cookie 2022-01-18 Strawberry Vanilla Strawberry
4 Molly Ice Cream 2022-01-12 Chocolate
5 Molly Ice Cream 2022-01-15 Vanilla Chocolate Vanilla
6 Molly Ice Cream 2022-01-19 Caramel Vanilla Caramel
7 Molly Ice Cream 2022-01-30 Caramel
CodePudding user response:
# previous flavor within each (customer, Good) group, in date order
s = df.assign(
    From=df.sort_values(by='Date').groupby(['customer', 'Good'])['Flavor'].shift(1),
    To=df['Flavor']
).dropna()
# keep only the rows where the flavor actually changed, then join back
out = df.join(s[s['From'] != s['To']].iloc[:, -2:]).fillna('')
out
customer Good Date Flavor From To
0 Jason Cookie 2021-12-14 Chocolate
1 Jason Cookie 2022-01-04 Vanilla Chocolate Vanilla
2 Jason Cookie 2022-01-11 Vanilla
3 Jason Cookie 2022-01-18 Strawberry Vanilla Strawberry
4 Molly Ice Cream 2022-01-12 Chocolate
5 Molly Ice Cream 2022-01-15 Vanilla Chocolate Vanilla
6 Molly Ice Cream 2022-01-19 Caramel Vanilla Caramel
7 Molly Ice Cream 2022-01-30 Caramel
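For comparison, the same idea can be written a bit more directly with `groupby(...).shift()` and `where`, blanking both columns wherever nothing changed. This is a sketch on the demo data; it assumes each customer/Good pair has at most one row per date:

```python
import pandas as pd

info = {'customer': ['Jason'] * 4 + ['Molly'] * 4,
        'Good': ['Cookie'] * 4 + ['Ice Cream'] * 4,
        'Date': ['2021-12-14', '2022-01-04', '2022-01-11', '2022-01-18',
                 '2022-01-12', '2022-01-15', '2022-01-19', '2022-01-30'],
        'Flavor': ['Chocolate', 'Vanilla', 'Vanilla', 'Strawberry',
                   'Chocolate', 'Vanilla', 'Caramel', 'Caramel']}
df = pd.DataFrame(info)

# previous flavor within each (customer, Good) group, in date order
prev = df.sort_values('Date').groupby(['customer', 'Good'])['Flavor'].shift()

# True only where there is a previous flavor and it differs from the current one
changed = prev.notna() & prev.ne(df['Flavor'])

out = df.assign(From=prev.where(changed, ''),
                To=df['Flavor'].where(changed, ''))
```

The `where(changed, '')` calls replace every non-change row with an empty string in one step, so no dropna/join round-trip is needed.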
CodePudding user response:
Building on the sum you've created (named `g`), we can `groupby` the first two levels of the index and `shift` it, then `join` it back to `g`. After `rename`-ing the column, `mask` the "To" and "From" columns depending on whether there was any change or the previous value is NaN. Finally, select the columns and `reset_index`:
g = df.sort_values(['Date']).groupby(['customer', 'Good', 'Date'])['Flavor'].sum()

# previous flavor within each (customer, Good) group
joined = (g.to_frame().assign(To=g)
           .join(g.groupby(level=[0, 1]).shift().to_frame(),
                 lsuffix='', rsuffix='_')
           .rename(columns={'Flavor_': 'From'}))

# blank out From/To where there was no previous value or no change
joined.update(joined[['To', 'From']].mask(
    joined['From'].isna() | joined['From'].eq(joined['To']), ''))

out = joined[['Flavor', 'From', 'To']].reset_index()
Output:
customer Good Date Flavor From To
0 Jason Cookie 2021-12-14 Chocolate
1 Jason Cookie 2022-01-04 Vanilla Chocolate Vanilla
2 Jason Cookie 2022-01-11 Vanilla
3 Jason Cookie 2022-01-18 Strawberry Vanilla Strawberry
4 Molly Ice Cream 2022-01-12 Chocolate
5 Molly Ice Cream 2022-01-15 Vanilla Chocolate Vanilla
6 Molly Ice Cream 2022-01-19 Caramel Vanilla Caramel
7 Molly Ice Cream 2022-01-30 Caramel
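The join/rename round-trip can also be avoided by assigning the shifted series onto `g`'s frame directly, since both share the same MultiIndex. A sketch on the same demo data, assuming dates are unique per customer/Good pair so that `sum` just passes each single flavor through:

```python
import pandas as pd

info = {'customer': ['Jason'] * 4 + ['Molly'] * 4,
        'Good': ['Cookie'] * 4 + ['Ice Cream'] * 4,
        'Date': ['2021-12-14', '2022-01-04', '2022-01-11', '2022-01-18',
                 '2022-01-12', '2022-01-15', '2022-01-19', '2022-01-30'],
        'Flavor': ['Chocolate', 'Vanilla', 'Vanilla', 'Strawberry',
                   'Chocolate', 'Vanilla', 'Caramel', 'Caramel']}
df = pd.DataFrame(info)

g = df.sort_values(['Date']).groupby(['customer', 'Good', 'Date'])['Flavor'].sum()

out = g.to_frame()
out['From'] = g.groupby(level=[0, 1]).shift()  # previous flavor per (customer, Good)
out['To'] = out['Flavor']

# blank both columns where there is no previous value or no change
no_change = out['From'].isna() | out['From'].eq(out['To'])
out.loc[no_change, ['From', 'To']] = ''
out = out.reset_index()
```

Index alignment does the work here: the shifted series lands on the matching (customer, Good, Date) rows without any explicit join.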