I am using Python and Pandas to make some calculations over a dataframe that contains millions of rows like the following ones:
city product month sold
Milan Spaghetti 2021-03-31 30300
Milan Spaghetti 2021-06-30 26958
Milan Spaghetti 2021-09-30 26775
Milan Spaghetti 2021-12-31 44185
Milan Spaghetti 2022-03-31 32716
Milan Spaghetti 2022-06-30 25881
Milan Maccheroni 2021-03-31 89584
Milan Maccheroni 2021-06-30 81434
Milan Maccheroni 2021-09-30 83360
Milan Maccheroni 2021-12-31 123945
Milan Maccheroni 2022-03-31 97278
Milan Maccheroni 2022-06-30 82959
Rome Spaghetti 2021-01-31 1524
Rome Spaghetti 2021-04-30 1548
Rome Spaghetti 2021-07-31 1577
Rome Spaghetti 2021-10-31 1438
Rome Spaghetti 2022-01-31 1556
Rome Spaghetti 2022-04-30 1471
Rome Spaghetti 2022-07-31 1453
Rome Maccheroni 2021-01-31 15646
Rome Maccheroni 2021-04-30 15877
Rome Maccheroni 2021-07-31 15289
Rome Maccheroni 2021-10-31 16675
Rome Maccheroni 2022-01-31 17028
Rome Maccheroni 2022-04-30 16490
Rome Maccheroni 2022-07-31 14664
I am looking for a way to calculate the changes between subsequent months for the same city and product, so that the resulting dataframe is like that:
city product month sold change
Milan Spaghetti 31/03/2021 30300
Milan Spaghetti 30/06/2021 26958 -3342
Milan Spaghetti 30/09/2021 26775 -183
Milan Spaghetti 31/12/2021 44185 17410
Milan Spaghetti 31/03/2022 32716 -11469
Milan Spaghetti 30/06/2022 25881 -6835
Milan Maccheroni 31/03/2021 89584
Milan Maccheroni 30/06/2021 81434 -8150
Milan Maccheroni 30/09/2021 83360 1926
Milan Maccheroni 31/12/2021 123945 40585
Milan Maccheroni 31/03/2022 97278 -26667
Milan Maccheroni 30/06/2022 82959 -14319
Rome Spaghetti 31/01/2021 1524
Rome Spaghetti 30/04/2021 1548 24
Rome Spaghetti 31/07/2021 1577 29
Rome Spaghetti 31/10/2021 1438 -139
Rome Spaghetti 31/01/2022 1556 118
Rome Spaghetti 30/04/2022 1471 -85
Rome Spaghetti 31/07/2022 1453 -18
Rome Maccheroni 31/01/2021 15646
Rome Maccheroni 30/04/2021 15877 231
Rome Maccheroni 31/07/2021 15289 -588
Rome Maccheroni 31/10/2021 16675 1386
Rome Maccheroni 31/01/2022 17028 353
Rome Maccheroni 30/04/2022 16490 -538
Rome Maccheroni 31/07/2022 14664 -1826
The code shod calculate the change between two rows only if the city and product field are the same. Is it possible to do it without iterating on the rows ?
Please ignore the change in the month format, it is irrelevant for the solution.
CodePudding user response:
This is easy to vectorize this with the shift
function. You just have to make sure you:
- Sort by categories, date in that order (which it looks like you have)
- Mark as
NaN
the first month for every category after the fact.
The code would look something like this:
category_cols = ["city", "product"]
df = df.sort_values(category_cols ["month"])
# Get difference
df["change"] = df.sold - df.sold.shift(1)
# Get new categories, mark missing
new_cat_mask = (df[category_cols] != df[category_cols].shift(1)).any(axis=1)
df.loc[new_cat_mask, "change"] = np.nan
CodePudding user response:
This is what you are looking for.
difference_df = (df
.assign(difference=lambda x: x.groupby(['city', 'product'])['sold'].transform(lambda x: x.diff()))
)