Home > front end >  calculate difference between rows for the same product
calculate difference between rows for the same product

Time:09-17

I am using Python and Pandas to make some calculations over a dataframe that contains millions of rows like the following ones:

city    product     month       sold
Milan   Spaghetti   2021-03-31  30300
Milan   Spaghetti   2021-06-30  26958
Milan   Spaghetti   2021-09-30  26775
Milan   Spaghetti   2021-12-31  44185
Milan   Spaghetti   2022-03-31  32716
Milan   Spaghetti   2022-06-30  25881
Milan   Maccheroni  2021-03-31  89584
Milan   Maccheroni  2021-06-30  81434
Milan   Maccheroni  2021-09-30  83360
Milan   Maccheroni  2021-12-31  123945
Milan   Maccheroni  2022-03-31  97278
Milan   Maccheroni  2022-06-30  82959
Rome    Spaghetti   2021-01-31  1524
Rome    Spaghetti   2021-04-30  1548
Rome    Spaghetti   2021-07-31  1577
Rome    Spaghetti   2021-10-31  1438
Rome    Spaghetti   2022-01-31  1556
Rome    Spaghetti   2022-04-30  1471
Rome    Spaghetti   2022-07-31  1453
Rome    Maccheroni  2021-01-31  15646
Rome    Maccheroni  2021-04-30  15877
Rome    Maccheroni  2021-07-31  15289
Rome    Maccheroni  2021-10-31  16675
Rome    Maccheroni  2022-01-31  17028
Rome    Maccheroni  2022-04-30  16490
Rome    Maccheroni  2022-07-31  14664

I am looking for a way to calculate the changes between subsequent months for the same city and product, so that the resulting dataframe is like that:

city    product     month       sold    change
Milan   Spaghetti   31/03/2021  30300   
Milan   Spaghetti   30/06/2021  26958   -3342
Milan   Spaghetti   30/09/2021  26775   -183
Milan   Spaghetti   31/12/2021  44185   17410
Milan   Spaghetti   31/03/2022  32716   -11469
Milan   Spaghetti   30/06/2022  25881   -6835
Milan   Maccheroni  31/03/2021  89584   
Milan   Maccheroni  30/06/2021  81434   -8150
Milan   Maccheroni  30/09/2021  83360   1926
Milan   Maccheroni  31/12/2021  123945  40585
Milan   Maccheroni  31/03/2022  97278   -26667
Milan   Maccheroni  30/06/2022  82959   -14319
Rome    Spaghetti   31/01/2021  1524    
Rome    Spaghetti   30/04/2021  1548    24
Rome    Spaghetti   31/07/2021  1577    29
Rome    Spaghetti   31/10/2021  1438    -139
Rome    Spaghetti   31/01/2022  1556    118
Rome    Spaghetti   30/04/2022  1471    -85
Rome    Spaghetti   31/07/2022  1453    -18
Rome    Maccheroni  31/01/2021  15646   
Rome    Maccheroni  30/04/2021  15877   231
Rome    Maccheroni  31/07/2021  15289   -588
Rome    Maccheroni  31/10/2021  16675   1386
Rome    Maccheroni  31/01/2022  17028   353
Rome    Maccheroni  30/04/2022  16490   -538
Rome    Maccheroni  31/07/2022  14664   -1826

The code shod calculate the change between two rows only if the city and product field are the same. Is it possible to do it without iterating on the rows ?

Please ignore the change in the month format, it is irrelevant for the solution.

CodePudding user response:

This is easy to vectorize this with the shift function. You just have to make sure you:

  1. Sort by categories, date in that order (which it looks like you have)
  2. Mark as NaN the first month for every category after the fact.

The code would look something like this:

category_cols = ["city", "product"]
df = df.sort_values(category_cols   ["month"])

# Get difference
df["change"] = df.sold - df.sold.shift(1)

# Get new categories, mark missing
new_cat_mask = (df[category_cols] != df[category_cols].shift(1)).any(axis=1)
df.loc[new_cat_mask, "change"] = np.nan

CodePudding user response:

This is what you are looking for.

difference_df = (df
 .assign(difference=lambda x: x.groupby(['city', 'product'])['sold'].transform(lambda x: x.diff()))
)
  • Related