How to make a new pandas column that's the average of the last 3 values?


Let's say I have a dataframe with 3 columns: dt, unit, sold. What I'd like to know is how to create a new column, called say prior_3_avg, that is, as the name suggests, the average of sold by unit over the past three occurrences of the same day of week as dt. E.g., for unit "1" on May 5th 2020, what is the average of what it sold on April 28th, 21st, and 14th, which are the last three Tuesdays?

Toy sample data:

df = pd.DataFrame({'dt':['2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28','2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28',],'unit':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'sold':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28]})

df['dt'] = pd.to_datetime(df['dt'])

           dt  unit  sold 
0  2020-05-01     1     1
1  2020-05-02     1     2
2  2020-05-03     1     3
3  2020-05-04     1     4
4  2020-05-05     1     5
5  2020-05-06     1     6
...

How would I go about this? I've seen: Pandas new column from groupby averages

That explains how to do a plain groupby on the columns. I figure I could add a "day of week" column, but then I still have the same problem: I want to limit it to the past three matching day-of-week values instead of all of the results.

It could have something to do with this, but it looks more useful for one-off analysis than for making a new column: limit amount of rows as result of groupby Pandas

CodePudding user response:

This should work:

df['dayofweek'] = df['dt'].dt.dayofweek
# mean (not sum) of the last three earlier rows with the same unit and weekday
df['output'] = df.apply(lambda x: df['sold'][(df.index < x.name) & (df['unit'] == x['unit'])
                                             & (df['dayofweek'] == x['dayofweek'])].tail(3).mean(), axis=1)
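
As a quick sanity check of the corrected version above (the date and unit are just picked from the sample data), the value for unit 1 on 2020-05-22, a Friday, should be the average of the three preceding Fridays (sold values 1, 8 and 15):

# expected: 8.0
print(df.loc[(df['dt'] == '2020-05-22') & (df['unit'] == 1), 'output'])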

CodePudding user response:

First, create a new column with the day of the week:

import pandas as pd

date = pd.date_range('2018-12-30', '2019-01-07', freq='D').to_series()
date.dt.dayofweek

That will give you the number for the day of the week; after that, you just need to filter on it and sort the values.
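
Applied to the question's dataframe, a minimal sketch of that idea (the specific date and unit are only illustrative) might look like this:

# last three same-weekday 'sold' values for unit 1 before 2020-05-22 (a Friday, dayofweek == 4)
mask = (df['unit'] == 1) & (df['dt'] < '2020-05-22') & (df['dt'].dt.dayofweek == 4)
df.loc[mask, 'sold'].tail(3).mean()   # -> 8.0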

CodePudding user response:

Here is one idea: First group by unit, then group each unit by weekdays and get the rolling average for n weeks (with closed='left', the last n weeks excluding the current one are used for calculation, which seems to be what you want)...

n = 3
result = (df.groupby('unit')
          .apply(lambda f: (f['sold']
                            .groupby(f.dt.dt.day_name())
                            .rolling(n, closed='left')
                            .mean()
                           )
                )
          )

...which results in this series:

unit  dt           
1     Friday     0      NaN
                 7      NaN
                 14     NaN
                 21     8.0
      Monday     3      NaN
                 10     NaN
                 17     NaN
                 24    11.0
      ...
2     Friday     28     NaN
                 35     NaN
                 42     NaN
                 49     8.0
      Monday     31     NaN
                 38     NaN
                 45     NaN
                 52    11.0
      ...
Name: sold, dtype: float64

Next, get rid of the unit and weekday index levels; we don't need them. Also, rename the series for easier joining.

result = result.reset_index(level=[0, 1], drop=True)
result = result.rename('prior_3_avg')

Back to the mothership...

df2 = df.join(result)

Part of the final result in df2:

           dt  unit  sold  prior_3_avg
... # first 21 are NaN
21 2020-05-22     1    22          8.0
22 2020-05-23     1    23          9.0
23 2020-05-24     1    24         10.0
24 2020-05-25     1    25         11.0
25 2020-05-26     1    26         12.0
26 2020-05-27     1    27         13.0
27 2020-05-28     1    28         14.0
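
For comparison, here is a more compact sketch of the same idea (not part of the original answer, and it assumes the rows are already sorted by date within each unit): group by unit and weekday, then use shift() so the current row is excluded from the window, which matches the closed='left' behaviour above.

weekday = df['dt'].dt.dayofweek
# shift() drops the current row; rolling(3).mean() then averages the three prior same-weekday values
df['prior_3_avg'] = (df.groupby(['unit', weekday])['sold']
                       .transform(lambda s: s.shift().rolling(3).mean()))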