Home > Back-end >  Pyspark calculating 12 months moving average within groups
Pyspark calculating 12 months moving average within groups

Time:12-08

I am currently using Pyspark to do a moving average calculation for the last 12 months for different company group. The data looks like this:

| CALENDAR_DATE| COMPANY | VALUE
| 2021-11-01   | a       |  31
| 2021-10-01   | a       |  31
| 2021-09-01   | a       |  33
| 2021-08-01   | a       |  21
| 2021-07-01   | a       |  25
| 2021-06-01   | a       |  28
| 2021-05-01   | a       |  31
| 2021-04-01   | a       |  31
| 2021-03-01   | a       |  33
| 2021-04-01   | a       |  31
| 2021-03-01   | a       |  33
| 2021-04-01   | a       |  10
| 2021-03-01   | a       |  25
| 2021-04-01   | a       |  30
| 2021-03-01   | a       |  27
| 2021-02-01   | a       |  18
| 2021-01-01   | a       |  15
| 2021-11-01   | b       |  31
| 2021-10-01   | b       |  30
| 2021-09-01   | b       |  31
| 2021-08-01   | b       |  32

and I would like to get an extra column called rolling_average for each company a and b. my code looks like this and it doesn't give me the right answer. I really don't know what is the problem.

from pyspark.sql.functions import *
from pyspark.sql.window import *

w = Window().partitionBy('COMPANY').orderBy('CALENDAR_DATE').rowsBetween(-11, 0)
df = df.withColumn('ROLLING_AVERAGE', round(avg('VALUE').over(w), 1))

CodePudding user response:

You need to use Window rangeBetween instead of rowsBetween. But before convert the CALENDAR_DATE column into timestamp:

from pyspark.sql import Window
from pyspark.sql import functions as F

df = df.withColumn('calendar_timestamp', F.to_timestamp('CALENDAR_DATE').cast("long"))

# 2629800 is the number of seconds in one month
w = Window().partitionBy('COMPANY').orderBy('calendar_timestamp').rangeBetween(-11 * 2629800, 0)

df1 = df.withColumn(
    'ROLLING_AVERAGE',
    F.round(F.avg('VALUE').over(w), 1)
).drop('calendar_timestamp')

df1.show()
# ------------- ------- ----- --------------- 
#|CALENDAR_DATE|COMPANY|VALUE|ROLLING_AVERAGE|
# ------------- ------- ----- --------------- 
#|   2021-08-01|      b|   32|           32.0|
#|   2021-09-01|      b|   31|           31.5|
#|   2021-10-01|      b|   30|           31.0|
#|   2021-11-01|      b|   31|           31.0|
#|   2021-01-01|      a|   15|           15.0|
#|   2021-02-01|      a|   18|           16.5|
#|   2021-03-01|      a|   33|           25.2|
#|   2021-03-01|      a|   33|           25.2|
#|   2021-03-01|      a|   25|           25.2|
#|   2021-03-01|      a|   27|           25.2|
#|   2021-04-01|      a|   31|           25.3|
#|   2021-04-01|      a|   31|           25.3|
#|   2021-04-01|      a|   10|           25.3|
#|   2021-04-01|      a|   30|           25.3|
#|   2021-05-01|      a|   31|           25.8|
#|   2021-06-01|      a|   28|           26.0|
#|   2021-07-01|      a|   25|           25.9|
#|   2021-08-01|      a|   21|           25.6|
#|   2021-09-01|      a|   33|           26.1|
#|   2021-10-01|      a|   31|           26.4|
# ------------- ------- ----- --------------- 
  • Related