How can I find the average of every nth number of rows in PySpark


I have 1440 rows in my dataframe (one row for every minute of the day). I want to convert this into hours so that I have 24 values (rows) left in total.

This is a two-column dataframe. The first column is the minute, the second column is an integer. I would like a 24 × 2 dataframe (24 rows, 2 columns) where the first column is the hour and the second column is the average of the corresponding 60 values.

CodePudding user response:

If your minute column is an integer starting at 0, something along these lines should work (integer-dividing the minute by 60 maps minutes 0-59 to hour 0, 60-119 to hour 1, and so on):

from pyspark.sql import functions as F

# Derive the hour from the minute index, then average the integer column per hour.
hour = F.floor(F.col('minute') / 60).alias('hour')
df = df.groupBy(hour).agg(F.avg('integer').alias('average'))

Here is an example where, to keep it small, I assume every hour has only 3 minutes:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(0, 5),
     (1, 5),
     (2, 5),
     (3, 0),
     (4, 0),
     (5, 1)],
    ['minute', 'integer'])

# With only 3 "minutes" per hour, divide by 3 instead of 60.
hour = F.floor(F.col('minute') / 3).alias('hour')
df = df.groupBy(hour).agg(F.avg('integer').alias('average'))

df.show()
# +----+------------------+
# |hour|           average|
# +----+------------------+
# |   0|               5.0|
# |   1|0.3333333333333333|
# +----+------------------+
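
For completeness, here is a minimal sketch of the same recipe applied to the full 1440-row minute-of-day frame. It assumes the columns are named minute and integer as above, that minute runs 0 through 1439, and that a SparkSession is available as spark; the values and the orderBy are only there for illustration:

from pyspark.sql import functions as F

# Hypothetical 1440-row frame: one row per minute of the day (0..1439),
# with made-up integer values.
minutes_df = spark.createDataFrame(
    [(m, m % 7) for m in range(1440)],
    ['minute', 'integer'])

# Minutes 0-59 fall into hour 0, 60-119 into hour 1, ..., 1380-1439 into hour 23.
hour = F.floor(F.col('minute') / 60).alias('hour')
hourly_df = (minutes_df
             .groupBy(hour)
             .agg(F.avg('integer').alias('average'))
             .orderBy('hour'))

hourly_df.count()   # 24 rows, one average per hour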