How can I find the average of every nth number of rows in PySpark


I have 1440 rows in my dataframe (one row for every minute of the day). I want to convert this into hours so that I have 24 values (rows) left in total.

This is a two-column dataframe. The first column is the minute, the second column is an integer. I would like a 24 × 2 dataframe (24 rows, 2 columns) where the first column is the hour and the second column is the average of the corresponding 60 values.

CodePudding user response:

If your minute column is an integer starting at 0, something along these lines should work (integer-dividing the minute by 60 maps minutes 0-59 to hour 0, 60-119 to hour 1, and so on):

from pyspark.sql import functions as F

# Derive the hour from the minute index, then average the integer column per hour.
hour = F.floor(F.col('minute') / 60).alias('hour')
df = df.groupBy(hour).agg(F.avg('integer').alias('average'))

Here is an example where, to keep it small, I assume every hour has only 3 minutes:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(0, 5),
     (1, 5),
     (2, 5),
     (3, 0),
     (4, 0),
     (5, 1)],
    ['minute', 'integer'])

# With only 3 "minutes" per hour, divide by 3 instead of 60.
hour = F.floor(F.col('minute') / 3).alias('hour')
df = df.groupBy(hour).agg(F.avg('integer').alias('average'))

df.show()
# +----+------------------+
# |hour|           average|
# +----+------------------+
# |   0|               5.0|
# |   1|0.3333333333333333|
# +----+------------------+
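
For completeness, here is a minimal sketch of the same recipe applied to the full 1440-row minute-of-day frame. It assumes the columns are named minute and integer as above, that minute runs 0 through 1439, and that a SparkSession is available as spark; the values and the orderBy are only there for illustration:

from pyspark.sql import functions as F

# Hypothetical 1440-row frame: one row per minute of the day (0..1439),
# with made-up integer values.
minutes_df = spark.createDataFrame(
    [(m, m % 7) for m in range(1440)],
    ['minute', 'integer'])

# Minutes 0-59 fall into hour 0, 60-119 into hour 1, ..., 1380-1439 into hour 23.
hour = F.floor(F.col('minute') / 60).alias('hour')
hourly_df = (minutes_df
             .groupBy(hour)
             .agg(F.avg('integer').alias('average'))
             .orderBy('hour'))

hourly_df.count()   # 24 rows, one average per hour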