I have 1440 rows in my dataframe (one row for every minute of the day). I want to aggregate this into hours so that I have 24 rows left in total.
It is a 2-column dataframe: the first column is minutes, the second is integers. I would like a 24 x 2 dataframe where the first column is the hour and the second column is the average of the corresponding 60 values.
CodePudding user response:
If your minutes column is an integer starting at 0, something along these lines should work:
hour = F.floor(F.col('minute') / 60).alias('hour')
df = df.groupBy(hour).agg(F.avg('integer').alias('average'))
Here is an example where, for brevity, I assume every hour has only 3 minutes:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(0, 5),
     (1, 5),
     (2, 5),
     (3, 0),
     (4, 0),
     (5, 1)],
    ['minute', 'integer'])
hour = F.floor(F.col('minute') / 3).alias('hour')
df = df.groupBy(hour).agg(F.avg('integer').alias('average'))
df.show()
# +----+------------------+
# |hour|           average|
# +----+------------------+
# |   0|               5.0|
# |   1|0.3333333333333333|
# +----+------------------+
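If your dataframe happens to be pandas rather than Spark, the same idea (integer-divide the minute into a bucket, then average per bucket) can be sketched like this. The column names 'minute' and 'integer' are taken from the question; the 3-minute bucket mirrors the toy Spark example above, and you would use 60 on real data:

```python
import pandas as pd

# Toy data matching the Spark example: 6 "minutes", 3 per "hour".
df = pd.DataFrame({'minute': [0, 1, 2, 3, 4, 5],
                   'integer': [5, 5, 5, 0, 0, 1]})

# Bucket minutes into hours with integer division, then average.
out = (df.groupby(df['minute'] // 3)['integer']
         .mean()
         .rename_axis('hour')
         .reset_index(name='average'))
print(out)
```

With 1440 real minutes and `// 60`, this yields the desired 24-row result.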