I have a data frame that looks like this: user ID and dates of activity. I need to calculate the average difference between consecutive dates using RDD functions (such as map and reduce), not SQL.
The dates for each ID need to be sorted in order before calculating the differences, since I need the difference between each pair of consecutive dates.
ID | Date |
---|---|
1 | 2020-09-03 |
1 | 2020-09-03 |
2 | 2020-09-02 |
1 | 2020-09-04 |
2 | 2020-09-06 |
2 | 2020-09-16 |
The needed outcome for this example will be (for ID 1 the sorted gaps are 0 and 1 day, averaging 0.5; for ID 2 they are 4 and 10 days, averaging 7):
ID | average difference |
---|---|
1 | 0.5 |
2 | 7 |
thanks for helping!
CodePudding user response:
You can use `datediff` with a window function to calculate the difference between consecutive dates, then take the average per ID. `lag` is one of the window functions; it returns the value from the previous row within the window.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# define the window
w = Window.partitionBy('ID').orderBy('Date')
# datediff takes the date difference from the first arg to the second arg (first - second).
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
.groupby('ID') # aggregate over ID
.agg(F.avg(F.col('diff')).alias('average difference'))
)
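Since the question asks specifically for RDD functions such as map and reduce rather than the DataFrame API, here is a minimal RDD sketch of the same computation (assuming the same ID and Date columns, with Date as an actual date type; the sample data is rebuilt from the question):

from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question
df = spark.createDataFrame(
    [(1, date(2020, 9, 3)), (1, date(2020, 9, 3)), (2, date(2020, 9, 2)),
     (1, date(2020, 9, 4)), (2, date(2020, 9, 6)), (2, date(2020, 9, 16))],
    ['ID', 'Date'])

def avg_consecutive_diff(dates):
    # sort the dates, take the day gap between each consecutive pair, then average
    s = sorted(dates)
    diffs = [(b - a).days for a, b in zip(s, s[1:])]
    return sum(diffs) / len(diffs) if diffs else None

result = (df.rdd
          .map(lambda row: (row['ID'], [row['Date']]))  # (ID, [Date])
          .reduceByKey(lambda a, b: a + b)               # gather all dates per ID
          .mapValues(avg_consecutive_diff))              # average the consecutive gaps

result.collect()  # e.g. [(1, 0.5), (2, 7.0)]

Collecting all dates per key into a list keeps the map/reduce logic simple, but for IDs with very many rows the window-based version above scales better.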