Pyspark GroupBy time span


I have data with a start and an end date, e.g.:

+---+----------+------------+
| id|     start|         end|
+---+----------+------------+
|  1|2021-05-01|  2022-02-01|
|  2|2021-10-01|  2021-12-01|
|  3|2021-11-01|  2022-01-01|
|  4|2021-06-01|  2021-10-01|
|  5|2022-01-01|  2022-02-01|
|  6|2021-08-01|  2021-12-01|
+---+----------+------------+

I want a count, for each month, of how many observations were "active", in order to display that in a plot. By "active" I mean that the given month falls between the observation's start and end dates. The result for the example data should look like this:

[figure: example plot of the active counts per month]

I have looked into the PySpark Window functions, but I don't think they can help me with my problem. So far my only idea is to add an extra column for each month in the data, flag whether the observation is active in that month, and work from there. But I feel like there must be a much more efficient way to do this.

CodePudding user response:

You can use Spark's sequence SQL function. sequence builds the date range from a start, an end, and an interval, and returns it as an array.

Then you can use explode to flatten the array into one row per month, and count.
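For illustration, this is roughly what sequence returns for a single range (a minimal sketch; the exact show() formatting may differ by Spark version):

spark.sql(
    "SELECT sequence(to_date('2021-05-01'), to_date('2021-08-01'), interval 1 month) AS months"
).show(truncate=False)

# +------------------------------------------------+
# |months                                          |
# +------------------------------------------------+
# |[2021-05-01, 2021-06-01, 2021-07-01, 2021-08-01]|
# +------------------------------------------------+

Note that the end date is included when the interval lands on it exactly, which is what we want here.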

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Make sure your spark session is set to UTC.
# This SQL won't work well with a month interval if timezone is set to a place that has a daylight saving.
spark = (SparkSession
         .builder
         .config('spark.sql.session.timeZone', 'UTC')
         ... # other config
         .getOrCreate())

# For each row, build the list of months from start to end (inclusive),
# then explode to get one row per (id, active month).
df = (df.withColumn('range', F.expr('sequence(to_date(`start`), to_date(`end`), interval 1 month)'))
      .withColumn('observation', F.explode('range')))

# Count how many observations are active in each month.
df = df.groupby('observation').count()
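For reference, here is a self-contained sketch on the example data. The orderBy is only there to make the output readable, and the counts in the trailing comment are derived by hand from the table above:

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .config('spark.sql.session.timeZone', 'UTC')
         .getOrCreate())

df = spark.createDataFrame(
    [(1, '2021-05-01', '2022-02-01'),
     (2, '2021-10-01', '2021-12-01'),
     (3, '2021-11-01', '2022-01-01'),
     (4, '2021-06-01', '2021-10-01'),
     (5, '2022-01-01', '2022-02-01'),
     (6, '2021-08-01', '2021-12-01')],
    ['id', 'start', 'end'])

result = (df.withColumn('observation',
                        F.explode(F.expr('sequence(to_date(`start`), to_date(`end`), interval 1 month)')))
          .groupby('observation').count()
          .orderBy('observation'))

result.show()
# Expected counts per month for this data:
# 2021-05: 1, 2021-06: 2, 2021-07: 2, 2021-08: 3, 2021-09: 3,
# 2021-10: 4, 2021-11: 4, 2021-12: 4, 2022-01: 3, 2022-02: 2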