Home > other >  Group by range of dates from date_start to date_end columns
Group by range of dates from date_start to date_end columns

Time:12-29

I have a table with following table structure:

place_id            date_start       date_end
2826088480774       2017-09-19       2017-09-20
1898375544837       2017-08-01       2017-08-03
1425929142277       2017-09-23       2017-10-03
1013612281863       2016-10-12       2016-10-14
1795296329731       2016-10-13       2016-10-13
695784701956        2017-09-11       2017-11-02

I want to count how many events (each row is an event) was in every place by each month. If event dates refer to several months, it should be counted for all affected months.

place_id could be repeated, so I did the following query:

Select place_id, EXTRACT(MONTH FROM date_start) as 
month, EXTRACT(YEAR FROM date_start) as year, 
COUNT(*) as events
From Table
Group by place_id, year, month
Order by month, year, events desc

So I get following grouped table:

place_id         month      year   events
2826088480774       8       2017     345
1898375544837       8       2017     343
1425929142277       8       2017     344
1013612281863       8       2017     355
1795296329731       8       2017     348
695784701956        8       2017     363

Problem is that data is grouped only by start_date, and its not clear for me how to distribute data by all affected months range from date_start to date_end.

CodePudding user response:

You can use sequence function to generate the dates between date_start and date_end then explode the generated array column and group by and count as you already did:

df.createOrReplaceTempView("EventsTable")

spark.sql("""
    WITH events AS (
        SELECT  place_id, 
                explode(event_dates) as event_date
        FROM    (
            SELECT  place_id, 
                    sequence(date_start, date_end, interval 1 day) as event_dates
            FROM    EventsTable
        )
    )
    
    SELECT  place_id, 
            month(event_date) as month, 
            year(event_date)  as year,
            count(*)          as events
    FROM    events
    GROUP BY 1, 2, 3
    ORDER BY month, year, events desc
""").show()

// ------------- ----- ---- ------ 
//|     place_id|month|year|events|
// ------------- ----- ---- ------ 
//|1898375544837|    8|2017|     3|
//|695784701956 |    9|2017|    20|
//|1425929142277|    9|2017|     8|
//|2826088480774|    9|2017|     2|
//|1013612281863|   10|2016|     3|
//|1795296329731|   10|2016|     1|
//|695784701956 |   10|2017|    31|
//|1425929142277|   10|2017|     3|
//|695784701956 |   11|2017|     2|
// ------------- ----- ---- ------ 
  • Related