I have a table with the following structure:
place_id       date_start  date_end
2826088480774  2017-09-19  2017-09-20
1898375544837  2017-08-01  2017-08-03
1425929142277  2017-09-23  2017-10-03
1013612281863  2016-10-12  2016-10-14
1795296329731  2016-10-13  2016-10-13
695784701956   2017-09-11  2017-11-02
I want to count how many events (each row is an event) occurred at each place in each month. If an event spans several months, it should be counted in every affected month.
place_id can be repeated, so I wrote the following query:
SELECT place_id,
       EXTRACT(MONTH FROM date_start) AS month,
       EXTRACT(YEAR FROM date_start) AS year,
       COUNT(*) AS events
FROM Table
GROUP BY place_id, year, month
ORDER BY month, year, events DESC
So I get the following grouped table:
place_id       month  year  events
2826088480774  8      2017  345
1898375544837  8      2017  343
1425929142277  8      2017  344
1013612281863  8      2017  355
1795296329731  8      2017  348
695784701956   8      2017  363
The problem is that the data is grouped only by date_start, and it is not clear to me how to distribute the counts across all the months covered by the range from date_start to date_end.
CodePudding user response:
You can use the sequence function to generate the dates between date_start and date_end, then explode the generated array column and group by and count as you already did:
df.createOrReplaceTempView("EventsTable")

spark.sql("""
WITH events AS (
    SELECT  place_id,
            explode(event_dates) AS event_date   -- one row per day covered by the event
    FROM (
        SELECT  place_id,
                -- array of every day from date_start to date_end (inclusive)
                sequence(date_start, date_end, interval 1 day) AS event_dates
        FROM    EventsTable
    )
)
SELECT  place_id,
        month(event_date) AS month,
        year(event_date)  AS year,
        count(*)          AS events
FROM    events
GROUP BY 1, 2, 3
ORDER BY month, year, events DESC
""").show()
//+-------------+-----+----+------+
//|     place_id|month|year|events|
//+-------------+-----+----+------+
//|1898375544837|    8|2017|     3|
//| 695784701956|    9|2017|    20|
//|1425929142277|    9|2017|     8|
//|2826088480774|    9|2017|     2|
//|1013612281863|   10|2016|     3|
//|1795296329731|   10|2016|     1|
//| 695784701956|   10|2017|    31|
//|1425929142277|   10|2017|     3|
//| 695784701956|   11|2017|     2|
//+-------------+-----+----+------+
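Note that, because the query explodes one row per day, an event spanning several days contributes one count for every day that falls in a given month (that is why 695784701956 shows 20 for September here). If instead each event should be counted just once per month it touches, a variation of the same idea is to generate the month starts rather than the individual days. This is only a sketch, under the same assumption that date_start and date_end are DATE-typed columns (cast them with to_date first if they are strings):

spark.sql("""
WITH events AS (
    SELECT  place_id,
            -- one row per calendar month touched by the event
            explode(sequence(trunc(date_start, 'MM'), trunc(date_end, 'MM'), interval 1 month)) AS event_month
    FROM    EventsTable
)
SELECT  place_id,
        month(event_month) AS month,
        year(event_month)  AS year,
        count(*)           AS events
FROM    events
GROUP BY 1, 2, 3
ORDER BY month, year, events DESC
""").show()

With that variation, the event for 695784701956 would be counted once each for September, October and November 2017 instead of once per day.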