I am trying to solve "gaps and islands" and group consecutive checks together. My data looks like this
site_id date_id location_id reservation_id revenue
5 20210101 125 792727 100
5 20210101 126 792728 90
5 20210101 228 792757 200
5 20210102 217 792977 50
5 20210102 218 792978 120
5 20210102 219 792979 100
I want to group by consecutive location_id and consecutive reservation_id (both should be consecutive respectively) within same date and site_id, and sum revenue. so for the example above the output should be:
site_id date_id location_id reservation_id revenue
5 20210101 125 792727 190
5 20210101 228 792757 200
5 20210102 217 792977 270
Location_id and reservation_id are of no importance except for this particular task, so a simple MAX() or MIN() for these two columns will work.
CodePudding user response:
Try this one:
with mytable as (
select 5 as site_id, '20210101' as date_id, 125 as location_id, 792727 as reservation_id, 100 as revenue union all
select 5, '20210101', 126, 792728, 90 union all
select 5, '20210101', 228, 792757, 200 union all
select 5, '20210102', 217, 792977, 50 union all
select 5, '20210102', 218, 792978, 120 union all
select 5, '20210102', 219, 792979, 100
)
select site_id, date_id, min(location_id) as location_id, min(reservation_id) as reservation_id, sum(revenue) as revenue
from (
select *, count(nullif(is_new_group, false)) over (order by site_id, date_id, location_id rows unbounded preceding) as new_group_id
from (
select *, coalesce(lag(location_id) over(partition by site_id, date_id order by location_id) != location_id-1, true) as is_new_group
from mytable
) a
) b
group by site_id, date_id, new_group_id
order by new_group_id
CodePudding user response:
Try sessionization:
Two nested queries. First, a counter that is at 0 when a condition is false, at 1 when it's true; in our case, the previous reservation id not being exactly one less than the current one.
The second query queries the first query, and makes a running sum of the counter obtained before. This gives a session id.
Then, group by site id, date id and the obtained session id.
WITH
indata(site_id,date_id,location_id,reservation_id,revenue) AS (
SELECT 5,DATE '2021-01-01',125,792727,100
UNION ALL SELECT 5,DATE '2021-01-01',126,792728,90
UNION ALL SELECT 5,DATE '2021-01-01',228,792757,200
UNION ALL SELECT 5,DATE '2021-01-02',217,792977,50
UNION ALL SELECT 5,DATE '2021-01-02',218,792978,120
UNION ALL SELECT 5,DATE '2021-01-02',219,792979,100
)
,
with_counter AS (
SELECT
site_id
, date_id
, location_id
, reservation_id
, revenue
, CASE
WHEN reservation_id - LAG(reservation_id) OVER(
PARTITION BY site_id ORDER BY date_id,reservation_id
) > 1
THEN 1
ELSE 0
END AS counter
FROM indata
)
,
with_session AS (
SELECT
site_id
, date_id
, location_id
, reservation_id
, revenue
, SUM(counter) OVER(
PARTITION BY site_id ORDER BY date_id,reservation_id
) AS session_id
FROM with_counter
-- test output ...
-- out site_id | date_id | location_id | reservation_id | revenue | session_id
-- out --------- ------------ ------------- ---------------- --------- ------------
-- out 5 | 2021-01-01 | 125 | 792727 | 100 | 0
-- out 5 | 2021-01-01 | 126 | 792728 | 90 | 0
-- out 5 | 2021-01-01 | 228 | 792757 | 200 | 1
-- out 5 | 2021-01-02 | 217 | 792977 | 50 | 2
-- out 5 | 2021-01-02 | 218 | 792978 | 120 | 2
-- out 5 | 2021-01-02 | 219 | 792979 | 100 | 2
)
SELECT
site_id
, date_id
, MIN(location_id ) AS location_id
, MIN(reservation_id) AS reservation_id
, SUM(revenue ) AS revenue
FROM with_session
GROUP BY
site_id
, date_id
, session_id
ORDER BY
site_id
, date_id
;
-- out site_id | date_id | location_id | reservation_id | revenue
-- out --------- ------------ ------------- ---------------- ---------
-- out 5 | 2021-01-01 | 125 | 792727 | 190
-- out 5 | 2021-01-01 | 228 | 792757 | 200
-- out 5 | 2021-01-02 | 217 | 792977 | 270