Home > Blockchain >  Group By based on consequtive flag in Redshift (Gaps and Islands problem)
Group By based on consequtive flag in Redshift (Gaps and Islands problem)

Time:11-16

I am trying to solve "gaps and islands" and group consecutive checks together. My data looks like this

site_id     date_id    location_id    reservation_id    revenue
   5        20210101      125            792727           100
   5        20210101      126            792728           90
   5        20210101      228            792757           200
   5        20210102      217            792977           50
   5        20210102      218            792978           120
   5        20210102      219            792979           100

I want to group by consecutive location_id and consecutive reservation_id (both should be consecutive respectively) within same date and site_id, and sum revenue. so for the example above the output should be:

site_id     date_id    location_id    reservation_id    revenue
   5        20210101      125            792727           190
   5        20210101      228            792757           200
   5        20210102      217            792977           270

Location_id and reservation_id are of no importance except for this particular task, so a simple MAX() or MIN() for these two columns will work.

CodePudding user response:

Try this one:

with mytable as (
  select 5 as site_id, '20210101' as date_id, 125 as location_id, 792727 as reservation_id, 100 as revenue union all
  select 5, '20210101', 126, 792728, 90 union all
  select 5, '20210101', 228, 792757, 200 union all
  select 5, '20210102', 217, 792977, 50 union all
  select 5, '20210102', 218, 792978, 120 union all
  select 5, '20210102', 219, 792979, 100
)
select site_id, date_id, min(location_id) as location_id, min(reservation_id) as reservation_id, sum(revenue) as revenue
from ( 
  select *, count(nullif(is_new_group, false)) over (order by site_id, date_id, location_id rows unbounded preceding) as new_group_id
  from (
    select *, coalesce(lag(location_id) over(partition by site_id, date_id order by location_id) != location_id-1, true) as is_new_group
    from mytable
  ) a
) b
group by site_id, date_id, new_group_id
order by new_group_id

CodePudding user response:

Try sessionization:

Two nested queries. First, a counter that is at 0 when a condition is false, at 1 when it's true; in our case, the previous reservation id not being exactly one less than the current one.

The second query queries the first query, and makes a running sum of the counter obtained before. This gives a session id.

Then, group by site id, date id and the obtained session id.

WITH
indata(site_id,date_id,location_id,reservation_id,revenue) AS (
          SELECT 5,DATE '2021-01-01',125,792727,100
UNION ALL SELECT 5,DATE '2021-01-01',126,792728,90
UNION ALL SELECT 5,DATE '2021-01-01',228,792757,200
UNION ALL SELECT 5,DATE '2021-01-02',217,792977,50
UNION ALL SELECT 5,DATE '2021-01-02',218,792978,120
UNION ALL SELECT 5,DATE '2021-01-02',219,792979,100
)
,
with_counter AS (
  SELECT
    site_id
  , date_id
  , location_id
  , reservation_id
  , revenue
  , CASE
      WHEN reservation_id - LAG(reservation_id) OVER(
         PARTITION BY site_id ORDER BY date_id,reservation_id
      ) > 1
      THEN 1
      ELSE 0
    END AS counter
  FROM indata
)
,
with_session AS (
  SELECT
    site_id
  , date_id
  , location_id
  , reservation_id
  , revenue
  , SUM(counter) OVER(
      PARTITION BY site_id ORDER BY date_id,reservation_id
    ) AS session_id
  FROM with_counter
  -- test output ...
  -- out  site_id |  date_id   | location_id | reservation_id | revenue | session_id 
  -- out --------- ------------ ------------- ---------------- --------- ------------
  -- out        5 | 2021-01-01 |         125 |         792727 |     100 |          0
  -- out        5 | 2021-01-01 |         126 |         792728 |      90 |          0
  -- out        5 | 2021-01-01 |         228 |         792757 |     200 |          1
  -- out        5 | 2021-01-02 |         217 |         792977 |      50 |          2
  -- out        5 | 2021-01-02 |         218 |         792978 |     120 |          2
  -- out        5 | 2021-01-02 |         219 |         792979 |     100 |          2
)
SELECT
  site_id
, date_id
, MIN(location_id   ) AS location_id
, MIN(reservation_id) AS reservation_id
, SUM(revenue       ) AS revenue
FROM with_session
GROUP BY
  site_id
, date_id
, session_id
ORDER BY
  site_id
, date_id
;
-- out  site_id |  date_id   | location_id | reservation_id | revenue 
-- out --------- ------------ ------------- ---------------- ---------
-- out        5 | 2021-01-01 |         125 |         792727 |     190
-- out        5 | 2021-01-01 |         228 |         792757 |     200
-- out        5 | 2021-01-02 |         217 |         792977 |     270                                                                                                                  
  • Related