Home > Software design >  Efficiently get array of all previous dates per id per date limited to past 6 months in BigQuery
Efficiently get array of all previous dates per id per date limited to past 6 months in BigQuery

Time:07-14

I have a very big table 'DATES_EVENTS' (20 T) that looks like this:

ID      DATE   
1   '2022-04-01'
1   '2022-03-02'
1   '2022-03-01'
2   '2022-05-01'
3   '2021-12-01'
3   '2021-11-11'
3   '2020-11-11'
3   '2020-10-01'

I want per each row to get all past dates (per user) limited to up to 6 months.

My desired table:

ID      DATE                   DATE_list
1   '2022-04-01'   ['2022-04-01','2022-03-02','2022-03-01']
1   '2022-03-02'   ['2022-03-02','2022-03-01']
1   '2022-03-01'   ['2022-03-01']
2   '2022-05-01'   ['2022-05-01']
3   '2021-12-01'   ['2021-12-01','2021-11-11']
3   '2021-11-11'   ['2021-11-11']
3   '2020-11-11'   ['2020-11-11','2020-10-01']
3   '2020-10-01'   ['2020-10-01']

I have a solution for all dates not limited:

SELECT 
  ID, DATE, ARRAY_AGG(DATE) OVER (PARTITION BY ID ORDER BY DATE) as DATE_list
FROM
  DATES_EVENTS

But for a limited up to 6 months I don't have an efficient solution:

SELECT
  distinct A.ID, A.DATE, ARRAY_AGG(B.DATE) OVER (PARTITION BY B.ID ORDER BY B.DATE) as DATE_list
FROM 
  DATES_EVENTS A
INNER JOIN
  DATES_EVENTS B
ON
 A.ID=B.ID
 AND B.DATE BETWEEN DATE_SUB(A.DATE, INTERVAL 180 DAY) AND A.DATE
                           
** ruffly a solution

Anyone know of a good and efficient way to do what I need?

CodePudding user response:

Consider below approach

select id, date, array(
  select day 
  from t.date_list day
  where day <= date
  order by day desc
) as date_list
from (
  select *, array_agg(date) over win as date_list
  from dates_events
  window win as (
    partition by id 
    order by extract(year from date) * 12   extract(month from date) 
    range between 5 preceding and current row
  )
) t

if applied to sample data in your question - output is

enter image description here

In case if (as I noticed in your question) 180 days is appropriate substitution for 6 months for you - you can use below simpler version

select *, array_agg(date) over win as date_list
from dates_events
window win as (
  partition by id 
  order by unix_date(date) 
  range between current row and 179 following
)
  • Related