Home > OS >  BigQuery Running Count of Unique ID per Year
BigQuery Running Count of Unique ID per Year

Time:03-09

I found a bunch of similar questions but not addressing this one specifically (correct me if I'm wrong).

I am trying---on BigQuery---to index each row on a table with the running count of user per year using an analytical function.

So with:

with dataset as (
    select 'A' as user, '2020' as year, RAND() as some_value
    union all
    select 'A' as user, '2020' as year, RAND() as some_value
    union all
    select 'B' as user, '2020' as year, RAND() as some_value
    union all
    select 'B' as user, '2020' as year, RAND() as some_value
    union all
    select 'B' as user, '2020' as year, RAND() as some_value
    union all
    select 'C' as user, '2020' as year, RAND() as some_value
    union all
    select 'C' as user, '2020' as year, RAND() as some_value
    union all
    select 'A' as user, '2021' as year, RAND() as some_value
    union all
    select 'A' as user, '2021' as year, RAND() as some_value
    union all
    select 'B' as user, '2021' as year, RAND() as some_value
    union all
    select 'C' as user, '2021' as year, RAND() as some_value
    union all
    select 'C' as user, '2021' as year, RAND() as some_value
    union all
    select 'C' as user, '2021' as year, RAND() as some_value
    union all
    select 'C' as user, '2021' as year, RAND() as some_value
 union all
    select 'C' as user, '2021' as year, RAND() as some_value    
)

I would like to get:

rcount  | user  | year | some_value
1       | A     | 2020 | 0.2365421124968884
1       | A     | 2020 | 0.21087749308191206
2       | B     | 2020 | 0.6096882013526258
2       | B     | 2020 | 0.8544447727632739
2       | B     | 2020 | 0.6113604025541309
3       | C     | 2020 | 0.5803237472480643
3       | C     | 2020 | 0.165305669127888
1       | A     | 2021 | 0.1200575362708826
1       | A     | 2021 | 0.015721175944171915
2       | B     | 2021 | 0.21890252010457295
3       | C     | 2021 | 0.5087613385277634
3       | C     | 2021 | 0.9949262690813603
3       | C     | 2021 | 0.50824183164116
3       | C     | 2021 | 0.8262428736484341
3       | C     | 2021 | 0.6866964737106948

I tried :

count(user) over (partition by year,user )

I also tried using ranges like order by year range between unbounded preceding and current row

and row_count() I have no idea where to tap for a solution now.

CodePudding user response:

Try the following:

select user
    , year
    , some_value
    , sum(count) over (partition by year order by year, user ROWS UNBOUNDED PRECEDING)  as rcount
from (
    select user
        , year
        , some_value
        , IF(lag(user,1) OVER (order by year,user)=user,0,1) count
    from dataset
) 

The sub-select is defining the logic of whether to count the record or not based on what the previous row was, then we simply perform a sum with the outer select.

CodePudding user response:

A simpler solution would be to use DENSE_RANK:

SELECT 
  DENSE_RANK() OVER (PARTITION BY year ORDER BY user) as rcount,
  user,
  year,
  some_value
FROM dataset

Information about DENSE_RANK can be found here.

  • Related