I found a bunch of similar questions but not addressing this one specifically (correct me if I'm wrong).
I am trying---on BigQuery---to index each row on a table with the running count of user per year using an analytical function.
So with:
with dataset as (
select 'A' as user, '2020' as year, RAND() as some_value
union all
select 'A' as user, '2020' as year, RAND() as some_value
union all
select 'B' as user, '2020' as year, RAND() as some_value
union all
select 'B' as user, '2020' as year, RAND() as some_value
union all
select 'B' as user, '2020' as year, RAND() as some_value
union all
select 'C' as user, '2020' as year, RAND() as some_value
union all
select 'C' as user, '2020' as year, RAND() as some_value
union all
select 'A' as user, '2021' as year, RAND() as some_value
union all
select 'A' as user, '2021' as year, RAND() as some_value
union all
select 'B' as user, '2021' as year, RAND() as some_value
union all
select 'C' as user, '2021' as year, RAND() as some_value
union all
select 'C' as user, '2021' as year, RAND() as some_value
union all
select 'C' as user, '2021' as year, RAND() as some_value
union all
select 'C' as user, '2021' as year, RAND() as some_value
union all
select 'C' as user, '2021' as year, RAND() as some_value
)
I would like to get:
rcount | user | year | some_value
1 | A | 2020 | 0.2365421124968884
1 | A | 2020 | 0.21087749308191206
2 | B | 2020 | 0.6096882013526258
2 | B | 2020 | 0.8544447727632739
2 | B | 2020 | 0.6113604025541309
3 | C | 2020 | 0.5803237472480643
3 | C | 2020 | 0.165305669127888
1 | A | 2021 | 0.1200575362708826
1 | A | 2021 | 0.015721175944171915
2 | B | 2021 | 0.21890252010457295
3 | C | 2021 | 0.5087613385277634
3 | C | 2021 | 0.9949262690813603
3 | C | 2021 | 0.50824183164116
3 | C | 2021 | 0.8262428736484341
3 | C | 2021 | 0.6866964737106948
I tried :
count(user) over (partition by year,user )
I also tried using ranges like order by year range between unbounded preceding and current row
and row_count()
I have no idea where to tap for a solution now.
CodePudding user response:
Try the following:
select user
, year
, some_value
, sum(count) over (partition by year order by year, user ROWS UNBOUNDED PRECEDING) as rcount
from (
select user
, year
, some_value
, IF(lag(user,1) OVER (order by year,user)=user,0,1) count
from dataset
)
The sub-select is defining the logic of whether to count the record or not based on what the previous row was, then we simply perform a sum with the outer select.
CodePudding user response:
A simpler solution would be to use DENSE_RANK
:
SELECT
DENSE_RANK() OVER (PARTITION BY year ORDER BY user) as rcount,
user,
year,
some_value
FROM dataset
Information about DENSE_RANK
can be found here.