Nested window functions in Google BigQuery


I want to count, per timestamp, the unique IDs whose last value up to that timestamp is greater than 0, in Google BigQuery SQL. I don't want to GROUP BY because I need the whole table as output. Also, the table has > 1 billion rows, so the query should be efficient.

Imagine I have a table like this:

| ID | value | timestamp  |
|:-- | ----- | ----------:|
| A  | 1     | 2021-01-01 |
| B  | 0     | 2021-01-01 |
| C  | 0     | 2021-01-01 |
| A  | 0     | 2021-01-02 |
| B  | 1     | 2021-01-02 |
| C  | 1     | 2021-01-03 |
| B  | 0     | 2021-01-04 |

the result should look like this:

| ID | value | timestamp  | count_val_gt_0 |
|:-- | ----- | ---------- | --------------:|
| A  | 1     | 2021-01-01 | 1              |
| B  | 0     | 2021-01-01 | 1              |
| C  | 0     | 2021-01-01 | 1              |
| A  | 0     | 2021-01-02 | 1              |
| B  | 1     | 2021-01-02 | 1              |
| C  | 1     | 2021-01-03 | 2              |
| B  | 0     | 2021-01-04 | 1              |

Explanation:

timestamp - set of unique IDs whose last value is > 0

2021-01-01: {A}
2021-01-02: {B}
2021-01-03: {B,C}
2021-01-04: {C}

For timestamp 2021-01-01, only A has a value greater than 0, and there is no timestamp before that. For all rows with timestamp 2021-01-02, I count unique IDs whose last value is greater than 0 over the timestamps 2021-01-01 and 2021-01-02: the last value of A is no longer greater than 0, but now B's is. For timestamp 2021-01-03, the last value of B is still greater than 0, and now C's is too, so I count 2. For timestamp 2021-01-04, B's last value is no longer greater than 0, so it's just C: 1.

What I tried was the following (the query was posted as an image, which is not preserved here).
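Since the original query cannot be recovered, here is a minimal sketch of one way to express the logic without GROUP BY: give each row a per-ID delta (+1 when an ID's value flips from 0 to above 0, -1 when it flips back to 0) and take a running sum of the deltas over timestamps. The table name `project.dataset.events` and the assumption of at most one row per (ID, timestamp) are mine, not from the original:

```sql
WITH deltas AS (
  SELECT
    ID,
    value,
    timestamp,
    -- +1 when this ID's value flips from 0 (or no prior row) to > 0,
    -- -1 when it flips from > 0 back to 0, and 0 otherwise
    CASE
      WHEN value > 0 AND IFNULL(LAG(value) OVER per_id, 0) = 0 THEN 1
      WHEN value = 0 AND IFNULL(LAG(value) OVER per_id, 0) > 0 THEN -1
      ELSE 0
    END AS delta
  FROM `project.dataset.events`
  WINDOW per_id AS (PARTITION BY ID ORDER BY timestamp)
)
SELECT
  ID,
  value,
  timestamp,
  -- running total of IDs whose latest value is > 0; the default RANGE
  -- frame treats rows with the same timestamp as peers, so every row of
  -- a given timestamp gets the same count
  SUM(delta) OVER (ORDER BY timestamp) AS count_val_gt_0
FROM deltas
```

This avoids a self-join and needs only two window passes over the data, although the final running sum is still a single global `ORDER BY timestamp`.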

Please note:

  1. The above is not tested and was written just as an example of an alternative solution to address the ">1 billion rows" issue.
  2. While not fully tested, I did a very quick check and it looks like it works as expected; at least for the dummy example in your question the output is correct.
  3. For small data, the solution already proposed by Jaytiger is more effective, but for really big/heavy cases like yours, I think this approach has a good chance of being more effective.
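
For a quick sanity check, the same sketch can be run with the dummy rows from the question inlined as a CTE (the inlining is my wiring; the expected output is noted in the trailing comment):

```sql
WITH events AS (
  SELECT 'A' AS ID, 1 AS value, DATE '2021-01-01' AS timestamp UNION ALL
  SELECT 'B', 0, DATE '2021-01-01' UNION ALL
  SELECT 'C', 0, DATE '2021-01-01' UNION ALL
  SELECT 'A', 0, DATE '2021-01-02' UNION ALL
  SELECT 'B', 1, DATE '2021-01-02' UNION ALL
  SELECT 'C', 1, DATE '2021-01-03' UNION ALL
  SELECT 'B', 0, DATE '2021-01-04'
),
deltas AS (
  SELECT
    ID, value, timestamp,
    CASE
      WHEN value > 0 AND IFNULL(LAG(value) OVER per_id, 0) = 0 THEN 1
      WHEN value = 0 AND IFNULL(LAG(value) OVER per_id, 0) > 0 THEN -1
      ELSE 0
    END AS delta
  FROM events
  WINDOW per_id AS (PARTITION BY ID ORDER BY timestamp)
)
SELECT
  ID, value, timestamp,
  SUM(delta) OVER (ORDER BY timestamp) AS count_val_gt_0
FROM deltas
ORDER BY timestamp, ID
-- expected count_val_gt_0: 1 for all 2021-01-01 and 2021-01-02 rows,
-- 2 for 2021-01-03, and 1 for 2021-01-04
```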