Home > Back-end >  Find second max value in the moving window frame using SQL / BigQuery
Find second max value in the moving window frame using SQL / BigQuery

Time:11-21

I have a requirement to calculate the second max value in the moving window of the past 120 minutes frame just like I have calculated max value in past 120 minutes moving frame partitioned on device_id.

MAX(Temperature) OVER(PARTITION BY device_type ORDER BY event_ts_seconds RANGE BETWEEN 7200 PRECEDING AND CURRENT ROW) AS Max_Temp_in_120frame

Similarly, I want to calculate the second max temperature.

enter image description here

I have tried the below 2 versions of NTH_VALUE but it is not working as expected.

NTH_VALUE(Temperature,2) OVER(PARTITION BY device_id ORDER BY event_ts_seconds RANGE BETWEEN 7200 PRECEDING AND CURRENT ROW) AS SecondMax_Temp_in_120frame

NTH_VALUE(Temperature,2) OVER(PARTITION BY device_id ORDER BY Temperature DESC RANGE BETWEEN 7200 PRECEDING AND CURRENT ROW) AS SecondMax_Temp_in_120frame

Help on this will be much appreciated

CodePudding user response:

First of all, NTH_VALUE won't work because it is a generalization of the window function FIRST_VALUE, not MAX.

In BigQuery, you can compute the second max value in three steps. First, use ARRAY_AGG to compute the moving array of values:

ARRAY_AGG(value) OVER(ORDER BY ts RANGE BETWEEN 3 PRECEDING AND CURRENT ROW) AS moving_array

This will give you something like this:

ts value moving_array_as_string
1 0.6 [0.6]
2 5.4 [0.6, 5.4]
3 5.4 [0.6, 5.4, 5.4]
4 5.9 [0.6, 5.4, 5.4, 5.9]
5 4.6 [5.4, 5.4, 5.9, 4.6]
6 6 [5.4, 5.9, 4.6, 6]
7 8.7 [5.9, 4.6, 6, 8.7]
8 3.5 [4.6, 6, 8.7, 3.5]
9 4 [6, 8.7, 3.5, 4]
10 0.7 [8.7, 3.5, 4, 0.7]

Once you have that, you can use ROW_NUMBER inside the array with an UNNEST to rank the values, like this:

ARRAY(SELECT AS STRUCT value, row_number() OVER(ORDER BY value DESC) as rank FROM UNNEST(moving_array) as value ORDER BY value DESC) as ranked_moving_array

This will give you something like this (I did not display the rank, but you can see the values have been sorted inside each array):

ts value ranked_moving_array_as_string
1 0.6 [0.6]
2 5.4 [5.4, 0.6]
3 5.4 [5.4, 5.4, 0.6]
4 5.9 [5.9, 5.4, 5.4, 0.6]
5 4.6 [5.9, 5.4, 5.4, 4.6]
6 6 [6, 5.9, 5.4, 4.6]
7 8.7 [8.7, 6, 5.9, 4.6]
8 3.5 [8.7, 6, 4.6, 3.5]
9 4 [8.7, 6, 4, 3.5]
10 0.7 [8.7, 4, 3.5, 0.7]

Finally, you can reuse this ranked_array with another UNNEST to pick the second max value (or any n-th value), like this:

(SELECT value FROM UNNEST(ranked_moving_array) WHERE rank=2) second_max_value_in_3_frame

The final query looks like this:

WITH sample AS (
  SELECT 
    ts,
    ROUND(10*RAND(), 1) as value
  FROM UNNEST (GENERATE_ARRAY(1, 10)) ts
)
, sample_with_moving_array AS (
    SELECT
        ts,
        value,
        ARRAY_AGG(value) OVER(ORDER BY ts RANGE BETWEEN 3 PRECEDING AND CURRENT ROW) AS moving_array,
    FROM sample
)
, sample_with_ranked_moving_array AS (
    SELECT 
        ts,
        value,
        moving_array,
        ARRAY(SELECT AS STRUCT value, row_number() OVER(ORDER BY value DESC) as rank FROM UNNEST(moving_array) as value ORDER BY value DESC) as ranked_moving_array
    FROM sample_with_moving_array
)
SELECT 
    ts,
    value,
    MAX(value) OVER(ORDER BY ts RANGE BETWEEN 3 PRECEDING AND CURRENT ROW) AS max_value_in_3_frame,
    (SELECT value FROM UNNEST(ranked_moving_array) WHERE rank=1) max_value_in_3_frame_alternate_version,
    (SELECT value FROM UNNEST(ranked_moving_array) WHERE rank=2) second_max_value_in_3_frame,
    CONCAT("[", ARRAY_TO_STRING(ARRAY(SELECT CAST(value as STRING) FROM UNNEST(moving_array) value), ", "), "]") as moving_array_as_string,
    CONCAT("[", ARRAY_TO_STRING(ARRAY(SELECT CAST(elem.value as STRING) FROM UNNEST(ranked_moving_array) elem), ", "), "]") as ranked_moving_array_as_string
FROM sample_with_ranked_moving_array

Which returns:

ts value max_value_in_3_frame max_value_in_3_frame_alternate_version second_max_value_in_3_frame moving_array_as_string ranked_moving_array_as_string
1 0.6 0.6 0.6 null [0.6] [0.6]
2 5.4 5.4 5.4 0.6 [0.6, 5.4] [5.4, 0.6]
3 5.4 5.4 5.4 5.4 [0.6, 5.4, 5.4] [5.4, 5.4, 0.6]
4 5.9 5.9 5.9 5.4 [0.6, 5.4, 5.4, 5.9] [5.9, 5.4, 5.4, 0.6]
5 4.6 5.9 5.9 5.4 [5.4, 5.4, 5.9, 4.6] [5.9, 5.4, 5.4, 4.6]
6 6 6 6 5.9 [5.4, 5.9, 4.6, 6] [6, 5.9, 5.4, 4.6]
7 8.7 8.7 8.7 6 [5.9, 4.6, 6, 8.7] [8.7, 6, 5.9, 4.6]
8 3.5 8.7 8.7 6 [4.6, 6, 8.7, 3.5] [8.7, 6, 4.6, 3.5]
9 4 8.7 8.7 6 [6, 8.7, 3.5, 4] [8.7, 6, 4, 3.5]
10 0.7 8.7 8.7 4 [8.7, 3.5, 4, 0.7] [8.7, 4, 3.5, 0.7]

As a control measure, I also computed the max value with the old and new methods to check I had the same results with both.

To conclude, you are lucky to be on BigQuery, as I don't think other SQL engines support all the capabilities used in this query, at least not with such an elegant syntax.

CodePudding user response:

Consider below approach

SELECT * EXCEPT(Array_Temp_in_120frame, Two_Max_Temp_in_120frame),
  Two_Max_Temp_in_120frame[SAFE_OFFSET(0)] Max_Temp_in_120frame, 
  Two_Max_Temp_in_120frame[SAFE_OFFSET(1)] SecondMax_Temp_in_120frame 
FROM (
  SELECT *, ARRAY(
      SELECT Temperature FROM t.Array_Temp_in_120frame Temperature
      ORDER BY Temperature DESC LIMIT 2
    ) Two_Max_Temp_in_120frame
  FROM (
    SELECT *, ARRAY_AGG(Temperature) OVER(
        PARTITION BY device_type ORDER BY event_ts_seconds 
        RANGE BETWEEN 7200 PRECEDING AND CURRENT ROW
      ) AS Array_Temp_in_120frame
    FROM your_table
  ) t
)
  • Related