SQL: Understanding the "n" in ROWS BETWEEN N PRECEEDING AND CURRENT ROW-CodePudding

I have this table ("my_data") in SQL:

     id var_1 var_2
1   1     1     1
2   1     1     1
3   1     0     0
4   1     0     1
5   1     0     1
6   1     1     1
7   2     1     0
8   2     0     0
9   2     0     1
10  2     1     0
11  2     0     1
12  3     0     1
13  4     0     0
14  4     1     0
15  5     1     1
16  5     0     0
17  5     1     0
18  5     1     0
19  5     1     1
20  6     0     0

I want to make the following new variables - For each unique ID:

var_1_lag_2 : take the average value of var_1 for the last two observations (if there are less than two observations, take the average of however many observations there are)
var_1_lag_3 : take the average value of var_1 for the last three observations (if there are less than three observations, take the average of however many observations there are)
var_1_lag_4 : take the average value of var_1 for the last four observations (if there are less than four observations, take the average of however many observations there are)
var_1_lag_all : take the average value of var_1 for as many observations as there are (this would be the same as taking the average of var_1 for each group of ID's and then left joining this back to "my_data")

I found out about the "ROWS BETWEEN PRECEEDING" function in SQL that works for my data (I think "n" can be just replaced with the value of the "lag"):

SELECT *,
  AVG(var_1) over(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) v_1
FROM my_data

Now, I am trying to correctly apply this code for my requirements (source: How to understand the results of rows between 2 preceding and current row?):

# does not run - Error: frame starting offset must be a non-negative integer
    SELECT *,
    AVG(var_1) over(PARTITION BY id  ORDER BY var_1 DESC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) var_1_lag_2,
   AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) var_1_lag_3,
   AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC  ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) var_1_lag_4,
   AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC   ROWS BETWEEN n PRECEDING AND CURRENT ROW) var_1_lag_all
    FROM my_data;

If I remove the last line from this code (i.e. AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN n PRECEDING AND CURRENT ROW) var_1_lag_all), then the code seems to run:

# runs
 CREATE TABLE A AS SELECT *,
    AVG(var_1) over(PARTITION BY id  ORDER BY var_1 DESC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) var_1_lag_2,
   AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) var_1_lag_3,
   AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC  ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) var_1_lag_4
    FROM my_data;

Then, I can just calculate group averages (var_1_lag_all) and then perform the join:

# group averages
CREATE TABLE B AS SELECT id, avg(var_1) as var_1_lag_all from my_data GROUP by id;

#final result
CREATE TABLE FINAL AS SELECT * FROM a LEFT JOIN b ON a.id = b.id;

Can someone please tell me if what I have done is correct?

Thanks!

CodePudding user response：

In the last one you want all the rows in the partition. Just remove the "window frame" since that is the default behavior, as in:

CREATE TABLE A AS 
SELECT *,
  AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) var_1_lag_2,
  AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) var_1_lag_3,
  AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) var_1_lag_4,
  AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC) var_1_lag_all
FROM my_data;

If you want to be technically explicit you can use UNBOUNDED in the place of n as in:

AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) var_1_lag_all