Home > other >  SQL: Understanding the "n" in ROWS BETWEEN N PRECEEDING AND CURRENT ROW
SQL: Understanding the "n" in ROWS BETWEEN N PRECEEDING AND CURRENT ROW

Time:06-24

I have this table ("my_data") in SQL:

     id var_1 var_2
1   1     1     1
2   1     1     1
3   1     0     0
4   1     0     1
5   1     0     1
6   1     1     1
7   2     1     0
8   2     0     0
9   2     0     1
10  2     1     0
11  2     0     1
12  3     0     1
13  4     0     0
14  4     1     0
15  5     1     1
16  5     0     0
17  5     1     0
18  5     1     0
19  5     1     1
20  6     0     0

I want to make the following new variables - For each unique ID:

  • var_1_lag_2 : take the average value of var_1 for the last two observations (if there are less than two observations, take the average of however many observations there are)

  • var_1_lag_3 : take the average value of var_1 for the last three observations (if there are less than three observations, take the average of however many observations there are)

  • var_1_lag_4 : take the average value of var_1 for the last four observations (if there are less than four observations, take the average of however many observations there are)

  • var_1_lag_all : take the average value of var_1 for as many observations as there are (this would be the same as taking the average of var_1 for each group of ID's and then left joining this back to "my_data")

I found out about the "ROWS BETWEEN PRECEEDING" function in SQL that works for my data (I think "n" can be just replaced with the value of the "lag"):

SELECT *,
  AVG(var_1) over(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) v_1
FROM my_data

Now, I am trying to correctly apply this code for my requirements (source: How to understand the results of rows between 2 preceding and current row?):

# does not run - Error: frame starting offset must be a non-negative integer
    SELECT *,
    AVG(var_1) over(PARTITION BY id  ORDER BY var_1 DESC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) var_1_lag_2,
   AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) var_1_lag_3,
   AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC  ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) var_1_lag_4,
   AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC   ROWS BETWEEN n PRECEDING AND CURRENT ROW) var_1_lag_all
    FROM my_data;

If I remove the last line from this code (i.e. AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN n PRECEDING AND CURRENT ROW) var_1_lag_all), then the code seems to run:

# runs
 CREATE TABLE A AS SELECT *,
    AVG(var_1) over(PARTITION BY id  ORDER BY var_1 DESC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) var_1_lag_2,
   AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) var_1_lag_3,
   AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC  ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) var_1_lag_4
    FROM my_data;

Then, I can just calculate group averages (var_1_lag_all) and then perform the join:

# group averages
CREATE TABLE B AS SELECT id, avg(var_1) as var_1_lag_all from my_data GROUP by id;

#final result
CREATE TABLE FINAL AS SELECT * FROM a LEFT JOIN b ON a.id = b.id;

Can someone please tell me if what I have done is correct?

Thanks!

CodePudding user response:

In the last one you want all the rows in the partition. Just remove the "window frame" since that is the default behavior, as in:

CREATE TABLE A AS 
SELECT *,
  AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) var_1_lag_2,
  AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) var_1_lag_3,
  AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) var_1_lag_4,
  AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC) var_1_lag_all
FROM my_data;

If you want to be technically explicit you can use UNBOUNDED in the place of n as in:

AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) var_1_lag_all
  • Related