I have this table ("my_data") in SQL:
id var_1 var_2
1 1 1 1
2 1 1 1
3 1 0 0
4 1 0 1
5 1 0 1
6 1 1 1
7 2 1 0
8 2 0 0
9 2 0 1
10 2 1 0
11 2 0 1
12 3 0 1
13 4 0 0
14 4 1 0
15 5 1 1
16 5 0 0
17 5 1 0
18 5 1 0
19 5 1 1
20 6 0 0
I want to make the following new variables - For each unique ID:
var_1_lag_2 : take the average value of var_1 for the last two observations (if there are less than two observations, take the average of however many observations there are)
var_1_lag_3 : take the average value of var_1 for the last three observations (if there are less than three observations, take the average of however many observations there are)
var_1_lag_4 : take the average value of var_1 for the last four observations (if there are less than four observations, take the average of however many observations there are)
var_1_lag_all : take the average value of var_1 for as many observations as there are (this would be the same as taking the average of var_1 for each group of ID's and then left joining this back to "my_data")
I found out about the "ROWS BETWEEN PRECEEDING" function in SQL that works for my data (I think "n" can be just replaced with the value of the "lag"):
SELECT *,
AVG(var_1) over(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) v_1
FROM my_data
Now, I am trying to correctly apply this code for my requirements (source: How to understand the results of rows between 2 preceding and current row?):
# does not run - Error: frame starting offset must be a non-negative integer
SELECT *,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) var_1_lag_2,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) var_1_lag_3,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) var_1_lag_4,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN n PRECEDING AND CURRENT ROW) var_1_lag_all
FROM my_data;
If I remove the last line from this code (i.e. AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN n PRECEDING AND CURRENT ROW) var_1_lag_all
), then the code seems to run:
# runs
CREATE TABLE A AS SELECT *,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) var_1_lag_2,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) var_1_lag_3,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) var_1_lag_4
FROM my_data;
Then, I can just calculate group averages (var_1_lag_all) and then perform the join:
# group averages
CREATE TABLE B AS SELECT id, avg(var_1) as var_1_lag_all from my_data GROUP by id;
#final result
CREATE TABLE FINAL AS SELECT * FROM a LEFT JOIN b ON a.id = b.id;
Can someone please tell me if what I have done is correct?
Thanks!
CodePudding user response:
In the last one you want all the rows in the partition. Just remove the "window frame" since that is the default behavior, as in:
CREATE TABLE A AS
SELECT *,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) var_1_lag_2,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) var_1_lag_3,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) var_1_lag_4,
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC) var_1_lag_all
FROM my_data;
If you want to be technically explicit you can use UNBOUNDED
in the place of n
as in:
AVG(var_1) over(PARTITION BY id ORDER BY var_1 DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) var_1_lag_all