If I have the following dataframe/tibble in R:
tibble(
period =
c(
"2010Q1",
"2010Q2",
"2010Q3",
"2010Q4",
"2010END",
"2011Q1",
"2011Q2",
"2011Q3",
"2011Q4",
"2011END",
"2012Q1",
"2012Q2",
"2012Q3",
"2012Q4",
"20120END",
"2010Q1",
"2010Q2",
"2010Q3",
"2010Q4",
"2010END",
"2011Q1",
"2011Q2",
"2011Q3",
"2011Q4",
"2011END",
"2012Q1",
"2012Q2",
"2012Q3",
"2012Q4",
"20120END"),
website = c(
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook"
),
values = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1)
)
How can i perform the lag calculation for each period value from the year prior e.g I would like to create the calculation for 2011Q1 - 2010Q1 and so on including 2011END - 2010END
So that i get a table which looks like the following:
tibble(
period =
c(
"2010Q1",
"2010Q2",
"2010Q3",
"2010Q4",
"2010END",
"2011Q1",
"2011Q2",
"2011Q3",
"2011Q4",
"2011END",
"2012Q1",
"2012Q2",
"2012Q3",
"2012Q4",
"20120END",
"2010Q1",
"2010Q2",
"2010Q3",
"2010Q4",
"2010END",
"2011Q1",
"2011Q2",
"2011Q3",
"2011Q4",
"2011END",
"2012Q1",
"2012Q2",
"2012Q3",
"2012Q4",
"20120END"),
website = c(
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook"
),
values = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1),
calculation = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,14,12,10,8,6,4,2,0,-2,-4,-6,-8,-10,-12,-14))
Here it makes sense we cannot compare to the period prior so it is NA - for the year 2011 all values are calculated like this:
- 2010q1 - NA (nothing before)
- 2010q2 - NA (nothing before)
- 2010q3 - NA (nothing before)
- 2010q4 - NA (nothing before)
- 2010END - NA (nothing before)
- 2011q1 - 2010q1
- 2011q2 - 2010q2
- 2011q3 - 2010q3
- 2011q4 - 2010q4
- 2011END - 2010END
Having some issues with using the lag() function when trying to group the data by the period column.
CodePudding user response:
If every year is divided into 5 periods, lag(...,5)
should take the value 5 rows earlier that the one you are calculating.
example %>%
mutate(calculation = values - lag(values,5))
Output:
# A tibble: 30 x 3
period values calculation
<chr> <dbl> <dbl>
1 2010Q1 1 NA
2 2010Q2 2 NA
3 2010Q3 3 NA
4 2010Q4 4 NA
5 2010END 5 NA
6 2011Q1 6 5
7 2011Q2 7 5
8 2011Q3 8 5
9 2011Q4 9 5
10 2011END 10 5
# ... with 20 more rows
EDIT: As @AndrewGB accurately said, group_by(website)
must be added to separate operations for each website. Also, I'm assuming that rows are already arranged by period.
example %>%
group_by(website) %>%
mutate(calculation = values - lag(values,5))
CodePudding user response:
Here is another option using data.table
, where we can use shift
to calculate from the previous year matching value (e.g., 2011Q1 - 2010Q1
). I'm also assuming that you want to do this for each website, so I have added a grouping.
library(data.table)
setDT(dt)[, calculation := values - shift(values, n = 5, type = "lag"), by = website]
Output
period website values calculation
1: 2010Q1 google 1 NA
2: 2010Q2 google 2 NA
3: 2010Q3 google 3 NA
4: 2010Q4 google 4 NA
5: 2010END google 5 NA
6: 2011Q1 google 6 5
7: 2011Q2 google 7 5
8: 2011Q3 google 8 5
9: 2011Q4 google 9 5
10: 2011END google 10 5
11: 2012Q1 google 11 5
12: 2012Q2 google 12 5
13: 2012Q3 google 13 5
14: 2012Q4 google 14 5
15: 20120END google 15 5
16: 2010Q1 facebook 15 NA
17: 2010Q2 facebook 14 NA
18: 2010Q3 facebook 13 NA
19: 2010Q4 facebook 12 NA
20: 2010END facebook 11 NA
21: 2011Q1 facebook 10 -5
22: 2011Q2 facebook 9 -5
23: 2011Q3 facebook 8 -5
24: 2011Q4 facebook 7 -5
25: 2011END facebook 6 -5
26: 2012Q1 facebook 5 -5
27: 2012Q2 facebook 4 -5
28: 2012Q3 facebook 3 -5
29: 2012Q4 facebook 2 -5
30: 20120END facebook 1 -5
CodePudding user response:
If you are lagging not in the context of a panel regression, dplyr::lag
won't be your best bet. This is because it linearly takes the previous elements in the column as the lag.
I have used pivot_wider
to perform these lags and then pivot_longer
to reconstitute the original data frame plus the lagged variable. You may also need to separate your date column into year and quarter columns. Once you've done that, you should be able to calculate the correct lagging. I'll post with corresponding code in a little bit.