Using dplyr, I'm looking to summarise a new column of data as a lagged version of an existing column of grouped data.
Reprex:
dateidx <- as.Date(c("2019-01-02", "2019-01-032", "2019-01-02", "2019-01-07", "2019-01-07", "2019-01-07", "2019-01-10", "2019-01-10"))
A <- c(100, 100, 200, 200, 200, 300, 400, 135)
B <- c(1500, 2000, 1350, 780, 45, 200, 150, 250)
test.df1 <- data.frame(dateidx, A, B)
> test.df1
dateidx A B
1 2019-01-02 100 1500
2 2019-01-03 100 2000
3 2019-01-02 200 1350
4 2019-01-07 200 780
5 2019-01-07 200 45
6 2019-01-07 300 200
7 2019-01-10 400 150
8 2019-01-10 135 250
Objective:
>dateidx <- c("2019-01-02","2019-01-03", "2019-01-07", "2019-01-10")
> sumA <- c(300, 100, 700, 535)
> sumAlag <- c(NA, 300, 100, 700)
> meanB <- c(1425, 2000, 342, 200)
> meanBlag <- c(NA, 1425, 2000, 342)
> test.obj <- data.frame (dateidx2, sumA, sumAlag, meanB, meanBlag)
> test.obj
dateidx sumA sumAlag meanB meanBlag
1 2019-01-02 300 NA 1425 NA
2 2019-01-03 100 300 2000 1425
3 2019-01-07 700 100 342 2000
4 2019-01-10 535 700 200 342
My code:
test.df2 <- test.df1 %>%
group_by(dateidx) %>%
summarise (
sumA = sum(A),
sumAlag = lag(sumA),
meanB = mean(B),
meanBlag =dplyr::lag(meanB)
)
Desired Results:
dateidx sumA sumAlag meanB meanBlag
1 2019-01-02 300 NA 1425 NA
2 2019-01-03 100 300 2000 1425
3 2019-01-07 700 100 342 2000
4 2019-01-10 535 700 200 342
Actual Results:
> test.df2
# A tibble: 4 × 5
dateidx sumA sumAlag meanB meanBlag
<date> <dbl> <dbl> <dbl> <dbl>
1 2019-01-02 300 NA 1425 NA
2 2019-01-03 100 NA 2000 NA
3 2019-01-07 700 NA 342. NA
4 2019-01-10 535 NA 200 NA
Attempts to Fix:
disambiguation (like the dplyr / plyr:: mutate issue)
making a dummy variable
re-specifying the grouping using "order-by"
a dplyr error which was fixed years ago
Sources:
https://dplyr.tidyverse.org/reference/lead-lag.html: but no grouping discussed
https://dplyr.tidyverse.org/reference/lead-lag.html: leads or lags at first order of granularity only
dplyr: lead() and lag() wrong when used with group_by(): about making sure the lag is disambiguated: other issues not relevant
Lag function on grouped data: for python
dplyr lag across groups: lagging explicitly not within groups
CodePudding user response:
You want to first summarise
to get the sum and the mean, then you can use a mutate
statement to get the lag of each column, then rearrange the columns.
library(tidyverse)
test.df2 <- test.df1 %>%
group_by(dateidx) %>%
summarise(sumA = sum(A),
meanB = mean(B)) %>%
mutate(sumAlag = lag(sumA),
meanBlag = lag(meanB)) %>%
select(dateidx, starts_with("sum"), starts_with("mean"))
Output
dateidx sumA sumAlag meanB meanBlag
<date> <dbl> <dbl> <dbl> <dbl>
1 2019-01-02 300 NA 1425 NA
2 2019-01-03 100 300 2000 1425
3 2019-01-07 700 100 342. 2000
4 2019-01-10 535 700 200 342.