The code posted at the bottom does a clean job of extending dates expressed both (a) as elapsed number of months per Period_1
in the code and (b) in YYYY-MM format defined as Period_2
. Item (b) uses the zoo package as.yearmon()
function for extending the YYYY-MM date.
However, when I run this code against the 2m row data file this is intended for, running Period_1
(item (a) above) is very fast (only 3.3 seconds) but running the code for Period_2
(item (b) above) is very slow (taking 1.74 minutes total). Is there a fast alternative to the date extensions for Period_2
, for example using data.table package or just dplyr?
Code:
start_time <- Sys.time()
library(dplyr)
library(tidyr)
library(zoo)
testDF <-
data.frame(
ID = as.numeric(c(rep(1,5),rep(50,3),rep(60,3))),
Period_1 = as.numeric(c(1:5,1:3,1:3)),
Period_2 = c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-10","2012-11","2012-12"),
Bal = as.numeric(c(rep(10,5),21:23,36:34)),
State = c("XX","AA","BB","CC","XX","AA","BB","CC","SS","XX","AA")
)
testPeriod_1 <-
testDF %>%
tidyr::complete(ID, nesting(Period_1)) %>%
tidyr::fill(Bal, State, .direction = "down")
testPeriod_2 <- testPeriod_1 %>%
group_by(ID) %>%
mutate(Period_2 = as.yearmon(first(Period_2)) seq(0, by=1/12, length=n())) %>%
mutate(Period_2 = format(Period_2, "%Y-%m")) %>%
ungroup
end_time <- Sys.time()
end_time - start_time
CodePudding user response:
Avoid the group-by:
library(hutilscpp)
testPeriod_1 %>%
mutate(Period_2 = as.yearmon(Period_2)) %>%
mutate(Period_2 = cumsum_reset(is.na(Period_2), is.na(Period_2) / 12) na.locf(Period_2)) %>%
mutate(Period_2 = format(Period_2, "%Y-%m"))
#Unit: milliseconds
# expr min lq mean median uq max neval
# testPeriod_2_faster 6.0256 6.4595 7.147424 6.54710 6.74730 19.3891 100
# testPeriod_2 16.3351 16.8705 18.686059 17.27665 18.46105 31.5112 100
This assumes the first year-month of a group is never NA
and you only need to extrapolate.