I am having problems with frollapply from the data.table package. I have a dataset with a target_var column that consists of lists of equal length containing either numeric entries or NAs. I want to calculate the number of unique entries in these lists in a sliding window of length 2.
Surprisingly, the function slides through the sequence NA, 1, NA, 2, NA, 3, NA, 4, NA, 5, NA, 6 in steps of two starting with NA, 1. To check this, uncomment the lines in the FUN parameter.
# Packages
lapply(c("data.table","dplyr","tibble","dtplyr"), library, character.only = TRUE)
# Test data
dummy_data <- tribble(
  ~date,        ~target_var,
  "2022-10-20", as.double(list(NA, NA, NA)),
  "2022-10-21", as.double(list(NA, 1, NA)),
  "2022-10-22", as.double(list(2, NA, 3)),
  "2022-10-23", as.double(list(NA, 4, NA)),
  "2022-10-24", as.double(list(5, NA, 6))
)
# Sliding window
dummy_data %>%
  lazy_dt() %>%
  mutate(new_var = data.table::frollapply(
    x = target_var,
    n = 2,
    align = "right",
    FUN = function(x) {
      # browser()
      # print(x)
      x %>%
        unlist(recursive = FALSE,
               use.names = FALSE) %>%
        n_distinct(na.rm = TRUE)
    }
  )) %>%
  as_tibble()
# Expected results
expected_res <- tribble(
  ~date,        ~target_var,                 ~new_var,
  "2022-10-20", as.double(list(NA, NA, NA)), NA,
  "2022-10-21", as.double(list(NA, 1, NA)),  1,
  "2022-10-22", as.double(list(2, NA, 3)),   3,
  "2022-10-23", as.double(list(NA, 4, NA)),  3,
  "2022-10-24", as.double(list(5, NA, 6)),   3
)
However, I expected the sliding window to slide through the rows of the dataset, starting with NA, 1, NA, 2, NA, 3, i.e. the first two lists unpacked. I am not sure, though, whether frollapply can combine the two lists within the sliding window, or what exactly happens.
Another issue is that the new variable is a list and not a single number, which is also unexpected.
The inner FUN works as expected when bypassing frollapply:
dummy_data$target_var %>%
unlist(recursive = FALSE,
use.names = FALSE) %>%
n_distinct(na.rm = TRUE)
I have thought about concatenating the entries rather than creating a list, but the string processing steps turned out to be very inefficient. Does anyone have any idea why frollapply doesn't work as expected in this context, or what I am missing here?
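One possible explanation, going by the data.table documentation for the froll functions: when x is a list, the rolling function is applied to each list element separately rather than across the elements. That would account for both the window sliding inside the vectors and the list-valued result. A minimal sketch, assuming a plain list of two numeric vectors (x and res are illustrative names, not part of the original code):

```r
library(data.table)

# A list input is treated as several independent series,
# so the window of length 2 slides *within* each vector,
# never across two neighbouring list elements.
x <- list(c(NA, 1, NA), c(2, NA, 3))

# Count the non-NA values in each window; as.numeric() keeps the
# return value a double scalar, which frollapply expects.
res <- frollapply(x, n = 2, FUN = function(w) as.numeric(sum(!is.na(w))))

str(res)  # a list with one numeric vector per input element
```

If this is what happens inside mutate(), the list column is handed to frollapply as a list of series, so the rows are never combined.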
CodePudding user response:
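library(purrr)  # for map_dbl()
dummy_data %>%
  mutate(new_var = map_dbl(row_number() - 1,
                           # rows .x and .x + 1; index 0 is silently dropped for the first row
                           ~ target_var[.x:(.x + 1)] %>%
                             unlist() %>%
                             na.omit() %>%
                             n_distinct() %>%
                             na_if(0)))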
# A tibble: 5 × 3
date target_var new_var
<chr> <list> <dbl>
1 2022-10-20 <dbl [3]> NA
2 2022-10-21 <dbl [3]> 1
3 2022-10-22 <dbl [3]> 3
4 2022-10-23 <dbl [3]> 3
5 2022-10-24 <dbl [3]> 3
CodePudding user response:
With data.table: first use sapply to count the non-NA entries in each row, then use frollsum to add up these counts over a window of two rows.
setDT(dummy_data)[,new_var:=frollsum(sapply(target_var,function(x) sum(!is.na(x))),2)][]
date target_var new_var
<char> <list> <num>
1: 2022-10-20 NA,NA,NA NA
2: 2022-10-21 NA, 1,NA 1
3: 2022-10-22 2,NA, 3 3
4: 2022-10-23 NA, 4,NA 3
5: 2022-10-24 5,NA, 6 3
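The one-liner above counts non-missing entries, which equals the number of distinct values only as long as no value repeats within a window. A sketch that counts distinct values directly, assuming the same example data rebuilt as a data.table (count_distinct is a hypothetical helper, not a data.table function):

```r
library(data.table)

dt <- data.table(
  date = c("2022-10-20", "2022-10-21", "2022-10-22", "2022-10-23", "2022-10-24"),
  target_var = list(c(NA, NA, NA), c(NA, 1, NA), c(2, NA, 3), c(NA, 4, NA), c(5, NA, 6))
)

# Hypothetical helper: number of distinct non-NA values in rows (i - k + 1):i
count_distinct <- function(lst, i, k = 2L) {
  idx <- max(1L, i - k + 1L):i
  length(unique(na.omit(unlist(lst[idx]))))
}

# Apply the helper row by row, then turn windows without any value into NA
dt[, new_var := vapply(seq_len(.N), function(i) count_distinct(target_var, i), integer(1))]
dt[new_var == 0L, new_var := NA_integer_]
```

This loops over the rows in R, so it trades the speed of frollsum for correctness when values can repeat.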
Note that the first row will always be NA because partial sums aren't implemented. If you need a partial rolling sum, a workaround is to use an adaptive=TRUE window.
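A minimal sketch of the adaptive workaround, assuming the per-row counts from the example data are already computed (counts, n_adapt and partial are illustrative names). With adaptive=TRUE, n is a vector giving one window width per observation, so the first window can simply be shortened to a single row:

```r
library(data.table)

# Non-NA entries per row of the example data
counts <- c(0, 1, 2, 1, 2)

# Window widths 1, 2, 2, 2, 2: the first row only looks at itself
n_adapt <- pmin(seq_along(counts), 2L)

partial <- frollsum(counts, n = n_adapt, adaptive = TRUE)
partial  # 0 1 3 3 3
```

The first row now yields 0 instead of NA, so it still needs to be mapped to NA if an empty window should count as missing.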