data.table::frollapply does not work as expected when the input is a column of lists

Time:10-28

I am having problems with frollapply from the data.table package. I have a dataset with a target_var column that consists of lists of equal length containing either numeric entries or NAs. I want to calculate the number of unique entries in these lists in a sliding window of length 2.

Surprisingly, the function slides through the sequence NA, 1, NA, 2, NA, 3, NA, 4, NA, 5, NA, 6 in steps of two, starting with NA, 1. To check this, uncomment the browser() or print(x) line inside the FUN argument.

# Packages
lapply(c("data.table","dplyr","tibble","dtplyr"), library, character.only = TRUE)

# Test data
dummy_data <- tribble(
  ~date,         ~target_var,
  "2022-10-20",  as.double(list(NA, NA , NA)),
  "2022-10-21",  as.double(list(NA, 1 , NA)),
  "2022-10-22",  as.double(list(2,  NA, 3)),
  "2022-10-23",  as.double(list(NA, 4, NA)),
  "2022-10-24",  as.double(list(5,  NA, 6))
) 

# Sliding window
dummy_data %>%
  lazy_dt() %>% 
  mutate(new_var = data.table::frollapply(
    x = target_var,
    n = 2,
    align = "right",
    FUN = function(x){
      # browser()
      # print(x)
      x %>%
        unlist(recursive = FALSE,
               use.names = FALSE) %>% 
        n_distinct(na.rm = TRUE)
      }
    )) %>%
  as_tibble()

# Expected results
expected_res <- tribble(
  ~date,         ~target_var,                  ~new_var,
  "2022-10-20",  as.double(list(NA, NA , NA)), NA,
  "2022-10-21",  as.double(list(NA, 1 , NA)),  1,
  "2022-10-22",  as.double(list(2,  NA, 3)),   3,
  "2022-10-23",  as.double(list(NA, 4, NA)),   3,
  "2022-10-24",  as.double(list(5,  NA, 6)),   3
) 

I expected the sliding window to move over the rows of the dataset instead, so that each window contains two consecutive lists unpacked, e.g. NA, 1, NA, 2, NA, 3 for rows two and three. I am not sure whether frollapply can combine the two lists within the sliding window, or what exactly happens.

Another issue is that the new variable is a list and not a single number, which is also unexpected.

The inner FUN works as expected when by-passing frollapply.

dummy_data$target_var %>% 
  unlist(recursive = FALSE,
         use.names = FALSE) %>% 
  n_distinct(na.rm = TRUE)

I have thought about concatenating the entries rather than creating a list, but the string processing steps turned out to be very inefficient. Does anyone have any idea why frollapply doesn't work as expected in this context, or what I am missing here?
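For context on the behaviour described above: as far as the frollapply documentation goes, a list passed as x is treated as a collection of separate vectors, and the window is rolled inside each vector rather than across the list elements. That would explain both the list-valued result and the windows that never span two rows. A minimal sketch with made-up input (the vectors below are illustrative, not the original dataset):

```r
library(data.table)

# Illustrative input: two numeric vectors wrapped in a list,
# similar in shape to two rows of a list column
x <- list(c(NA, 1, NA), c(2, NA, 3))

# The window of length 2 is rolled within each vector independently
res <- frollapply(x, n = 2, FUN = function(w) sum(!is.na(w)))

# One result vector per list element, each as long as its input vector
str(res)
```

The result has one numeric vector per input element, so assigning it to a column yields a list column instead of one number per row.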

CodePudding user response:

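library(purrr)  # provides map_dbl()

# For each row, take the previous and the current list (index 0 is silently dropped),
# flatten them, and count the distinct non-NA values; a count of 0 becomes NA
dummy_data %>%
  mutate(new_var = map_dbl(row_number() - 1,
                 ~target_var[.x:(.x + 1)] %>%
                   unlist() %>%
                   na.omit() %>%
                   n_distinct() %>%
                   na_if(0)))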
# A tibble: 5 × 3
  date       target_var new_var
  <chr>      <list>       <dbl>
1 2022-10-20 <dbl [3]>       NA
2 2022-10-21 <dbl [3]>        1
3 2022-10-22 <dbl [3]>        3
4 2022-10-23 <dbl [3]>        3
5 2022-10-24 <dbl [3]>        3
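The same idea of windowing over row positions can also be sketched with frollapply itself by rolling over the row indices rather than over the list column. This is only a sketch: the dataset is rebuilt as a data.table, and count_distinct is a hypothetical helper introduced for illustration.

```r
library(data.table)

dt <- data.table(
  date = c("2022-10-20", "2022-10-21", "2022-10-22", "2022-10-23", "2022-10-24"),
  target_var = list(
    c(NA_real_, NA, NA),
    c(NA, 1, NA),
    c(2, NA, 3),
    c(NA, 4, NA),
    c(5, NA, 6)
  )
)

# Hypothetical helper: number of distinct non-NA values across the rows in `idx`
count_distinct <- function(idx) {
  vals <- unlist(dt$target_var[idx], use.names = FALSE)
  as.numeric(length(unique(vals[!is.na(vals)])))
}

# Roll over row numbers; each window holds the indices of two consecutive rows
dt[, new_var := frollapply(seq_len(.N), n = 2, FUN = count_distinct)]
dt
```

Because the first window is incomplete, the first row stays NA, which matches the expected result in the question.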

CodePudding user response:

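With data.table: first use sapply to count the non-NA entries in each row, then use frollsum to add up those counts over the window. Note that this counts non-missing entries rather than distinct values, so it matches the expected result only as long as no value repeats within a window.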

setDT(dummy_data)[,new_var:=frollsum(sapply(target_var,function(x) sum(!is.na(x))),2)][]

         date target_var new_var
       <char>     <list>   <num>
1: 2022-10-20   NA,NA,NA      NA
2: 2022-10-21   NA, 1,NA       1
3: 2022-10-22    2,NA, 3       3
4: 2022-10-23   NA, 4,NA       3
5: 2022-10-24    5,NA, 6       3

Note that the first row will always be NA because a partial sum isn't implemented.
If you need a partial rollsum, a workaround is to use an adaptive = TRUE window.
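A minimal sketch of that adaptive workaround, assuming the per-row non-NA counts from the sapply() step have already been computed (they are hard-coded here for the example data, and win is an illustrative name):

```r
library(data.table)

# Non-NA counts per row for the example data, as produced by the sapply() step
counts <- c(0, 1, 2, 1, 2)

# One window size per row: 1 for the first row, 2 from the second row onwards
win <- pmin(seq_along(counts), 2L)

# With adaptive = TRUE, n gives the window length for each observation
partial <- frollsum(counts, n = win, adaptive = TRUE)
partial
```

With this window the first row is computed from a single observation and so yields 0 rather than NA; convert it back to NA if a zero count should be treated as missing.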
