Home > Software design >  Efficiently filter out where all columns are zeros consecutively
Efficiently filter out where all columns are zeros consecutively

Time:11-01

I have a dataset like the one below (actual dataset has 5M rows with no gaps), where I am trying to filter out rows where the sum of all numeric columns for the row itself and its previous and next rows is equal to zero.

N.B.

  • Time is a dttm column in the actual data.
  • Number of consecutive zeros can be more than 3 rows and in that case multiple rows will be filtered out.
# A tibble: 13 x 4
   group  Time  Val1  Val2
   <chr> <int> <dbl> <dbl>
 1 A         1   0     0  
 2 B         1   0.1   0  
 3 A         3   0     0  
 4 B         3   0     0  
 5 A         2   0     0  
 6 B         2   0.2   0.2
 7 B         4   0     0  
 8 A         4   0     0.1
 9 A         5   0     0  
10 A         6   0     0  
11 B         6   0.1   0.5
12 B         5   0.1   0.2
13 A         7   0     0  

See the example below for what is desired:

# A tibble: 13 x 8
   group  Time  Val1  Val2 rowsum leadsum lagsum   sum
   <chr> <int> <dbl> <dbl>  <dbl>   <dbl>  <dbl> <dbl>
 1 A         1   0     0      0       0     NA    NA  
 2 A         2   0     0      0       0      0     0     This will get filtered out! 
 3 A         3   0     0      0       0.1    0     0.1
 4 A         4   0     0.1    0.1     0      0     0.1
 5 A         5   0     0      0       0      0.1   0.1
 6 A         6   0     0      0       0      0     0     This will get filtered out!
 7 A         7   0     0      0      NA      0    NA  
 8 B         1   0.1   0      0.1     0.4   NA    NA  
 9 B         2   0.2   0.2    0.4     0      0.1   0.5
10 B         3   0     0      0       0      0.4   0.4
11 B         4   0     0      0       0.3    0     0.3
12 B         5   0.1   0.2    0.3     0.6    0     0.9
13 B         6   0.1   0.5    0.6    NA      0.3  NA  

So far I have tried to do this simply by using dplyr::lag() and dplyr::lead(); but this is extremely inefficient and throws a memory allocation error for the actual dataset:

>     Error in Sys.getenv("TESTTHAT") : 
>       could not allocate memory (0 Mb) in C function 'R_AllocStringBuffer'

This is what I have so far; I can first get the sum of Val1 and Val2 and then perform lead and lag but that won't resolve the issue.

df0 %>% 
  ##arrange by group is not necessary since we're grouping by that var
  arrange(group, Time) %>% 
  group_by(group) %>% 
  mutate(sum = Val1   Val2   lag(Val1)   lag(Val2)   lead(Val1)   lead(Val2)) # %>% 
  # filter(is.na(sum) | sum != 0)
  ## commenting out filter to show the full results
# >  # A tibble: 13 x 5
# >  # Groups:   group [2]
# >  group  Time  Val1  Val2   sum
# >  <chr> <int> <dbl> <dbl> <dbl>
# >  1  A   1     0     0      NA  
# !  -  A   2     0     0      0  
# >  2  A   3     0     0      0.1
# >  3  A   4     0     0.1    0.1
# >  4  A   5     0     0      0.1
# !  -  A   6     0     0      0  
# >  5  A   7     0     0      NA  
# >  6  B   1     0.1   0      NA  
# >  7  B   2     0.2   0.2    0.5
# >  8  B   3     0     0      0.4
# >  9  B   4     0     0      0.3
# >  10 B   5     0.1   0.2    0.9
# >  11 B   6     0.1   0.5    NA  
Toy dataset:
df0 <- structure(list(group = c("A", "B", "A", "B", "A", "B", 
                                "B", "A", "A", "A", "B", "B", "A"),
                      Time = c(1L, 1L, 3L, 3L, 2L, 2L, 4L, 4L, 5L, 6L, 6L, 5L, 7L), 
                      Val1 = c(0, 0.1, 0, 0, 0, 0.2, 0, 0, 0, 0, 0.1, 0.1, 0), 
                      Val2 = c(0, 0, 0, 0, 0, 0.2, 0, 0.1, 0, 0, 0.5, 0.2, 0)), 
                 row.names = c(NA, -13L), 
                 class = c("tbl_df", "tbl", "data.frame"))

CodePudding user response:

We can use base rle, or its faster implementation, rlenc implemented in the purler package.

library(tidyverse)
library(purler)
subsetter <- function(df){
  df %>%
    select(where(is.double)) %>%
    rowSums() %>%
    purler::rlenc() %>%
    filter(lengths >= 3L & values == 0L) %>%
    transmute(ids = map2(start, start   lengths, ~ (.x   1) : (.y - 2))) %>%
    unlist(use.names = F)
}
# to get data as shown in example
df0 <- df0 %>%
  mutate(Time = as.character(Time)) %>%
  arrange(group, Time)

edge_cases <- tribble(
  ~group, ~Time, ~Val1, ~Val2,
  "C", "1", 0, 0,
  "C", "2", 0, 0,
  "C", "3", 0, 0,
  "C", "4", 0, 0,
)

df1 <- rbind(df0, edge_cases)
df1 %>%
  `[`(-subsetter(.),)

# A tibble: 13 x 4
   group Time   Val1  Val2
   <chr> <chr> <dbl> <dbl>
 1 A     1       0     0  
 2 A     3       0     0  
 3 A     4       0     0.1
 4 A     5       0     0  
 5 A     7       0     0  
 6 B     1       0.1   0  
 7 B     2       0.2   0.2
 8 B     3       0     0  
 9 B     4       0     0  
10 B     5       0.1   0.2
11 B     6       0.1   0.5
12 C     1       0     0  
13 C     4       0     0  
bench::mark(df1 %>% `[`(-subsetter(.),))[,c(3,5,7)]
# A tibble: 1 x 3
    median mem_alloc n_itr
  <bch:tm> <bch:byt> <int>
1   3.91ms    9.38KB    93

CodePudding user response:

Since you tagged , here's a data.table-native solution:

library(data.table)
dt0 <- as.data.table(df0)
setorder(dt0, Time) # add 'group' if you want

isnum <- names(which(sapply(dt0, function(z) is.numeric(z) & !is.integer(z))))
isnum
# [1] "Val1" "Val2"

dt0[, sum0 := abs(rowSums(.SD)) < 1e-9, .SDcols = isnum
  ][, .SD[(c(0,sum0[-.N])   sum0   c(sum0[-1],0)) < 3,], by = .(group)
  ][, sum0 := NULL ]
#      group  Time  Val1  Val2
#     <char> <int> <num> <num>
#  1:      A     1   0.0   0.0
#  2:      A     3   0.0   0.0
#  3:      A     4   0.0   0.1
#  4:      A     5   0.0   0.0
#  5:      A     7   0.0   0.0
#  6:      B     1   0.1   0.0
#  7:      B     2   0.2   0.2
#  8:      B     3   0.0   0.0
#  9:      B     4   0.0   0.0
# 10:      B     5   0.1   0.2
# 11:      B     6   0.1   0.5

Per your comment, both A-2 and A-6 have been removed.

Efficiencies:

  • rowSums is fast and efficient;
  • We shift using direct indexing with a default of 0; in data.table, this is handled very efficiently, and does not incur the (admittedly small) overhead of lead/lag/shift calls;
  • After we sum a row, we only row-shift this one value instead of four row-shifts per row.

CodePudding user response:

library(tidyverse)
df0 %>%
  arrange(group, Time) %>%  # EDIT to arrange by time (and group for clarity)
  rowwise() %>%
  mutate(sum = sum(c_across(Val1:Val2))) %>%
  group_by(group) %>%
  filter( !(sum == 0 & lag(sum, default = 1) == 0 & lead(sum, default = 1) == 0)) %>%
  ungroup()

# A tibble: 11 x 5
   group  Time  Val1  Val2   sum
   <chr> <int> <dbl> <dbl> <dbl>
 1 A         1   0     0     0  
 2 A         3   0     0     0  
 3 A         4   0     0.1   0.1
 4 A         5   0     0     0  
 5 A         7   0     0     0  
 6 B         1   0.1   0     0.1
 7 B         2   0.2   0.2   0.4
 8 B         3   0     0     0  
 9 B         4   0     0     0  
10 B         5   0.1   0.2   0.3
11 B         6   0.1   0.5   0.6
  • Related