I have a dataset like the one below (actual dataset has 5M rows with no gaps), where I am trying to filter out rows where the sum of all numeric columns for the row itself and its previous and next rows is equal to zero.
N.B.
Time
is adttm
column in the actual data.- Number of consecutive zeros can be more than 3 rows and in that case multiple rows will be filtered out.
# A tibble: 13 x 4
group Time Val1 Val2
<chr> <int> <dbl> <dbl>
1 A 1 0 0
2 B 1 0.1 0
3 A 3 0 0
4 B 3 0 0
5 A 2 0 0
6 B 2 0.2 0.2
7 B 4 0 0
8 A 4 0 0.1
9 A 5 0 0
10 A 6 0 0
11 B 6 0.1 0.5
12 B 5 0.1 0.2
13 A 7 0 0
See the example below for what is desired:
# A tibble: 13 x 8
group Time Val1 Val2 rowsum leadsum lagsum sum
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 0 0 NA NA
2 A 2 0 0 0 0 0 0 This will get filtered out!
3 A 3 0 0 0 0.1 0 0.1
4 A 4 0 0.1 0.1 0 0 0.1
5 A 5 0 0 0 0 0.1 0.1
6 A 6 0 0 0 0 0 0 This will get filtered out!
7 A 7 0 0 0 NA 0 NA
8 B 1 0.1 0 0.1 0.4 NA NA
9 B 2 0.2 0.2 0.4 0 0.1 0.5
10 B 3 0 0 0 0 0.4 0.4
11 B 4 0 0 0 0.3 0 0.3
12 B 5 0.1 0.2 0.3 0.6 0 0.9
13 B 6 0.1 0.5 0.6 NA 0.3 NA
So far I have tried to do this simply by using dplyr::lag()
and dplyr::lead()
; but this is extremely inefficient and throws a memory allocation error for the actual dataset:
> Error in Sys.getenv("TESTTHAT") : > could not allocate memory (0 Mb) in C function 'R_AllocStringBuffer'
This is what I have so far; I can first get the sum of Val1
and Val2
and then perform lead
and lag
but that won't resolve the issue.
df0 %>%
##arrange by group is not necessary since we're grouping by that var
arrange(group, Time) %>%
group_by(group) %>%
mutate(sum = Val1 Val2 lag(Val1) lag(Val2) lead(Val1) lead(Val2)) # %>%
# filter(is.na(sum) | sum != 0)
## commenting out filter to show the full results
# > # A tibble: 13 x 5
# > # Groups: group [2]
# > group Time Val1 Val2 sum
# > <chr> <int> <dbl> <dbl> <dbl>
# > 1 A 1 0 0 NA
# ! - A 2 0 0 0
# > 2 A 3 0 0 0.1
# > 3 A 4 0 0.1 0.1
# > 4 A 5 0 0 0.1
# ! - A 6 0 0 0
# > 5 A 7 0 0 NA
# > 6 B 1 0.1 0 NA
# > 7 B 2 0.2 0.2 0.5
# > 8 B 3 0 0 0.4
# > 9 B 4 0 0 0.3
# > 10 B 5 0.1 0.2 0.9
# > 11 B 6 0.1 0.5 NA
Toy dataset:
df0 <- structure(list(group = c("A", "B", "A", "B", "A", "B",
"B", "A", "A", "A", "B", "B", "A"),
Time = c(1L, 1L, 3L, 3L, 2L, 2L, 4L, 4L, 5L, 6L, 6L, 5L, 7L),
Val1 = c(0, 0.1, 0, 0, 0, 0.2, 0, 0, 0, 0, 0.1, 0.1, 0),
Val2 = c(0, 0, 0, 0, 0, 0.2, 0, 0.1, 0, 0, 0.5, 0.2, 0)),
row.names = c(NA, -13L),
class = c("tbl_df", "tbl", "data.frame"))
CodePudding user response:
We can use base rle
, or its faster implementation, rlenc
implemented in the purler
package.
library(tidyverse)
library(purler)
subsetter <- function(df){
df %>%
select(where(is.double)) %>%
rowSums() %>%
purler::rlenc() %>%
filter(lengths >= 3L & values == 0L) %>%
transmute(ids = map2(start, start lengths, ~ (.x 1) : (.y - 2))) %>%
unlist(use.names = F)
}
# to get data as shown in example
df0 <- df0 %>%
mutate(Time = as.character(Time)) %>%
arrange(group, Time)
edge_cases <- tribble(
~group, ~Time, ~Val1, ~Val2,
"C", "1", 0, 0,
"C", "2", 0, 0,
"C", "3", 0, 0,
"C", "4", 0, 0,
)
df1 <- rbind(df0, edge_cases)
df1 %>%
`[`(-subsetter(.),)
# A tibble: 13 x 4
group Time Val1 Val2
<chr> <chr> <dbl> <dbl>
1 A 1 0 0
2 A 3 0 0
3 A 4 0 0.1
4 A 5 0 0
5 A 7 0 0
6 B 1 0.1 0
7 B 2 0.2 0.2
8 B 3 0 0
9 B 4 0 0
10 B 5 0.1 0.2
11 B 6 0.1 0.5
12 C 1 0 0
13 C 4 0 0
bench::mark(df1 %>% `[`(-subsetter(.),))[,c(3,5,7)]
# A tibble: 1 x 3
median mem_alloc n_itr
<bch:tm> <bch:byt> <int>
1 3.91ms 9.38KB 93
CodePudding user response:
Since you tagged data.table, here's a data.table
-native solution:
library(data.table)
dt0 <- as.data.table(df0)
setorder(dt0, Time) # add 'group' if you want
isnum <- names(which(sapply(dt0, function(z) is.numeric(z) & !is.integer(z))))
isnum
# [1] "Val1" "Val2"
dt0[, sum0 := abs(rowSums(.SD)) < 1e-9, .SDcols = isnum
][, .SD[(c(0,sum0[-.N]) sum0 c(sum0[-1],0)) < 3,], by = .(group)
][, sum0 := NULL ]
# group Time Val1 Val2
# <char> <int> <num> <num>
# 1: A 1 0.0 0.0
# 2: A 3 0.0 0.0
# 3: A 4 0.0 0.1
# 4: A 5 0.0 0.0
# 5: A 7 0.0 0.0
# 6: B 1 0.1 0.0
# 7: B 2 0.2 0.2
# 8: B 3 0.0 0.0
# 9: B 4 0.0 0.0
# 10: B 5 0.1 0.2
# 11: B 6 0.1 0.5
Per your comment, both A-2 and A-6 have been removed.
Efficiencies:
rowSums
is fast and efficient;- We shift using direct indexing with a default of
0
; indata.table
, this is handled very efficiently, and does not incur the (admittedly small) overhead oflead
/lag
/shift
calls; - After we sum a row, we only row-shift this one value instead of four row-shifts per row.
CodePudding user response:
library(tidyverse)
df0 %>%
arrange(group, Time) %>% # EDIT to arrange by time (and group for clarity)
rowwise() %>%
mutate(sum = sum(c_across(Val1:Val2))) %>%
group_by(group) %>%
filter( !(sum == 0 & lag(sum, default = 1) == 0 & lead(sum, default = 1) == 0)) %>%
ungroup()
# A tibble: 11 x 5
group Time Val1 Val2 sum
<chr> <int> <dbl> <dbl> <dbl>
1 A 1 0 0 0
2 A 3 0 0 0
3 A 4 0 0.1 0.1
4 A 5 0 0 0
5 A 7 0 0 0
6 B 1 0.1 0 0.1
7 B 2 0.2 0.2 0.4
8 B 3 0 0 0
9 B 4 0 0 0
10 B 5 0.1 0.2 0.3
11 B 6 0.1 0.5 0.6