I am trying to identify which groups in a column contain a run of consecutive non-zero numbers of a specific length. In the basic example below, where the goal is to find the groups with a run of length 5, only group b would be correct.
set.seed(123)
df <- data.frame(
  id = 1:40,
  grp = sort(rep(letters[1:4], 10)),
  x = c(
    c(0, sample(1:10, 3), rep(0, 6)),
    c(0, 0, sample(1:10, 5), rep(0, 3)),
    c(rep(0, 6), sample(1:10, 4)),
    c(0, 0, sample(1:10, 3), 0, sample(1:10, 2), 0, 0))
)
One limited solution, shown below, uses cumsum to count the non-zero values, but it does not work when there are breaks in the sequence: with a target length of 5, group d is incorrectly included.
library(dplyr)
df %>%
  group_by(grp) %>%
  mutate(cc = cumsum(x != 0)) %>%
  filter(cc == 5) %>%
  distinct(grp)
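To see why the running count over-matches, here is a minimal base R sketch. The vector x_d is hypothetical (it only mimics the shape of group d, a run of 3, a break, then a run of 2) and is not the sampled data from df.

```r
# Hypothetical values shaped like group d: run of 3, one zero, run of 2
x_d <- c(0, 0, 3, 7, 2, 0, 5, 9, 0, 0)

# Running count of non-zero values; the zero at position 6 does not reset it
cc <- cumsum(x_d != 0)
print(cc)

# The count still reaches 5 although no single run is 5 long
reaches_five <- any(cc == 5)
print(reaches_five)
```

Because cumsum never resets at a zero, it counts non-zero values in total rather than consecutive ones.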
The desired output for a sequence length of 5 would identify only group b, not d.
CodePudding user response:
You may use rle to find the runs of consecutive non-zero numbers in each group.
library(dplyr)
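# TRUE if x contains a run of exactly n consecutive non-zero values
# (use >= n instead to also accept longer runs)
find_groups <- function(x, n) {
  tmp <- rle(x != 0)
  any(tmp$lengths[tmp$values] == n)
}
# apply the function to each group
df %>%
  group_by(grp) %>%
  dplyr::filter(find_groups(x, 5)) %>%
  ungroup() %>%
  distinct(grp)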
# grp
# <chr>
#1 b
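For readers unfamiliar with rle, a small sketch of what it returns may help. The two vectors are made up for illustration and are not taken from df.

```r
# Hypothetical vectors: one with runs of 3 and 2, one with a run of 5
x_broken <- c(0, 0, 3, 7, 2, 0, 5, 9, 0, 0)
x_full   <- c(0, 0, 4, 1, 8, 6, 2, 0, 0, 0)

# rle() collapses consecutive equal values into (length, value) pairs
runs <- rle(x_broken != 0)
print(runs$lengths)
print(runs$values)

# Keep only the lengths of the non-zero (TRUE) runs
nonzero_broken <- runs$lengths[runs$values]
runs_full <- rle(x_full != 0)
nonzero_full <- runs_full$lengths[runs_full$values]

print(any(nonzero_broken == 5))
print(any(nonzero_full == 5))
```

Indexing lengths by values discards the runs of zeros, so only runs of non-zero numbers are compared against the target length.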
CodePudding user response:
Split each group at its zeros by also grouping on cumsum(x == 0), so that every run of non-zero values lands in its own subgroup. Then filter for the subgroups that contain exactly 5 non-zero rows.
library(dplyr)
df %>%
  group_by(grp, cumsum(x == 0)) %>%
  filter(sum(x != 0) == 5) %>%
  ungroup() %>%
  distinct(grp)
#> # A tibble: 1 × 1
#> grp
#> <chr>
#> 1 b
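The way cumsum(x == 0) carves out the runs can be checked on a single vector. This is only a base R sketch on hypothetical values, using tapply in place of the dplyr grouping.

```r
# Hypothetical values: run of 3, one zero, run of 2
x <- c(0, 0, 3, 7, 2, 0, 5, 9, 0, 0)

# Every zero starts a new block, so a run of non-zeros shares the id
# of the zero that precedes it
block <- cumsum(x == 0)
print(block)

# Number of non-zero values inside each block
nonzero_per_block <- tapply(x != 0, block, sum)
print(nonzero_per_block)
```

Each block holds one zero followed by at most one run of non-zero values, which is why counting the non-zero rows per block gives the run length.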
CodePudding user response:
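In data.table, using rleid to label each run (the extra check on x keeps a run of five zeros from matching):
library(data.table)
setDT(df)[, .N == 5 && x[1] != 0, .(grp, rleid(!x))][(V1), .(grp)]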
grp
1: b
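data.table's rleid gives each run of identical consecutive values its own id. The base R stand-in below is an assumption-laden sketch built on rle, not data.table's implementation, and the vector is hypothetical. It shows that run ids cover runs of zeros too, so a length check alone cannot tell the two kinds of run apart.

```r
# Hypothetical values: a run of 3 non-zeros followed by 5 zeros
x <- c(0, 0, 3, 7, 2, 0, 0, 0, 0, 0)

# Base R stand-in for a run id: one integer per run of equal values
r <- rle(!x)                                   # TRUE marks zeros
run_id <- rep(seq_along(r$lengths), r$lengths)
print(run_id)

# Length alone also matches the run of five zeros
length_only <- any(r$lengths == 5)
# Requiring the run to be non-zero (value FALSE in !x) removes that match
length_and_value <- any(r$lengths == 5 & !r$values)

print(length_only)
print(length_and_value)
```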