Identify which groups contain a sequence of non-zero values

I am trying to identify which groups in a column contain a run of consecutive non-zero numbers of a specific length. In the basic example below, where the goal is to find the groups with a run of length 5, only group b is correct.

set.seed(123)
df <- data.frame(
  id = 1:40,
  grp = rep(letters[1:4], each = 10),
  x = c(
    c(0, sample(1:10, 3), rep(0, 6)), 
    c(0, 0, sample(1:10, 5), rep(0, 3)), 
    c(rep(0, 6), sample(1:10, 4)),
    c(0, 0, sample(1:10, 3), 0, sample(1:10, 2), 0, 0))
)

One limited approach, shown below, uses cumsum to count the non-zero values, but it does not work when there are breaks in the sequence: for a target length of 5, group d is incorrectly included.

library(dplyr)
df %>% 
  group_by(grp) %>% 
  mutate(cc = cumsum(x != 0)) %>% filter(cc == 5) %>% distinct(grp)

The desired output for a sequence length of 5 identifies only group b, not d.

CodePudding user response：

You can use rle to find the runs of consecutive non-zero numbers in each group.

library(dplyr)

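# TRUE if x contains a run of at least n consecutive non-zero values
# (use == instead of >= to require a run of exactly n)
find_groups <- function(x, n) {
  tmp <- rle(x != 0)
  any(tmp$lengths[tmp$values] >= n)
}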

# Apply the function to each group
df %>% 
  group_by(grp) %>%
  dplyr::filter(find_groups(x, 5)) %>%
  ungroup() %>%
  distinct(grp)

#   grp  
#  <chr>
#1 b
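
To see what rle is doing inside find_groups, here is a minimal base R sketch on a toy vector. The vector and the variable names are made up for illustration (they are not the sampled values from the question); the shape mimics group d, with a run of 3, a break, then a run of 2.

```r
# Toy vector (hypothetical values): run of 3 non-zeros, a break, run of 2
x <- c(0, 0, 4, 7, 1, 0, 9, 2, 0, 0)

runs <- rle(x != 0)
runs$values   # FALSE  TRUE FALSE  TRUE FALSE
runs$lengths  # 2 3 1 2 2

# Keep only the lengths of the non-zero runs
nonzero_runs <- runs$lengths[runs$values]  # 3 2

has_run_at_least_5 <- any(nonzero_runs >= 5)  # FALSE: 3 + 2 is never merged
has_run_exactly_3  <- any(nonzero_runs == 3)  # TRUE
```

Because each run is measured separately, the break in the middle keeps the two runs from adding up to 5, which is where the plain cumsum count in the question goes wrong.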

CodePudding user response：

Start a new sub-group at every zero by grouping on cumsum(x == 0) in addition to grp, so that each break in the sequence falls into a different sub-group. Then keep the sub-groups that contain exactly 5 non-zero rows.

library(dplyr)

df %>% 
  group_by(grp, cumsum(x == 0)) %>%
  filter(sum(x != 0) == 5) %>%
  ungroup() %>% 
  distinct(grp)

#> # A tibble: 1 × 1
#>   grp  
#>   <chr>
#> 1 b
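
As a rough illustration of why grouping on cumsum(x == 0) separates the runs, here is a small base R sketch on a toy vector. The values and the names block and nonzero_per_block are assumptions made for this example only, and tapply stands in for the dplyr grouping.

```r
# Toy vector (hypothetical values): run of 3 non-zeros, a break, run of 2
x <- c(0, 0, 4, 7, 1, 0, 9, 2, 0, 0)

# Every zero bumps the counter, so each zero opens a new block that also
# holds the non-zero values following it
block <- cumsum(x == 0)  # 1 2 2 2 2 3 3 3 4 5

# Number of non-zero values inside each block
nonzero_per_block <- tapply(x != 0, block, sum)  # 0 3 2 0 0

any(nonzero_per_block == 5)  # FALSE, although x has 5 non-zero values in total
```

The two runs land in blocks 2 and 3, so neither reaches 5 on its own.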

CodePudding user response：

In data.table:

library(data.table)
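# rleid() also numbers the runs of zeros, so check that the run is non-zero as well as 5 rows long
setDT(df)[, .(hit = .N == 5 && x[1] != 0), by = .(grp, run = rleid(x != 0))][(hit), .(grp = unique(grp))]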

   grp
1:   b
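
For readers unfamiliar with rleid, the sketch below emulates it in base R for a single vector. Note that run_id is a hypothetical helper written for this illustration, not part of data.table, and the toy vector is made up.

```r
# Hypothetical helper: mimics data.table::rleid() for one vector by
# giving every run of identical values its own consecutive id
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), r$lengths)
}

# Toy vector (hypothetical values): run of 3 non-zeros, a break, run of 2
x <- c(0, 0, 4, 7, 1, 0, 9, 2, 0, 0)

ids <- run_id(x != 0)  # 1 1 2 2 2 3 4 4 5 5

# Size of each run, and whether the run consists of non-zero values
run_sizes      <- tapply(x, ids, length)   # 2 3 1 2 2
is_nonzero_run <- tapply(x != 0, ids, all) # FALSE TRUE FALSE TRUE FALSE
```

Runs of zeros receive ids too, so a size check on its own cannot distinguish five zeros in a row from five non-zero values in a row; combining it with a non-zero check avoids that.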