finding the longest stretch of rows of data entries by grouping variable using tidyverse/ other R co-CodePudding

I am not sure if I describe my question with the correct title but the idea is:

I would like to longest stretch of rows of data entries of each group after using group_by() which is also sensitive to the current order of rows. In other words, there are a (or multiple) discontinuities within a group (e.g. after arrange() by some other columns). I would like to get a new column (e.g. mutate()) that labels the rows that are within the longest stretch of each group. below is an example:

data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 3, 1, 2, 2, 2),
           order = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17))

In which, I would like to get a data frame like the following:

data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 3, 1, 2, 2, 2),
           order = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17),
           longest = c(T, T, T, F, F, T, T, T, T, T, F, F, F, F, T, T, T))

CodePudding user response：

in Base R:

df$longest <- with(rle(df$group), 
                    rep(ave(lengths, values, FUN = max) == lengths,lengths))

df
   group order longest
1      1     1    TRUE
2      1     2    TRUE
3      1     3    TRUE
4      2     4   FALSE
5      2     5   FALSE
6      3     6    TRUE
7      3     7    TRUE
8      3     8    TRUE
9      3     9    TRUE
10     3    10    TRUE
11     1    11   FALSE
12     1    12   FALSE
13     3    13   FALSE
14     1    14   FALSE
15     2    15    TRUE
16     2    16    TRUE
17     2    17    TRUE

Another Base R:

a <- rle(df$group)
a$values <- ave(a$lengths, a$values, FUN = max) == a$lengths

df$longest <- inverse.rle(a)

In data.table:

library(data.table)
setDT(df)[, N := .N, by = rleid(group)][, longest := N == max(N), by = group][]

   group order N longest
 1:     1     1 3    TRUE
 2:     1     2 3    TRUE
 3:     1     3 3    TRUE
 4:     2     4 2   FALSE
 5:     2     5 2   FALSE
 6:     3     6 5    TRUE
 7:     3     7 5    TRUE
 8:     3     8 5    TRUE
 9:     3     9 5    TRUE
10:     3    10 5    TRUE
11:     1    11 2   FALSE
12:     1    12 2   FALSE
13:     3    13 1   FALSE
14:     1    14 1   FALSE
15:     2    15 3    TRUE
16:     2    16 3    TRUE
17:     2    17 3    TRUE

CodePudding user response：

We could create a group for the consecutive values in the group column. Then, get the number of rows for those groups, then we can group by group and return TRUE for the rows that have the greatest number of consecutive rows for each group.

library(tidyverse)
  
df %>% 
  group_by(group_weight = cumsum(c(1, diff(group) != 0))) %>% 
  mutate(longest = n()) %>% 
  group_by(group) %>% 
  mutate(longest = longest == max(longest)) %>% 
  ungroup %>% 
  select(-group_weight)

Output

   group order longest
   <dbl> <dbl> <lgl>  
 1     1     1 TRUE   
 2     1     2 TRUE   
 3     1     3 TRUE   
 4     2     4 FALSE  
 5     2     5 FALSE  
 6     3     6 TRUE   
 7     3     7 TRUE   
 8     3     8 TRUE   
 9     3     9 TRUE   
10     3    10 TRUE   
11     1    11 FALSE  
12     1    12 FALSE  
13     3    13 FALSE  
14     1    14 FALSE  
15     2    15 TRUE   
16     2    16 TRUE   
17     2    17 TRUE

If you have a tie among consecutive rows and only want to return the first grouping as T, then you could do something like this:

df2 %>% 
  group_by(group_weight = cumsum(c(1, diff(group) != 0))) %>% 
  mutate(longest = n()) %>% 
  group_by(group) %>% 
  mutate(longest = longest==max(longest)) %>% 
  group_by(longest, .add = TRUE) %>% 
  mutate(x = min(group_weight)) %>% 
  ungroup(longest) %>% 
  mutate(longest = longest == TRUE & group_weight == x & !is.na(x)) %>% 
  ungroup %>% 
  dplyr::select(-c(group_weight, x))

Output

   group order longest
   <dbl> <dbl> <lgl>  
 1     1     1 TRUE   
 2     1     2 TRUE   
 3     1     3 TRUE   
 4     2     4 FALSE  
 5     2     5 FALSE  
 6     3     6 TRUE   
 7     3     7 TRUE   
 8     3     8 TRUE   
 9     3     9 TRUE   
10     3    10 TRUE   
11     1    11 FALSE  
12     1    12 FALSE  
13     3    13 FALSE  
14     1    14 FALSE  
15     2    15 TRUE   
16     2    16 TRUE   
17     2    17 TRUE   
18     1    18 FALSE  
19     1    19 FALSE  
20     1    20 FALSE

Data

df2 <- structure(list(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 
3, 1, 2, 2, 2, 1, 1, 1), order = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)), class = "data.frame", row.names = c(NA, 
-20L))