Summarize a data frame based on consecutive rows with repeated values-CodePudding

I have a data frame with the following structure:

pos<- c(67,125,158,195,235,458,499,526,785,912,999,1525)
v_1<-c("j","c","v","r","s","q","r","r","s","t","u","v")
v_2<-c("c","t","v","r","s","q","r","w","c","c","o","v")
v_3<-c("z","c","v","r","s","q","r","w","c","b","p","v")
v_4<-c("x","w","z","z","s","q","r","w","c","o","t","v")
data<-as.data.frame(cbind(pos,v_1,v_2,v_3,v_4))

In this dataframe it is possible to find the same letters among the different columns in consecutive rows. I need to obtain a separate data frame with the values of the variable "pos" for consecutive rows with shared letters, as can be seen in the figure:

In this figure even though all the columns have the same letter in pos 1525, this row isn’t included since it’s not consecutive with another row with repeated letters.

CodePudding user response：

Solution using tidyr and dplyr:

After pivoting to long, use dplyr::add_count() to find repeated values within each pos;
Within each v, find consecutive rows with repeated values, defined as: >1 repeat and >1 repeat in either preceding or following row;
Create a column containing pos for consecutive rows and NA otherwise;
Take the minimum and maximum to get start and end for each v.

library(tidyr)
library(dplyr)

data %>%
  pivot_longer(!pos, names_to = "v") %>%
  add_count(pos, value) %>%
  group_by(v) %>%
  mutate(consec = ifelse(
    n > 1 & (lag(n) > 1 | lead(n) > 1),
    pos,
    NA
  )) %>%
  summarize(
    start = min(consec, na.rm = TRUE),
    end = max(consec, na.rm = TRUE)
  )

# A tibble: 4 × 3
  v     start end  
  <chr> <chr> <chr>
1 v_1   125   499  
2 v_2   158   785  
3 v_3   125   785  
4 v_4   235   785

Note, not sure if/how you want to handle if there is more than one set of consecutive rows, so this solution doesn’t address that.