I have a data frame with the following structure:
pos<- c(67,125,158,195,235,458,499,526,785,912,999,1525)
v_1<-c("j","c","v","r","s","q","r","r","s","t","u","v")
v_2<-c("c","t","v","r","s","q","r","w","c","c","o","v")
v_3<-c("z","c","v","r","s","q","r","w","c","b","p","v")
v_4<-c("x","w","z","z","s","q","r","w","c","o","t","v")
data<-as.data.frame(cbind(pos,v_1,v_2,v_3,v_4))
In this dataframe it is possible to find the same letters among the different columns in consecutive rows. I need to obtain a separate data frame with the values of the variable "pos" for consecutive rows with shared letters, as can be seen in the figure:
In this figure even though all the columns have the same letter in pos 1525, this row isn’t included since it’s not consecutive with another row with repeated letters.
CodePudding user response:
Solution using tidyr and dplyr:
- After pivoting to long, use
dplyr::add_count()
to find repeated values within eachpos
; - Within each
v
, find consecutive rows with repeated values, defined as: >1 repeat and >1 repeat in either preceding or following row; - Create a column containing
pos
for consecutive rows andNA
otherwise; - Take the minimum and maximum to get
start
andend
for eachv
.
library(tidyr)
library(dplyr)
data %>%
pivot_longer(!pos, names_to = "v") %>%
add_count(pos, value) %>%
group_by(v) %>%
mutate(consec = ifelse(
n > 1 & (lag(n) > 1 | lead(n) > 1),
pos,
NA
)) %>%
summarize(
start = min(consec, na.rm = TRUE),
end = max(consec, na.rm = TRUE)
)
# A tibble: 4 × 3
v start end
<chr> <chr> <chr>
1 v_1 125 499
2 v_2 158 785
3 v_3 125 785
4 v_4 235 785
Note, not sure if/how you want to handle if there is more than one set of consecutive rows, so this solution doesn’t address that.