I have the following sequence:
my_seq <- "----?????-----?V?D????-------???IL??A?---"
What I want to do is detect the ranges of positions occupied by non-dash characters.
----?????-----?V?D????-------???IL??A?---
|   |   |     |      |       |       |
1   5   9     15     22      30      38
The final output will be a vector of strings:
out <- c("5-9", "15-22", "30-38")
How can I achieve that with R?
CodePudding user response:
Please find below another possible solution using the stringr
library.
Reprex
- Code
library(stringr)
s <- as.data.frame(str_locate_all(my_seq, "[^-]+")[[1]])
result <- paste(s$start, s$end, sep ="-")
- Output
result
#> [1] "5-9" "15-22" "30-38"
Created on 2022-02-18 by the reprex package (v2.0.1)
CodePudding user response:
You could do:
my_seq <- "----?????-----?V?D????-------???IL??A?---"
non_dash <- which(strsplit(my_seq, "")[[1]] != '-')
pos <- non_dash[c(0, diff(non_dash)) != 1 | c(diff(non_dash), 0) != 1]
apply(matrix(pos, ncol = 2, byrow = TRUE), 1, function(x) paste(x, collapse = "-"))
#> [1] "5-9" "15-22" "30-38"
Created on 2022-02-18 by the reprex package (v2.0.1)
CodePudding user response:
Here is a tidyverse approach based on rle:
library(dplyr)
with(rle(strsplit(my_seq, "")[[1]] != "-"),
     data.frame(lengths, values)) |>
  mutate(end = cumsum(lengths)) |>
  mutate(start = 1 + lag(end, 1, 0)) |>
  mutate(rng = paste(start, end, sep = "-")) |>
  filter(values) |>
  pull(rng)
[1] "5-9" "15-22" "30-38"
However, if you don't mind installing S4Vectors,
the code can be made really terse:
library(S4Vectors)
r <- Rle(strsplit(my_seq, "")[[1]] != "-")
paste(start(r), end(r), sep = "-")[runValue(r)]
[1] "5-9" "15-22" "30-38"
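For comparison, the same run-length idea can be sketched in base R alone, without dplyr or S4Vectors. This is only a minimal illustration: the helper name `non_dash_ranges` is made up here and does not come from any package.

```r
# Hypothetical helper (name chosen for illustration), base R only.
non_dash_ranges <- function(x) {
  r <- rle(strsplit(x, "")[[1]] != "-")   # alternating runs of dash / non-dash
  end <- cumsum(r$lengths)                # last position of each run
  start <- end - r$lengths + 1L           # first position of each run
  paste(start, end, sep = "-")[r$values]  # keep only the non-dash runs
}

non_dash_ranges("----?????-----?V?D????-------???IL??A?---")
#> [1] "5-9" "15-22" "30-38"
```

Computing `start` from `end` and `lengths` avoids the need for a lag function.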
CodePudding user response:
Inspired by @lovalery's great answer, a base R
solution is:
g <- gregexpr(pattern = "[^-]+", my_seq)
d <- data.frame(start = unlist(g),
                end = unlist(g) + attr(g[[1]], "match.length") - 1)
paste(d$start, d$end, sep = "-")
# [1] "5-9" "15-22" "30-38"
CodePudding user response:
A one-liner in base R with utf8ToInt:
apply(matrix(which(diff(c(FALSE, utf8ToInt(my_seq) != 45L, FALSE)) != 0) - 0:1, 2), 2, paste, collapse = "-")
#> [1] "5-9" "15-22" "30-38"
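If the one-liner is hard to follow, it can be unpacked step by step roughly as below. The intermediate names (`is_char`, `padded`, `edges`, `bounds`) are introduced only for this sketch.

```r
my_seq <- "----?????-----?V?D????-------???IL??A?---"

is_char <- utf8ToInt(my_seq) != 45L       # TRUE where the character is not "-" (code point 45)
padded  <- c(FALSE, is_char, FALSE)       # pad both ends so runs touching an edge are detected
edges   <- which(diff(padded) != 0)       # alternates: run start, one past run end, run start, ...
bounds  <- matrix(edges - 0:1, nrow = 2)  # subtract 1 from every second value; one column per run
res     <- apply(bounds, 2, paste, collapse = "-")
res
#> [1] "5-9" "15-22" "30-38"
```

Because `padded` begins and ends with `FALSE`, `edges` always has an even length, so recycling `0:1` lines up with the start/end pairs.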
CodePudding user response:
Try
paste0(gregexec('-\\?', my_seq)[[1]][1,] + 1, '-',
       gregexec('\\?-', my_seq)[[1]][1,])
#> [1] "5-9" "15-22" "30-38"