How to detect range of positions of specific set of characters in a string

Time:02-19

I have the following sequence:

my_seq <- "----?????-----?V?D????-------???IL??A?---"

What I want to do is detect the ranges of positions occupied by non-dash characters.

----?????-----?V?D????-------???IL??A?---
|   |   |     |      |       |       |  
1   5   9    15     22      30      38

The final output will be a vector of strings:

out <- c("5-9", "15-22", "30-38")

How can I achieve that with R?

CodePudding user response:

Here is another possible solution, using the stringr package:

Reprex

  • Code
library(stringr)

s <- as.data.frame(str_locate_all(my_seq, "[^-]+")[[1]])
result <- paste(s$start, s$end, sep ="-")
  • Output
result
#> [1] "5-9"   "15-22" "30-38"

Created on 2022-02-18 by the reprex package (v2.0.1)

CodePudding user response:

You could do:

my_seq <- "----?????-----?V?D????-------???IL??A?---"

non_dash <- which(strsplit(my_seq, "")[[1]] != '-')
pos      <- non_dash[c(0, diff(non_dash)) != 1 | c(diff(non_dash), 0) != 1]

apply(matrix(pos, ncol = 2, byrow = TRUE), 1, function(x) paste(x, collapse = "-"))
#> [1] "5-9"   "15-22" "30-38"

Created on 2022-02-18 by the reprex package (v2.0.1)

CodePudding user response:

Here is a tidyverse approach based on rle():

library(dplyr)
with(rle(strsplit(my_seq, "")[[1]] != "-"),
     data.frame(lengths, values)) |>
  mutate(end = cumsum(lengths)) |>
  mutate(start = 1 + lag(end, 1, 0)) |>
  mutate(rng = paste(start, end, sep = "-")) |>
  filter(values) |>
  pull(rng)

[1] "5-9"   "15-22" "30-38"

However, if you don't mind installing S4Vectors, the code can be made really terse:

library(S4Vectors)

r <- Rle(strsplit(my_seq, "")[[1]] != "-")

paste(start(r), end(r), sep = "-")[runValue(r)]

[1] "5-9"   "15-22" "30-38"

CodePudding user response:

Inspired from @lovalery's great answer, a base R solution is:

g <- gregexpr(pattern = "[^-]+", my_seq)
d <- data.frame(start = unlist(g),
                end = unlist(g) + attr(g[[1]], "match.length") - 1)
paste(d$start, d$end, sep = "-")
# [1] "5-9"   "15-22" "30-38"

CodePudding user response:

A one-liner in base R with utf8ToInt

apply(matrix(which(diff(c(FALSE, utf8ToInt(my_seq) != 45L, FALSE)) != 0) - 0:1, 2), 2, paste, collapse = "-")
#> [1] "5-9"   "15-22" "30-38"
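
For readers who find the one-liner hard to follow, here is a rough unrolled sketch of the same idea. The variable names (is_non_dash, changes, starts, ends, ranges) are made up for illustration and are not part of the answer above; it also splits the transitions into starts and ends by sign instead of using the matrix trick.

```r
my_seq <- "----?????-----?V?D????-------???IL??A?---"

# TRUE where the character is not a dash (45L is the code point of "-")
is_non_dash <- utf8ToInt(my_seq) != 45L

# Pad with FALSE on both sides so a run touching either end of the
# string still produces a transition
changes <- diff(c(FALSE, is_non_dash, FALSE))

starts <- which(changes == 1)        # FALSE -> TRUE: first position of a run
ends   <- which(changes == -1) - 1   # TRUE -> FALSE: one past the run's last position

ranges <- paste(starts, ends, sep = "-")
ranges
#> [1] "5-9"   "15-22" "30-38"
```

Because of the padding, starts and ends always have the same length, so they pair up one-to-one.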

CodePudding user response:

Try

paste0(gregexec('-\\?', my_seq)[[1]][1,] + 1, '-',
       gregexec('\\?-', my_seq)[[1]][1,])
#> [1] "5-9"   "15-22" "30-38"