start and end positions of a character; R

Time:10-27

I am trying to get

  1. the start and end positions of each run of "-" characters in column V1
  2. the corresponding characters at those positions in column V2
  3. the length of each run

Any help will be appreciated!

ip <- structure(list(V1 = c("ab---cdef", "abcd---ef", "a--bc--def"), 
    V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz")), class = "data.frame", row.names = c(NA, 
-3L))

I tried stringi_locate, but it reports each position individually. For example, for "ab---cdef" it outputs 3-3, 4-4, 5-5 instead of 3-5.

Expected output:

op <- structure(list(V1 = c("ab---cdef", "abcd---ef", "a--bc--def"), 
    V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz"), output = c("x:x-3:5-3", 
    "x:y-5:7-3", "x:x-2:3-2; y-z:6:7-2")), class = "data.frame", row.names = c(NA, 
-3L))

the output column must have

  1. The characters in V2 column with respect to start and end of "-" in V1
  2. Then start and end position
  3. Then its length

V1           V2           output
ab---cdef    xxxxxxxyy    x:x-3:5-3

Thanks!

CodePudding user response:

Here's an example using gregexpr to get all the matches in a string.

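# "-+" matches a whole run of dashes, so each run is reported once
x <- gregexpr("-+", ip$V1)
mapply(function(m, r) {
  start <- m                         # start position of each run
  len <- attr(m, "match.length")     # length of each run
  end <- start + len - 1             # end position of each run
  # characters of V2 that sit under each run of dashes
  part <- mapply(substr, r, start, end)
  paste0(part, "-", start, ":", end, "-", len, collapse=";")
}, x, ip$V2)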
# [1] "xxx-3:5-3"         
# [2] "xyy-5:7-3"        
# [3] "xx-2:3-2;yz-6:7-2"

I'm not sure what your logic was for turning xxx into x:x or xyy into x:y, or how that generalizes to other sequences, so feel free to change that part. But you can get the start and length of each match from the attributes of the returned match object. It's just important to use -+ as the pattern so you match a run of dashes rather than just a single dash.
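
If the x:x part is meant to be the V2 character at the start of the run and the V2 character at the end of the run (that is my guess from the first two expected rows, not something the question states), a minimal base R sketch could look like the following. The helper name describe_runs is made up for illustration, and the "; " separator is copied from the expected output. Note that the third expected row in the question reads "y-z:6:7-2", which I am assuming is a typo for "y:z-6:7-2".

```r
# Sample data from the question.
ip <- data.frame(
  V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
  V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz"),
  stringsAsFactors = FALSE
)

# describe_runs() is a hypothetical helper, not part of any package.
# Assumed format per run: <V2 char at start>:<V2 char at end>-<start>:<end>-<length>
describe_runs <- function(v1, v2) {
  m <- gregexpr("-+", v1)[[1]]           # one match per run of dashes
  if (m[1] == -1) return(NA_character_)  # no dashes in this string
  start <- as.integer(m)
  len <- attr(m, "match.length")
  end <- start + len - 1
  first <- substring(v2, start, start)   # V2 character at each run's start
  last <- substring(v2, end, end)        # V2 character at each run's end
  paste0(first, ":", last, "-", start, ":", end, "-", len, collapse = "; ")
}

ip$output <- unname(mapply(describe_runs, ip$V1, ip$V2))
ip$output
# [1] "x:x-3:5-3"
# [2] "x:y-5:7-3"
# [3] "x:x-2:3-2; y:z-6:7-2"
```

Rows without any dash come back as NA here; swap that for an empty string if that suits your data better.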
