I am trying to get:
- the start and end positions of each run of "-" characters in column V1
- the corresponding characters at those positions in column V2
- the length of each run
Any help will be appreciated!
ip <- structure(list(V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz")), class = "data.frame", row.names = c(NA,
-3L))
I tried stringi_locate, but it reports each position individually. For example, for "ab---cdef" it outputs 3-3, 4-4, 5-5 instead of 3-5.
Expected output:
op <- structure(list(V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz"), output = c("x:x-3:5-3",
"x:y-5:7-3", "x:x-2:3-2; y:z-6:7-2")), class = "data.frame", row.names = c(NA,
-3L))
The output column must have:
- the characters in the V2 column at the start and end of each run of "-" in V1
- then the start and end positions
- then the length of the run
V1         V2         output
ab---cdef  xxxxxxxyy  x:x-3:5-3
Thanks!
CodePudding user response:
Here's an example using gregexpr to get all the matches in a string.
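# "-+" matches a whole run of dashes, not one dash at a time
x <- gregexpr("-+", ip$V1)
mapply(function(m, s, r) {
  start <- as.integer(m)            # start position of each run
  len <- attr(m, "match.length")    # length of each run
  end <- start + len - 1
  # characters of V2 (r) at the same positions as each run
  part <- mapply(substr, r, start, end)
  paste0(part, "-", start, ":", end, "-", len, collapse=";")
}, x, ip$V1, ip$V2)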
# [1] "xxx-3:5-3"
# [2] "xyy-5:7-3"
# [3] "xx-2:3-2;yz-6:7-2"
I'm not sure what your logic was for turning xxx into x:x or xyy into x:y, or how that generalizes to other sequences, so feel free to change that part. But you can get the start and length of the matches using the attributes of the returned match object. It's just important to use -+ as the pattern so you match a run of dashes rather than just a single dash.
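If the intended rule is "first and last V2 character of each run, joined by a colon" (that is only a guess from the expected output, not something the question states), a minimal sketch could look like the following. The helper name dash_runs is made up for illustration, and the "; " separator between runs is likewise an assumption.

```r
ip <- data.frame(
  V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
  V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise every run of dashes in v1 against v2.
dash_runs <- function(v1, v2) {
  m <- gregexpr("-+", v1)[[1]]
  if (m[1] == -1) return(NA_character_)  # no dashes in this string
  start <- as.integer(m)
  len <- attr(m, "match.length")
  end <- start + len - 1L
  # Assumed rule: first and last V2 character of each run, joined by ":"
  first <- substring(v2, start, start)
  last <- substring(v2, end, end)
  paste0(first, ":", last, "-", start, ":", end, "-", len, collapse = "; ")
}

# mapply names its result after the first argument, so drop the names
ip$output <- unname(mapply(dash_runs, ip$V1, ip$V2))
ip$output
```

For the sample data this gives "x:x-3:5-3", "x:y-5:7-3" and "x:x-2:3-2; y:z-6:7-2". Strings without any dashes come back as NA here; swap that for an empty string if it suits your data better.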