I have the following sequence:
s0 <- "KDRH?THLA???RT?HLAK"
The wildcard character there is indicated by ?.
What I want to do is replace each wildcard with a character sampled from this vector:
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
Since s0 has 5 wildcards (?), I would sample from AADict:
set.seed(1)
nof_wildcard <- 5
tolower(sample(AADict, nof_wildcard, TRUE))
Which gives [1] "d" "q" "a" "r" "l"
Hence the expected result is:
KDRH?THLA???RT?HLAK
KDRHdTHLAqarRTlHLAK
So the sampled characters must be placed exactly at the positions of the ? wildcards, but the order of the sampled characters is not important,
e.g. this answer is also acceptable: KDRHqTHLAdlaRTrHLAK.
How can I achieve that with R?
Other examples are:
s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"
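The count hard-coded as nof_wildcard above could also be derived from the string itself. A minimal sketch, assuming base R only; count_wildcards is a hypothetical helper name that does not appear in the original post:

```r
s0 <- "KDRH?THLA???RT?HLAK"
s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"

# Hypothetical helper: number of "?" wildcards in each element of x.
count_wildcards <- function(x) {
  # fixed = TRUE treats "?" as a literal character, not a regex quantifier
  m <- gregexpr("?", x, fixed = TRUE)
  # gregexpr() reports -1 when nothing matches, so count positive positions only
  vapply(m, function(pos) sum(pos > 0L), integer(1))
}

count_wildcards(c(s0, s1, s2))
#> [1]  5  4 10
```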
CodePudding user response:
One approach is to replace the "?" characters 'one at a time' using a loop, e.g.
s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
s0
#> [1] "KDRH?THLA???RT?HLAK"
repeat {
  s0 <- sub("\\?", sample(tolower(AADict), 1), s0)
  if (!grepl("\\?", s0)) break
}
s0
#> [1] "KDRHtTHLAidwRTyHLAK"
s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
repeat {
  s1 <- sub("\\?", sample(tolower(AADict), 1), s1)
  if (!grepl("\\?", s1)) break
}
s1
#> [1] "FKDHKHIDVKDRHRTHLAKrstaRTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"
repeat {
  s2 <- sub("\\?", sample(tolower(AADict), 1), s2)
  if (!grepl("\\?", s2)) break
}
s2
#> [1] "FKHIDVKDRHRTRHLAKdvcfmheiqn"
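Since the same loop is written out once per string, it could be wrapped in a small helper. This is only a sketch of the idea; fill_loop is a hypothetical name, and it uses a while loop with fixed = TRUE in place of the escaped regex:

```r
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical helper: replace the first remaining "?" until none are left.
fill_loop <- function(s, dict) {
  while (grepl("?", s, fixed = TRUE)) {
    # sub() only replaces the first match, so each "?" gets its own draw
    s <- sub("?", sample(dict, 1), s, fixed = TRUE)
  }
  s
}

s0 <- "KDRH?THLA???RT?HLAK"
s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"

set.seed(1)
# apply the helper to each string; the result is an unnamed character vector
filled <- vapply(c(s0, s1, s2), fill_loop, character(1),
                 dict = tolower(AADict), USE.NAMES = FALSE)
filled
```

A string without any "?" is returned unchanged, because the while condition is false on the first check.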
Another approach which can also allow for sampling without replacement:
s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
matches <- gregexpr("\\?", s0)
regmatches(s0, matches) <- lapply(lengths(matches), sample, x = tolower(AADict), replace = FALSE)
s0
#> [1] "KDRHdTHLAlanRTiHLAK"
Created on 2022-10-22 by the reprex package (v2.0.1)
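One caveat with replace = FALSE: sample() raises an error when asked for more items than the dictionary contains, so a string with more than 20 wildcards would fail. A hedged sketch of a wrapper with an explicit check; fill_unique is a hypothetical name, and the match count is taken from the positive match positions:

```r
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical wrapper: fill each string's wildcards with distinct letters.
fill_unique <- function(x, dict = tolower(AADict)) {
  matches <- gregexpr("?", x, fixed = TRUE)
  # number of real matches per string (gregexpr() gives -1 for "no match")
  n <- vapply(matches, function(pos) sum(pos > 0L), integer(1))
  # sampling without replacement cannot exceed the dictionary size
  if (any(n > length(dict))) {
    stop("a string has more wildcards than there are dictionary entries")
  }
  regmatches(x, matches) <- lapply(n, function(k) sample(dict, k, replace = FALSE))
  x
}

set.seed(1)
out <- fill_unique(c("KDRH?THLA???RT?HLAK", "FKHIDVKDRHRTRHLAK??????????"))
out
```

Letters are distinct within each string only; the same letter may still appear in two different strings.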
CodePudding user response:
You could split the string into single characters, which makes it easy to replace the wildcards without needing a loop (which was my first approach):
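replace_wc <- function(x, dict) {
  # split the string into single characters
  x <- strsplit(x, split = "")[[1]]
  # logical index of the wildcard positions
  ix <- grepl("\\?", x)
  # draw one replacement per wildcard, with replacement
  x[ix] <- sample(dict, sum(ix), replace = TRUE)
  return(paste0(x, collapse = ""))
}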
s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
set.seed(1)
replace_wc(s0, tolower(AADict))
#> [1] "KDRHdTHLAqarRTlHLAK"
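Whichever approach is used, the requirement from the question (sampled letters only at the ? positions, everything else untouched) can be checked mechanically. A minimal sketch; is_valid_fill is a hypothetical helper, shown here on the strings given in the question:

```r
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical checker: TRUE if `filled` is a valid fill of `original`.
is_valid_fill <- function(original, filled, dict) {
  o <- strsplit(original, "")[[1]]
  f <- strsplit(filled, "")[[1]]
  if (length(o) != length(f)) return(FALSE)
  wild <- o == "?"
  # fixed positions must be unchanged; wildcard positions must hold dictionary letters
  all(o[!wild] == f[!wild]) && all(f[wild] %in% dict)
}

is_valid_fill("KDRH?THLA???RT?HLAK", "KDRHdTHLAqarRTlHLAK", tolower(AADict))
#> [1] TRUE
is_valid_fill("KDRH?THLA???RT?HLAK", "KDRHqTHLAdlaRTrHLAK", tolower(AADict))
#> [1] TRUE
# a fill that moves a fixed character is rejected
is_valid_fill("KDRH?THLA???RT?HLAK", "KDRdHTHLAqarRTlHLAK", tolower(AADict))
#> [1] FALSE
```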
CodePudding user response:
Here is a vectorized function to replace the "?" characters in a vector of strings.
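fun <- function(x, dict = AADict) {
  dict <- tolower(dict)
  # positions of every "?" in each string
  inx <- gregexpr("\\?", x)
  # \(j) is the anonymous-function shorthand (R >= 4.1)
  sapply(seq_along(x), \(j) {
    for(i in inx[[j]]) {
      # overwrite the i-th character with one sampled letter
      substr(x[j], i, i) <- sample(dict, 1L)
    }
    x[j]
  })
}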
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
s0 <- "KDRH?THLA???RT?HLAK"
s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"
fun(s0)
#> [1] "KDRHsTHLAwppRTwHLAK"
fun(s1)
#> [1] "FKDHKHIDVKDRHRTHLAKyfqfRTRHLAK"
fun(s2)
#> [1] "FKHIDVKDRHRTRHLAKnsfehqwmkv"
fun(c(s0, s1, s2))
#> [1] "KDRHiTHLAdssRTgHLAK" "FKDHKHIDVKDRHRTHLAKcdivRTRHLAK"
#> [3] "FKHIDVKDRHRTRHLAKfrpafwpnif"
Created on 2022-10-22 with reprex v2.0.2