How to replace the wild card characters with sampled characters in R

Time:10-23

I have the following sequence:

s0 <- "KDRH?THLA???RT?HLAK"

The wildcard character is indicated by ?. I want to replace each wildcard with a character sampled from this vector:

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

Since s0 has 5 wildcards (?), I would sample 5 characters from AADict:

set.seed(1)
nof_wildcard <- 5
tolower(sample(AADict, nof_wildcard, TRUE))

Which gives [1] "d" "q" "a" "r" "l"

Hence the expected result is:

     KDRH?THLA???RT?HLAK
     KDRHdTHLAqarRTlHLAK

So each sampled character must be placed exactly at the position of a ?, but the order of the sampled characters is not important. For example, this answer is also acceptable: KDRHqTHLAdlaRTrHLAK.

How can I achieve that with R?

Other examples are:

s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"

CodePudding user response:

One approach is to replace the "?" characters 'one at a time' using a loop, e.g.

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
s0
#> [1] "KDRH?THLA???RT?HLAK"
repeat {
  s0 <- sub("\\?", sample(tolower(AADict), 1), s0)
  if (!grepl("\\?", s0)) break
}
s0
#> [1] "KDRHtTHLAidwRTyHLAK"

s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
repeat {
  s1 <- sub("\\?", sample(tolower(AADict), 1), s1)
  if (!grepl("\\?", s1)) break
}
s1
#> [1] "FKDHKHIDVKDRHRTHLAKrstaRTRHLAK"

s2 <- "FKHIDVKDRHRTRHLAK??????????"
repeat {
  s2 <- sub("\\?", sample(tolower(AADict), 1), s2)
  if (!grepl("\\?", s2)) break
}
s2
#> [1] "FKHIDVKDRHRTRHLAKdvcfmheiqn"
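
The loop above can also be wrapped in a small helper so it does not have to be repeated for each string. This is only a sketch: the name fill_wildcards_loop is made up for illustration, and it uses while() with fixed = TRUE (so "?" is matched literally) instead of the repeat/regex form shown above.

```r
# Hypothetical helper (not part of the original answer): replace every "?"
# in a single string with one lowercase letter drawn from `dict`.
fill_wildcards_loop <- function(s, dict) {
  dict <- tolower(dict)
  # fixed = TRUE matches "?" literally, so no regex escaping is needed.
  # The condition is checked before the first replacement, so a string
  # without wildcards is returned unchanged.
  while (grepl("?", s, fixed = TRUE)) {
    s <- sub("?", sample(dict, 1), s, fixed = TRUE)
  }
  s
}

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

out <- fill_wildcards_loop("KDRH?THLA???RT?HLAK", AADict)
nchar(out)                     # 19, same length as the input
grepl("?", out, fixed = TRUE)  # FALSE, no wildcards remain
```

Because each call to sample() draws a single letter independently, this variant always samples with replacement.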

Another approach, which also allows sampling without replacement (this requires that the string has no more wildcards than AADict has entries):

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
matches <- gregexpr("\\?", s0)
regmatches(s0, matches) <- lapply(lengths(matches), sample, x = tolower(AADict), replace = FALSE)
s0
#> [1] "KDRHdTHLAlanRTiHLAK"
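
As a sketch, the regmatches() idea can be extended to a character vector with a switch between sampling with and without replacement. The function name fill_wildcards_rm and its replace argument are assumptions made for this illustration; the wildcard count per string is taken from lengths(regmatches(x, m)).

```r
# Hypothetical vectorized variant of the regmatches() approach.
fill_wildcards_rm <- function(x, dict, replace = TRUE) {
  dict <- tolower(dict)
  m <- gregexpr("\\?", x)
  # Number of "?" in each string
  n <- lengths(regmatches(x, m))
  # sample() cannot draw more values than `dict` holds when replace = FALSE
  if (!replace && any(n > length(dict))) {
    stop("more wildcards than dictionary entries; use replace = TRUE")
  }
  # One vector of replacement letters per input string
  regmatches(x, m) <- lapply(n, function(k) sample(dict, k, replace = replace))
  x
}

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
s0 <- "KDRH?THLA???RT?HLAK"
s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"

res <- fill_wildcards_rm(c(s0, s1, s2), AADict, replace = FALSE)
res
```

With replace = FALSE the letters are unique within each string, but the same letter can still appear in different strings because each string is sampled separately.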

Created on 2022-10-22 by the reprex package (v2.0.1)

CodePudding user response:

You could split the string into single characters, which makes it easy to replace the wildcards without a loop (a loop was my first approach):

replace_wc <- function(x, dict) {
  x <- strsplit(x, split = "")[[1]]
  ix <- grepl("\\?", x)
  x[ix] <- sample(dict, sum(ix), replace = TRUE)

  return(paste0(x, collapse = ""))
}

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c(
  "A", "R", "N", "D", "C", "E", "Q", "G", "H",
  "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)

set.seed(1)

replace_wc(s0, tolower(AADict))
#> [1] "KDRHdTHLAqarRTlHLAK"

CodePudding user response:

Here is a vectorized function to replace the "?" characters in a vector of strings.

fun <- function(x, dict = AADict) {
  dict <- tolower(dict)
  inx <- gregexpr("\\?", x)
  # \(j) is the anonymous-function shorthand available from R 4.1.0
  sapply(seq_along(x), \(j) {
    for(i in inx[[j]]) {
      substr(x[j], i, i) <- sample(dict, 1L)
    }
    x[j]
  })
}

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

s0 <- "KDRH?THLA???RT?HLAK"
s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"

fun(s0)
#> [1] "KDRHsTHLAwppRTwHLAK"

fun(s1)
#> [1] "FKDHKHIDVKDRHRTHLAKyfqfRTRHLAK"

fun(s2)
#> [1] "FKHIDVKDRHRTRHLAKnsfehqwmkv"

fun(c(s0, s1, s2))
#> [1] "KDRHiTHLAdssRTgHLAK"            "FKDHKHIDVKDRHRTHLAKcdivRTRHLAK"
#> [3] "FKHIDVKDRHRTRHLAKfrpafwpnif"

Created on 2022-10-22 with reprex v2.0.2
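
Since the outputs are random, it can help to check that a result meets the requirement from the question: every original character stays in place and each ? is replaced by a lowercase letter from AADict. The following is a minimal sketch of such a check; check_fill is a hypothetical name, not something used in the answers above.

```r
# Hypothetical checker: TRUE when `filled` differs from `template` only at
# the "?" positions and every replacement is a lowercase letter from `dict`.
check_fill <- function(template, filled, dict) {
  if (nchar(template) != nchar(filled)) return(FALSE)
  tc <- strsplit(template, "")[[1]]
  fc <- strsplit(filled, "")[[1]]
  wild <- tc == "?"
  all(tc[!wild] == fc[!wild]) && all(fc[wild] %in% tolower(dict))
}

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

check_fill("KDRH?THLA???RT?HLAK", "KDRHdTHLAqarRTlHLAK", AADict)  # TRUE
check_fill("KDRH?THLA???RT?HLAK", "KDRHqTHLAdlaRTrHLAK", AADict)  # TRUE
check_fill("KDRH?THLA???RT?HLAK", "KDRHdTHLAqarRTlHLAX", AADict)  # FALSE, last letter changed
```

The first two calls use the two acceptable results given in the question; the third changes a fixed position and is therefore rejected.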
