How to replace a string with another with interleaving characters in R

Time:11-11

I have the following strings:

x  <- "??????????DRHRTRHLAK??????????"
x2 <- "????????????????????TRCYHIDPHH"
x3 <- "FKDHKHIDVK????????????????????TRCYHIDPHH"
x4 <- "FKDHKHIDVK????????????????????"

What I want to do is replace all the ? characters with the characters of another string:

rep <- "ndqeegillkkkkfpssyvv"

Resulting in:

ndqeegillkDRHRTRHLAKkkkfpssyvv           # x
ndqeegillkkkkfpssyvvTRCYHIDPHH           # x2
FKDHKHIDVKndqeegillkkkkfpssyvvTRCYHIDPHH # x3
FKDHKHIDVKndqeegillkkkkfpssyvv           # x4

Basically, the characters of rep should be used in order, while the interleaving characters (such as DRHRTRHLAK in x) stay in place.

The length of rep is the same as the total number of ? characters in each string: 20.

Note that I don't want to split rep manually again as an extra step.

I tried this but failed:

> gsub(pattern = "\\?+", replacement = rep, x = x)
[1] "ndqeegillkkkkfpssyvvDRHRTRHLAKndqeegillkkkkfpssyvv"

CodePudding user response:

You can count the number of ?'s and then cut rep based on that:

x <- "??????????DRHRTRHLAK??????????"
rep <- "ndqeegillkkkkfpssyvv"

pattern <- "(\\?+)(DRHRTRHLAK)(\\?+)"
# number of ?'s in the leading run
n <- nchar(gsub(pattern, "\\1", x))

gsub(pattern, paste0(substr(rep, 1, n), "\\2", substr(rep, n + 1, nchar(rep))), x)
#[1] "ndqeegillkDRHRTRHLAKkkkfpssyvv"

Edit: new examples:

A very verbose way is to use an if-else chain that checks where the ?'s are and substitutes rep accordingly. Each branch defines the pattern that fits its case:

library(magrittr)  # provides %>%

if (grepl("^\\?.+\\?$", x)) {                 # ?'s on both ends
  pattern <- "^(\\?+)([A-Z]+)(\\?+)$"
  n <- gsub(pattern, "\\1", x) %>% nchar()
  gsub(pattern, paste0(substr(rep, 1, n), "\\2", substr(rep, n + 1, nchar(rep))), x)
} else if (grepl("^\\?", x)) {                # ?'s only at the start
  pattern <- "^(\\?+)([A-Z]+)$"
  n <- gsub(pattern, "\\1", x) %>% nchar()
  gsub(pattern, paste0(substr(rep, 1, n), "\\2"), x)
} else if (grepl("\\?$", x)) {                # ?'s only at the end
  pattern <- "^([A-Z]+)(\\?+)$"
  n <- gsub(pattern, "\\2", x) %>% nchar()
  gsub(pattern, paste0("\\1", substr(rep, 1, n)), x)
} else if (grepl("^[A-Z]+\\?+[A-Z]+$", x)) {  # ?'s only in the middle
  pattern <- "^([A-Z]+)(\\?+)([A-Z]+)$"
  n <- gsub(pattern, "\\2", x) %>% nchar()
  gsub(pattern, paste0("\\1", substr(rep, 1, n), "\\3"), x)
}
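
As a possible alternative to the if-else chain, the position checks can be avoided by working character by character: split the string, overwrite every "?" position with the next character of rep, and paste the result back together. This is a minimal sketch, not part of the answer above; the helper name fill_chars is hypothetical, and it assumes the number of ?'s equals nchar(rep), as stated in the question.

```r
# Hypothetical helper: fill every "?" in one string with the characters of rep, in order.
fill_chars <- function(x, rep) {
  chars <- strsplit(x, "")[[1]]      # one element per character of x
  is_q  <- chars == "?"              # positions to overwrite
  # Assumption from the question: rep has exactly as many characters as there are ?'s
  stopifnot(sum(is_q) == nchar(rep))
  chars[is_q] <- strsplit(rep, "")[[1]]
  paste(chars, collapse = "")
}

rep <- "ndqeegillkkkkfpssyvv"
xs <- c(
  "??????????DRHRTRHLAK??????????",
  "????????????????????TRCYHIDPHH",
  "FKDHKHIDVK????????????????????TRCYHIDPHH",
  "FKDHKHIDVK????????????????????"
)

# Apply to each string; USE.NAMES = FALSE keeps the result an unnamed character vector
res <- vapply(xs, fill_chars, character(1), rep = rep, USE.NAMES = FALSE)
res
```

Because the replacement is positional, the same function covers ?'s at the start, at the end, in the middle, or on both ends without any case distinction.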

CodePudding user response:

Split the string with substr():

x <- "??????????DRHRTRHLAK??????????"
rep <- "ndqeegillkkkkfpssyvv"
x <- gsub(pattern = "^\\?+", replacement = substr(rep, 1, 10), x = x)  # leading run of ?'s
x <- gsub(pattern = "\\?+$", replacement = substr(rep, 11, 20), x = x) # trailing run of ?'s
x
#[1] "ndqeegillkDRHRTRHLAKkkkfpssyvv"

In the regex, ^ anchors the match to the start of the string and $ anchors it to the end.
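
The split points 10 and 20 above are hard-coded for this particular x. One way to avoid that, sketched here under assumptions, is to measure the leading run of ?'s with regexpr() and cut rep at that length. The function name fill_two_runs is hypothetical, and the sketch assumes the string has either a single run of ?'s anywhere, or two runs of which the first starts the string.

```r
# Hypothetical helper: handles one run of ?'s, or a leading run plus one later run.
fill_two_runs <- function(x, rep) {
  # Length of the leading run of ?'s; regexpr() reports -1 when there is no match
  n_lead <- max(attr(regexpr("^\\?+", x), "match.length"), 0)
  # Fill the leading run (no-op when the string does not start with "?")
  out <- sub("^\\?+", substr(rep, 1, n_lead), x)
  # Fill the next remaining run with the rest of rep (no-op when none is left)
  sub("\\?+", substr(rep, n_lead + 1, nchar(rep)), out)
}

rep <- "ndqeegillkkkkfpssyvv"
r1 <- fill_two_runs("??????????DRHRTRHLAK??????????", rep)
r2 <- fill_two_runs("????????????????????TRCYHIDPHH", rep)
r3 <- fill_two_runs("FKDHKHIDVK????????????????????TRCYHIDPHH", rep)
r4 <- fill_two_runs("FKDHKHIDVK????????????????????", rep)
```

sub() is used instead of gsub() because each step is meant to replace exactly one run.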

CodePudding user response:

Example data:

x <- c(
    "??????????DRHRTRHLAK??????????",
    "????????????????????TRCYHIDPHH",
    "FKDHKHIDVK????????????????????TRCYHIDPHH"
)
rep <- "ndqeegillkkkkfpssyvv"

Fix it up with regmatches<- replacements in a vectorised fashion:

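gr <- gregexpr("\\?+", x)
# cumulative end position in rep for each run of ?'s
csml <- lapply(gr, \(x) cumsum(attr(x, "match.length")) )
# each run starts one character after the previous run ended
regmatches(x, gr) <- lapply(csml, \(x) substring(rep, c(1, head(x, -1) + 1), x)  )
x
##[1] "ndqeegillkDRHRTRHLAKkkkfpssyvv"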
##[2] "ndqeegillkkkkfpssyvvTRCYHIDPHH"          
##[3] "FKDHKHIDVKndqeegillkkkkfpssyvvTRCYHIDPHH"
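
For reuse, the regmatches<- approach can be wrapped in a function and checked against all four strings from the question, including x4. This is a sketch only: the name fill_runs is hypothetical, and it assumes every element of x contains at least one run of ?'s whose total length matches nchar(rep).

```r
# Hypothetical wrapper around gregexpr() and `regmatches<-`.
fill_runs <- function(x, rep) {
  gr <- gregexpr("\\?+", x)                  # all runs of ?'s, per element of x
  regmatches(x, gr) <- lapply(gr, function(m) {
    len  <- attr(m, "match.length")          # length of each run
    ends <- cumsum(len)                      # where each run's slice of rep ends
    substring(rep, ends - len + 1, ends)     # consecutive, non-overlapping slices of rep
  })
  x
}

rep <- "ndqeegillkkkkfpssyvv"
xs <- c(
  "??????????DRHRTRHLAK??????????",
  "????????????????????TRCYHIDPHH",
  "FKDHKHIDVK????????????????????TRCYHIDPHH",
  "FKDHKHIDVK????????????????????"
)
out <- fill_runs(xs, rep)
out
```

Computing each slice's start as ends - len + 1 keeps the slices from overlapping, so no character of rep is used twice.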