Easy way to extract uppercase letters from a string in R

I am a beginner programmer in R.

I have the string "cCt/cGt" and I want to extract the uppercase letters C and G and write them as C>G.

library(stringr)

test <- "cCt/cGt"
str_extract(test, "[A-Z] $")

User response:

Try this:

gsub(".*([A-Z]).*([A-Z]).*", "\\1>\\2", test)
[1] "C>G"

Here, the two uppercase letters are captured in capturing groups, written in parentheses (...). Because the surrounding .* parts make the pattern match the entire string, gsub replaces the whole string, and the replacement can refer to just the captured letters through the backreferences \\1 and \\2. The replacement also includes the desired >.
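
To illustrate the explanation above, here is a minimal sketch of how the same pattern behaves on a few inputs. It assumes base R only; the extra input strings are made up for illustration and are not from the question.

```r
test <- "cCt/cGt"
pattern <- ".*([A-Z]).*([A-Z]).*"

# The pattern spans the whole string, so only the two captured letters survive.
r1 <- sub(pattern, "\\1>\\2", test)
r1
# [1] "C>G"

# With fewer than two uppercase letters nothing matches,
# and the input comes back unchanged.
r2 <- sub(pattern, "\\1>\\2", "cct/cGt")
r2
# [1] "cct/cGt"

# The leading .* is greedy, so with three uppercase letters
# the last two are the ones captured.
r3 <- sub(pattern, "\\1>\\2", "ACt/cGt")
r3
# [1] "C>G"
```

Since an unmatched input is returned as-is, it may be worth checking the result before using it further.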

User response:

You seem to be looking for a mutation between two strings joined by a slash. This function should solve your problem:

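extract_mutation <- function(text) {
  # Split the input at "/" into its two parts
  splitted <- strsplit(text, split = "/")[[1]]
  # Locate the first uppercase letter in each part
  pos <- regexpr("[[:upper:]]", splitted)
  uppercases <- regmatches(splitted, pos)
  # Join the extracted letters with ">"
  mutation <- paste0(uppercases, collapse = ">")
  return(mutation)
}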

Note that regexpr finds only the first uppercase letter in each part. If you expect more than one uppercase letter per part, you should use gregexpr instead of regexpr.
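
The gregexpr variant mentioned above could look roughly like this. It is only a sketch: the function name extract_all_uppercase is hypothetical, and joining several letters within one part without a separator is an assumption about the desired output.

```r
# Hypothetical helper: collects every uppercase letter on each side of "/".
extract_all_uppercase <- function(text) {
  parts <- strsplit(text, split = "/", fixed = TRUE)[[1]]
  # gregexpr() finds all matches, so regmatches() returns a list
  # with one character vector per part.
  matches <- regmatches(parts, gregexpr("[[:upper:]]", parts))
  # Collapse the letters of each part into one string (assumed format).
  per_part <- vapply(matches, paste0, character(1), collapse = "")
  paste0(per_part, collapse = ">")
}

extract_all_uppercase("cCt/cGt")
# [1] "C>G"
extract_all_uppercase("CCt/cGG")
# [1] "CC>GG"
```

Because regmatches returns a list here rather than a plain character vector, each element has to be collapsed before the final paste0.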

User response:

You could also capture the two uppercase characters directly, allowing optional lowercase characters on either side of the / between them.

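library(stringr)

test <- "cCt/cGt"
# str_match() returns a matrix: column 1 is the full match, columns 2 and 3 are the groups
res <- str_match(test, "([A-Z])[a-z]*/[a-z]*([A-Z])")
sprintf("%s>%s", res[, 2], res[, 3])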

Output

[1] "C>G"

An exact match for the whole string could be:

^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$
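
As a rough sketch, the anchored pattern above could be applied with base R as follows. The function name to_mutation is hypothetical, and returning NA for inputs that do not fit the exact shape is an assumption, not something the question specifies.

```r
# Hypothetical helper built around the exact-match pattern.
to_mutation <- function(text) {
  pattern <- "^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$"
  # regexec() keeps the capture groups; on a match, regmatches() yields
  # c(full match, group 1, group 2), otherwise an empty vector.
  m <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(m) == 3) sprintf("%s>%s", m[2], m[3]) else NA_character_
}

to_mutation("cCt/cGt")
# [1] "C>G"
to_mutation("cCt/cGtt")
# [1] NA
```

The anchors ^ and $ make the pattern reject anything that is not exactly three letters, a slash, and three letters, which can help catch malformed input early.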