I have a character vector that looks like this:
my_vec
[1] "072:g__Caulobacter"
[2] "073:g__Romboutsia"
[3] "074:g__Blastocatella"
[4] "076:c__Deltaproteobacteria"
[5] "077:g__Tatumella"
[6] "078:g__Fretibacterium"
I want to cut the prefix so the result is the following:
[1] "Caulobacter"
[2] "Romboutsia"
[3] "Blastocatella"
[4] "Deltaproteobacteria"
[5] "Tatumella"
[6] "Fretibacterium"
I think a regular expression is the way to do this, but I'm not familiar with how to write one. The common pattern is the double underscore (__).
CodePudding user response:
You can use word() from stringr:
library(stringr)
word(my_vec, 2, sep = "__")
#[1] "Caulobacter" "Romboutsia" "Blastocatella"
#[4] "Deltaproteobacteria" "Tatumella" "Fretibacterium"
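If loading stringr is not an option, the same split-and-pick idea can be sketched in base R. This is only an illustration of the approach, not part of the stringr call above; the names parts and result are made up for the example.

```r
my_vec <- c("072:g__Caulobacter", "073:g__Romboutsia", "074:g__Blastocatella",
            "076:c__Deltaproteobacteria", "077:g__Tatumella", "078:g__Fretibacterium")

# Split each string at the literal "__" (fixed = TRUE turns off regex interpretation)
parts <- strsplit(my_vec, "__", fixed = TRUE)

# Keep the second piece of each split, i.e. the text after the underscores
result <- vapply(parts, function(p) p[2], character(1))
print(result)
```

This assumes every element contains "__"; an element without it would yield NA here.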
Another option is substring(). Here regexpr() gives the position of the first __ in each string, so adding 2 to it gives the position of the first letter after the underscores, and nchar() gives the end of the string:
substring(my_vec, regexpr("__", my_vec) + 2, nchar(my_vec))
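To see why the offset is 2, it may help to look at what regexpr() returns for this data. A small sketch, with pos and first_letter as names chosen just for the example:

```r
my_vec <- c("072:g__Caulobacter", "073:g__Romboutsia", "074:g__Blastocatella",
            "076:c__Deltaproteobacteria", "077:g__Tatumella", "078:g__Fretibacterium")

# regexpr() returns the 1-based start of the first match in each string;
# as.integer() drops the match.length attributes for a cleaner print
pos <- as.integer(regexpr("__", my_vec))
print(pos)  # 6 for every element here, since each prefix is five characters long

# Adding 2 skips both underscores and lands on the first letter of the name
first_letter <- substr(my_vec, pos + 2, pos + 2)
print(first_letter)
```

Note that regexpr() returns -1 when there is no match, so a string without __ would need separate handling.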
Data
my_vec <- c("072:g__Caulobacter", "073:g__Romboutsia", "074:g__Blastocatella",
"076:c__Deltaproteobacteria", "077:g__Tatumella", "078:g__Fretibacterium")
CodePudding user response:
Does this work?
gsub('(\\d+:[a-z]__)(.*)', '\\2', my_vec)
[1] "Caulobacter" "Romboutsia" "Blastocatella" "Deltaproteobacteria" "Tatumella"
[6] "Fretibacterium"
CodePudding user response:
Another base R solution without needing a capture group is
my_vec <- c(
"072:g__Caulobacter",
"073:g__Romboutsia",
"074:g__Blastocatella",
"076:c__Deltaproteobacteria",
"077:g__Tatumella",
"078:g__Fretibacterium")
gsub("^.+__", "", my_vec)
#[1] "Caulobacter" "Romboutsia" "Blastocatella"
#[4] "Deltaproteobacteria" "Tatumella" "Fretibacterium"
Explanation: "^.+__" matches, from the start of each string ("^"), any substring of one or more characters (".+") followed by a double underscore ("__"), and replaces the match with an empty string ("").
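One thing to keep in mind is that the quantifier in this pattern is greedy, so it runs up to the last double underscore in a string. That makes no difference for the data in the question, but a rough sketch with a made-up input (the string "072:g__Foo__bar" is hypothetical, not from the question) shows how a lazy quantifier would behave differently:

```r
# Hypothetical input with two double underscores, for illustration only
x <- "072:g__Foo__bar"

# Greedy: .+ consumes as much as possible, so everything up to the LAST "__" is removed
greedy <- sub("^.+__", "", x)

# Lazy: .+? consumes as little as possible, so only the prefix up to the FIRST "__" is removed
lazy <- sub("^.+?__", "", x, perl = TRUE)

print(greedy)
print(lazy)
```

Since the pattern is anchored with ^ and can match at most once per string, sub() and gsub() give the same result here.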
CodePudding user response:
I don't know R, but here would be a pure regex solution using two capturing groups: