How to delete everything up to and including a double underscore "__"

I have a character vector that looks like this:

my_vec
                                       
 [1] "072:g__Caulobacter"                                      
 [2] "073:g__Romboutsia"                                       
 [3] "074:g__Blastocatella"                                    
 [4] "076:c__Deltaproteobacteria"                              
 [5] "077:g__Tatumella"                                        
 [6] "078:g__Fretibacterium"

I want to cut the prefix so the result is the following:

 [1] "Caulobacter"                                      
 [2] "Romboutsia"                                       
 [3] "Blastocatella"                                    
 [4] "Deltaproteobacteria"                              
 [5] "Tatumella"                                        
 [6] "Fretibacterium"

I think regular expressions are the way to do this, but I'm not familiar with them. The common pattern is the double underscore __.

CodePudding user response：

You can use word from stringr:

library(stringr)

word(my_vec, 2, sep = "__")

#[1] "Caulobacter"         "Romboutsia"          "Blastocatella"      
#[4] "Deltaproteobacteria" "Tatumella"           "Fretibacterium"

Another option is to use substring together with regexpr. regexpr gives the position of the first underscore of __ in each string, so adding 2 gives the position of the first letter after the underscores. substring then extracts from that position to the end of the string, which nchar provides.

substring(my_vec, regexpr("__", my_vec) + 2, nchar(my_vec))
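To see where the offset of 2 comes from, it can help to look at the intermediate values. The following is only a sketch on the same my_vec; the names pos and result are introduced here for illustration.

```r
my_vec <- c("072:g__Caulobacter", "073:g__Romboutsia", "074:g__Blastocatella",
            "076:c__Deltaproteobacteria", "077:g__Tatumella", "078:g__Fretibacterium")

# regexpr() returns the 1-based position of the first match of "__" in each
# string; as.integer() drops the match attributes so only positions remain
pos <- as.integer(regexpr("__", my_vec))
pos
#[1] 6 6 6 6 6 6

# "__" is two characters long, so the part to keep starts at pos + 2
result <- substring(my_vec, pos + 2, nchar(my_vec))
result
#[1] "Caulobacter"         "Romboutsia"          "Blastocatella"      
#[4] "Deltaproteobacteria" "Tatumella"           "Fretibacterium"
```

Note that regexpr returns -1 when there is no match, so a string without __ would start at position 1 and come back unchanged.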

Data

my_vec <- c("072:g__Caulobacter", "073:g__Romboutsia", "074:g__Blastocatella", 
"076:c__Deltaproteobacteria", "077:g__Tatumella", "078:g__Fretibacterium")

CodePudding user response：

Does this work:

gsub('(\\d+:[a-z]__)(.*)','\\2', my_vec)
[1] "Caulobacter"         "Romboutsia"          "Blastocatella"       "Deltaproteobacteria" "Tatumella"          
[6] "Fretibacterium"

CodePudding user response：

Another base R solution without needing a capture group is

my_vec <- c(
    "072:g__Caulobacter",
    "073:g__Romboutsia",
    "074:g__Blastocatella",
    "076:c__Deltaproteobacteria",
    "077:g__Tatumella",
    "078:g__Fretibacterium")

gsub("^.+__", "", my_vec)
#[1] "Caulobacter"         "Romboutsia"          "Blastocatella"      
#[4] "Deltaproteobacteria" "Tatumella"           "Fretibacterium"

Explanation: "^.+__" matches, from the start of each string ("^"), any sequence of one or more characters (".+") followed by a double underscore "__", and replaces the match with an empty string "".
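One thing worth knowing about this pattern: ".+" is greedy, so if a string contained more than one "__", everything up to the last one would be removed. That makes no difference for the data in the question. The sketch below uses a made-up input (x is hypothetical, not from the question) and shows a lazy variant that stops at the first "__" instead.

```r
# Hypothetical input with two double underscores (not part of the original data)
x <- "072:g__Foo__bar"

# Greedy: ".+" extends as far as possible, so the match ends at the LAST "__"
greedy <- gsub("^.+__", "", x)
greedy
#[1] "bar"

# Lazy: ".*?" extends as little as possible, so the match ends at the FIRST "__"
# perl = TRUE is used so that the lazy quantifier is handled by the PCRE engine
lazy <- sub("^.*?__", "", x, perl = TRUE)
lazy
#[1] "Foo__bar"
```

Since the pattern is anchored with "^" and can match at most once per string, sub would work just as well as gsub here.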

CodePudding user response：

I don't know R, but here would be a pure regex solution using two capturing groups: