get the 3rd word for each element in a character vector

Time:06-15

I have the following character vector called strains :

 head(strains, 10)

 [1] "Lactobacillus gasseri APC678"                    "Lactobacillus gasseri DSM 20243"                
 [3] "Bifidobacterium angulatum B677"                  "Bifidobacterium breve Reuter S1"                
 [5] "Lactobacillus reuteri F275"                      "Lactobacillus acidophilus L917"                 
 [7] "Lactobacillus acidophilus 4357"                  "Bifidobacterium pseudocatenulatum B1279"        
 [9] "Bifidobacterium longum subsp. infantis JCM 1210" "Clostridium difficile 43594"  

What I want to get is a vector with just the 3rd word for each element in the strains. For example, in the element called "Lactobacillus gasseri APC678", I would like to just keep "APC678".

What I did is the following :

library(tidyverse)

lapply(strains %>% str_split(" "), '[', 3) %>% unlist 

This does what I want, as you can see in the output my code gives:

 [1] "APC678" "DSM"    "B677"   "Reuter" "F275"   "L917"   "4357"   "B1279"  "subsp." "43594"  "subsp." "F275"   "1SL4"   "JCM"   
[15] "JCM"    "AM63"   "DSM"    "L917"   "61D"    "Bb14"   "AM63"   "VPI"

However, I'm looking for a more elegant or concise way to do the same thing, maybe using regex or something similar.


Here is the dput of my data :

strains <- c("Lactobacillus gasseri APC678", "Lactobacillus gasseri DSM 20243", 
"Bifidobacterium angulatum B677", "Bifidobacterium breve Reuter S1", 
"Lactobacillus reuteri F275", "Lactobacillus acidophilus L917", 
"Lactobacillus acidophilus 4357", "Bifidobacterium pseudocatenulatum B1279", 
"Bifidobacterium longum subsp. infantis JCM 1210", "Clostridium difficile 43594"
)

CodePudding user response:

There's a very simple word function from the stringr package for this without the need to use regex.

library(stringr)

stringr::word(strains, start = 3, end = 3)
 [1] "APC678" "DSM"    "B677"   "Reuter" "F275"   "L917"   "4357"  
 [8] "B1279"  "subsp." "43594" 
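
If loading stringr is not an option, a rough base R counterpart could look like the sketch below. It assumes words are separated by exactly one space, as in the sample data, and the name third_word is only illustrative.

```r
# Sample data from the question (first three elements only).
strains <- c("Lactobacillus gasseri APC678",
             "Lactobacillus gasseri DSM 20243",
             "Bifidobacterium angulatum B677")

# Split on a literal space, then take the 3rd piece of each element.
# vapply() guarantees a character vector; an element with fewer than
# three words yields NA, because indexing past the end returns NA.
third_word <- vapply(strsplit(strains, " ", fixed = TRUE),
                     function(parts) parts[3],
                     character(1))
third_word
```

This is essentially the lapply() approach from the question with the unlist() step folded into vapply().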

CodePudding user response:

You can use stringr package:

stringr::str_split(strains, " ", simplify = TRUE)[,3]
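
To see why indexing with [,3] works, here is a small sketch on a made-up vector x (not part of the question's data). With simplify = TRUE, str_split() returns a character matrix instead of a list, so the third column holds the third word of every element.

```r
library(stringr)  # assumes stringr is installed

x <- c("a b c d", "e f")  # hypothetical input with uneven word counts

# simplify = TRUE returns a character matrix: one row per input string,
# one column per piece, with shorter rows padded by empty strings.
m <- str_split(x, " ", simplify = TRUE)
dim(m)
m[, 3]
```

Note that an element with fewer than three words gives "" here rather than NA, and if no element had three words the matrix would not have a third column at all.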

CodePudding user response:

With Base R and regex:

sub("^(\\S+\\s+){2}(\\S+).*", "\\2", strains)
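
As a sketch of how this kind of pattern works, the example below spells out each part with `+` quantifiers on the word and whitespace classes. The one-word element is a hypothetical addition to show what happens when there is no third word.

```r
strains <- c("Lactobacillus gasseri APC678",
             "Bifidobacterium longum subsp. infantis JCM 1210",
             "Lactobacillus")   # hypothetical one-word element

# ^               anchor at the start of the string
# (\\S+\\s+){2}   two words, each followed by whitespace
# (\\S+)          the third word, captured as group 2
# .*              whatever remains, so the whole string is consumed
pattern <- "^(\\S+\\s+){2}(\\S+).*"

# Replace the entire match with group 2 only.
res <- sub(pattern, "\\2", strains)
res
```

When the pattern does not match, sub() returns the input unchanged, so short elements come back as they were rather than as NA.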

With data.table:

data.table::tstrsplit(strains, " ")[[3]]
# [1] "APC678" "DSM"    "B677"   "Reuter" "F275"   "L917"   "4357"   "B1279"  "subsp." "43594"

CodePudding user response:

Another possible solution, based on stringr::str_match and capture groups:

library(stringr)

str_match(strains, "(\\S+\\s+){2}(\\S+).*")[,3]

#>  [1] "APC678" "DSM"    "B677"   "Reuter" "F275"   "L917"   "4357"   "B1279" 
#>  [9] "subsp." "43594"
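
str_match() returns a matrix whose first column is the full match and whose remaining columns are the capture groups, which is why column 3 holds the second group. The same layout can be illustrated in base R with regexec() and regmatches(); this is a sketch for one element, using perl = TRUE as an assumption to keep the regex flavour explicit.

```r
x <- "Lactobacillus gasseri DSM 20243"
pattern <- "(\\S+\\s+){2}(\\S+).*"

# regexec() + regmatches() give, per input string, a character vector:
# [1] the full match, [2] capture group 1, [3] capture group 2.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
length(m)
m[3]
```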

CodePudding user response:

The stringr package already has a function that conflicts with the regex.
