Extraction of substrings from a string on the basis of multiple conditions in R

Time:02-15

I have a string str from which multiple substrings are to be extracted.

str <- "Nucleotide transport and metabolism,Secondary metabolites biosynthesis, transport, and catabolism / Chromatin structure and dynamics,Coenzyme metabolism,"

The conditions for extraction are:

  • Extract everything up to the first , but only if the character right after that , is a capital letter
  • If the character after a , is not a capital letter, continue until
    • the next , that is followed by a capital letter, OR
    • a /, OR
    • the end of the string

The output should look like this

>output
[1] "Nucleotide transport and metabolism"                           "Secondary metabolites biosynthesis, transport, and catabolism"
[3] "Chromatin structure and dynamics"                              "Coenzyme metabolism"                                          

CodePudding user response:

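You can use str_split from the stringr package. The pattern splits at every , that is followed by a capital letter (the lookahead keeps the letter itself in the next piece) and at every / surrounded by whitespace; gsub then removes the trailing , left on the last piece.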

library(stringr)

str_split(str, ",(?=[:upper:])|\\s\\/\\s") %>% unlist() %>% gsub(",$", "", .)
[1] "Nucleotide transport and metabolism"                           
[2] "Secondary metabolites biosynthesis, transport, and catabolism"
[3] "Chromatin structure and dynamics"                             
[4] "Coenzyme metabolism"
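
If you would rather not load stringr, the same idea can be sketched in base R with strsplit and perl = TRUE, which enables the lookahead. This is only a sketch under two assumptions that are not in the question: whitespace around the / is treated as optional and is dropped, and the variable names parts and output are just illustrative.

```r
# Input string from the question
str <- "Nucleotide transport and metabolism,Secondary metabolites biosynthesis, transport, and catabolism / Chromatin structure and dynamics,Coenzyme metabolism,"

# Split at a comma that is followed by an uppercase letter (lookahead, so the
# letter stays in the next piece), or at a slash with optional whitespace
# around it (assumption: the surrounding whitespace is not wanted)
parts <- strsplit(str, ",(?=[[:upper:]])|\\s*/\\s*", perl = TRUE)[[1]]

# The final comma is not followed by a capital letter, so it survives the
# split; remove it from the end of any piece that still carries one
output <- sub(",$", "", parts)

print(output)
```

The lookahead matters here: a plain ",[[:upper:]]" would also consume the capital letter, so "Secondary" would come back as "econdary".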