Extraction of substrings from a string on the basis of multiple conditions in R

I have a string str from which multiple substrings are to be extracted.

str <- "Nucleotide transport and metabolism,Secondary metabolites biosynthesis, transport, and catabolism / Chromatin structure and dynamics,Coenzyme metabolism,"

The conditions for extraction are:

Extract everything up to the first comma (,), but only if the character immediately after that comma is a capital letter
If the character after a comma is not a capital letter, continue until
- the next comma that is followed by a capital letter OR
- a slash (/) OR
- the end of the string

The output should look like this

> output
[1] "Nucleotide transport and metabolism"                           "Secondary metabolites biosynthesis, transport, and catabolism"
[3] "Chromatin structure and dynamics"                              "Coenzyme metabolism"

CodePudding user response:

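You can use str_split from the stringr package: split at every comma that is followed by an uppercase letter, or at a slash surrounded by whitespace, then strip the trailing comma left at the end of the string.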

library(stringr)

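# Split at a comma followed by an uppercase letter (lookahead keeps the letter),
# or at a slash surrounded by whitespace; then drop the trailing comma.
str_split(str, ",(?=[:upper:])|\\s\\/\\s") %>% unlist() %>% gsub(",$", "", .)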
[1] "Nucleotide transport and metabolism"                           
[2] "Secondary metabolites biosynthesis, transport, and catabolism"
[3] "Chromatin structure and dynamics"                             
[4] "Coenzyme metabolism"
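
If you would rather not load stringr, the same idea can probably be expressed in base R. The following is only a sketch under a few assumptions that are not in the original answer: the capital letters are plain ASCII (so `[A-Z]` is enough), the slash may have any amount of whitespace around it, and the variable names `parts` and `output` are made up for illustration.

```r
str <- "Nucleotide transport and metabolism,Secondary metabolites biosynthesis, transport, and catabolism / Chromatin structure and dynamics,Coenzyme metabolism,"

# perl = TRUE enables the lookahead (?=...), which checks for a capital
# letter after the comma without consuming it.
# Assumption: capitals are ASCII, hence [A-Z].
parts <- strsplit(str, ",(?=[A-Z])|\\s*/\\s*", perl = TRUE)[[1]]

# The final comma is followed by the end of the string, not a capital
# letter, so it stays attached to the last piece; remove it here.
output <- sub(",$", "", parts)
output
```

Commas followed by a space (as in "biosynthesis, transport, and catabolism") never match the lookahead, so those pieces stay together.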