Extract string between pattern (",") if the string contains a substring in R

I am trying to extract information from text (bibliographies) and would like to extract a part of a string if it contains certain words.

string1 <- "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97."
string2 <- "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C."
string3 <- "Something completely else, without the important part, and then some more."

I would like to extract the part between the commas, if it contains the string "Working Paper". So in case of string1 and string2:

result1 <- "Education Working Paper"
result2 <- "The World Bank Education Working Paper Series"

If the first result is longer and contains the No. 97, that is ok, that is easy to remove. Is there a way of taking the "Working Paper" string and looking backwards to the preceding "," and forwards to the next ","?

CodePudding user response：

Here's another regex, slightly more parsimonious:

library(stringr)
str_extract(string, '[^,]+Working Paper[^,]+')
[1] " Education Working Paper No. 97."              
[2] " The World Bank Education Working Paper Series"
[3] NA

This works by matching the literal string Working Paper together with any characters immediately before and after it that are not commas ([^,]+, a negated character class with a one-or-more quantifier).

The output still contains a leading whitespace character. To get rid of it, we can put a negative look-ahead, (?!\\s), at the start of the pattern. It prevents the match from beginning at a whitespace character, which the negated character class would otherwise accept:

str_extract(string, '(?!\\s)[^,]+Working Paper[^,]+')
[1] "Education Working Paper No. 97."              
[2] "The World Bank Education Working Paper Series"
[3] NA
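
The question mentions that a trailing "No. 97" is easy to remove. As a minimal sketch of that follow-up step (the variable names extracted and cleaned and the suffix pattern are my own assumptions, not part of the answer above), str_remove can strip an optional " No. <digits>." ending:

```r
library(stringr)

string <- c(
  "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97.",
  "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C.",
  "Something completely else, without the important part, and then some more."
)

# Same extraction as above: the comma-free run of text around "Working Paper"
extracted <- str_extract(string, "(?!\\s)[^,]+Working Paper[^,]+")

# Assumed suffix shape: whitespace, "No.", optional whitespace, digits,
# and an optional final period at the end of the string
cleaned <- str_remove(extracted, "\\s+No\\.\\s*\\d+\\.?$")

cleaned
```

Elements without such a suffix are left unchanged, and NA stays NA.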

Data:

string <- c(string1, string2, string3)
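
If you would rather not depend on stringr, the same idea can be sketched in base R. This is only an illustration under my own assumptions: base R's default regex engine does not support look-aheads unless perl = TRUE is set, so the leading space is removed with trimws() instead.

```r
string <- c(
  "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97.",
  "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C.",
  "Something completely else, without the important part, and then some more."
)

# regexpr() gives the position of the first match, or -1 when there is none
m <- regexpr("[^,]+Working Paper[^,]+", string)

# Start with NA everywhere, then fill in only the elements that matched
# (regmatches() returns one value per matching element)
result <- rep(NA_character_, length(string))
result[m != -1] <- regmatches(string, m)

# Drop the leading space that follows the comma
result <- trimws(result)

result
```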

CodePudding user response：

Hope this works for you.

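library(stringr)
library(purrr)  # provides map()

x <- c(string1, string2, string3)

# Split each string at the commas; the result is a list of character vectors
y <- str_split(x, ",")

# Keep only the pieces that contain "Working Paper"
detectwp <- function(pieces) {
  pieces[str_detect(pieces, "Working Paper")]
}

map(y, detectwp)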
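
The split-and-filter approach returns a list, and the matching pieces keep their leading space. A possible way to turn that into a plain character vector with NA for strings without a match is sketched below; the names parts, hits and result are hypothetical, and taking only the first hit per string is my own assumption.

```r
library(stringr)

string <- c(
  "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97.",
  "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C.",
  "Something completely else, without the important part, and then some more."
)

# Split at commas, then keep and trim the pieces mentioning "Working Paper"
parts <- str_split(string, ",")
hits <- lapply(parts, function(p) str_trim(p[str_detect(p, "Working Paper")]))

# Collapse to one value per input: the first hit, or NA when there is none
result <- vapply(
  hits,
  function(h) if (length(h) > 0) h[1] else NA_character_,
  character(1)
)

result
```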

CodePudding user response：

You may try to capture the text that starts with a comma, has "Working Paper" in it and ends either with a comma or a full stop.

library(stringr)

string <- c(string1, string2, string3)
str_extract(string, '.*,\\s*(.*Working Paper.*?)(,|\\.)', group = 1)

#[1] "Education Working Paper No"               
#[2] "The World Bank Education Working Paper Series"
#[3] NA

Note that I am using stringr version 1.5.0, which has the group parameter. If you are using an older version, you can use str_match with the same regex; column 2 of the matrix it returns holds the first capture group.

str_match(string, '.*,\\s*(.*Working Paper.*?)(,|\\.)')[, 2]
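
If the code has to run on machines with different stringr versions, the two calls can be wrapped in one helper. This is only a sketch: extract_wp is a hypothetical name, and the version check relies on the 1.5.0 cut-off mentioned above.

```r
library(stringr)

# Hypothetical helper: picks the call that the installed stringr supports
extract_wp <- function(x) {
  pattern <- ".*,\\s*(.*Working Paper.*?)(,|\\.)"
  if (packageVersion("stringr") >= "1.5.0") {
    # group = 1 returns the first capture group directly
    str_extract(x, pattern, group = 1)
  } else {
    # Older versions: column 1 is the full match, column 2 is group 1
    str_match(x, pattern)[, 2]
  }
}

string <- c(
  "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97.",
  "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C.",
  "Something completely else, without the important part, and then some more."
)

res <- extract_wp(string)
res
```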