Extract string between pattern (",") if the string contains a substring in R


I am trying to extract information from text (bibliographies) and would like to extract a part of a string if it contains certain words.

string1 <- "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97."
string2 <- "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C."
string3 <- "Something completely else, without the important part, and then some more."

I would like to extract the part between the commas if it contains the string "Working Paper". So in the case of string1 and string2:

result1 <- "Education Working Paper"
result2 <- "The World Bank Education Working Paper Series"

If the first result is longer and contains the "No. 97", that is OK; that is easy to remove. Is there a way of taking the "Working Paper" string and looking backwards until a "," and forwards until a ","?

CodePudding user response:

Here's another regex, slightly more parsimonious:

str_extract(string, '[^,]+Working Paper[^,]+')
[1] " Education Working Paper No. 97."              
[2] " The World Bank Education Working Paper Series"
[3] NA 

This works by matching the literal string Working Paper together with any characters immediately preceding and following it that are not commas ([^,]+, a negated character class with a one-or-more quantifier).

The output still contains a leading white space, because a space is itself a non-comma character and so is matched by the negated character class. To get rid of it, we can put a negative look-ahead, (?!\\s), in front of the pattern, which prevents the match from starting at a white space character:

str_extract(string, '(?!\\s)[^,]+Working Paper[^,]+')
[1] "Education Working Paper No. 97."              
[2] "The World Bank Education Working Paper Series"
[3] NA


Data:

string <- c(string1, string2, string3)
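
For a self-contained run of the pattern with the look-ahead, here is a minimal sketch, assuming stringr is installed. The final cleanup step that drops a trailing " No. 97." is an assumption about the desired output format (the question only says it is easy to remove), not part of the answer itself.

```r
library(stringr)

string <- c(
  "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97.",
  "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C.",
  "Something completely else, without the important part, and then some more."
)

# Comma-free run of characters around "Working Paper";
# (?!\\s) keeps the match from starting on a white space character.
raw <- str_extract(string, "(?!\\s)[^,]+Working Paper[^,]+")

# Assumed cleanup: remove a trailing " No. <digits>." if present.
clean <- str_remove(raw, "\\s+No\\.\\s*\\d+\\.?$")

print(clean)
```

Strings without "Working Paper" stay NA through both steps, so the result has the same length as the input.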

CodePudding user response:

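Hope this works for you. It keeps only the strings that contain "Working Paper" (df is assumed here to be a character vector holding the strings):

  df[str_detect(df, "Working Paper")]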


CodePudding user response:

You may try to capture the text that starts after a comma, has "Working Paper" in it, and ends at either a comma or a full stop.


string <- c(string1, string2, string3)
str_extract(string, '.*,\\s*(.*Working Paper.*?)(,|\\.)', group = 1)

#[1] "Education Working Paper No"               
#[2] "The World Bank Education Working Paper Series"
#[3] NA 

Note that I am using stringr version 1.5.0, which has the group parameter. If you are using an older version, you may use str_match with the same regex.

str_match(string, '.*,\\s*(.*Working Paper.*?)(,|\\.)')[, 2]
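
A different technique from the regexes above, closer to the wording of the question, is to split each string on commas and keep the piece that contains "Working Paper". Below is a base R sketch; extract_between_commas is a hypothetical helper name, and returning only the first matching piece is an assumption.

```r
# Hypothetical helper (not from the answers above): return the first
# comma-delimited piece of each string that contains `pattern`, or NA.
extract_between_commas <- function(x, pattern = "Working Paper") {
  pieces <- strsplit(x, ",", fixed = TRUE)
  vapply(pieces, function(p) {
    p <- trimws(p)                               # drop surrounding white space
    hit <- p[grepl(pattern, p, fixed = TRUE)]    # pieces containing the pattern
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1))
}

string <- c(
  "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97.",
  "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C.",
  "Something completely else, without the important part, and then some more."
)

result <- extract_between_commas(string)
print(result)
# [1] "Education Working Paper No. 97."
# [2] "The World Bank Education Working Paper Series"
# [3] NA
```

This needs no packages and avoids look-arounds, at the cost of being less compact than a single str_extract call.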