I am trying to extract information from text (bibliographies) and would like to extract a part of a string if it contains certain words.
string1 <- "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97."
string2 <- "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C."
string3 <- "Something completely else, without the important part, and then some more."
I would like to extract the part between the commas, if it contains the string "Working Paper". So in case of string1 and string2:
result1 <- "Education Working Paper"
result2 <- "The World Bank Education Working Paper Series"
If the first result is longer and still contains the No. 97, that is OK; that part is easy to remove. Is there a way of taking the "Working Paper" string and looking backwards to the preceding "," and forwards to the following ","?
CodePudding user response:
Here's another regex, slightly more parsimonious:
library(stringr)
str_extract(string, '[^,]+Working Paper[^,]+')
[1] " Education Working Paper No. 97."
[2] " The World Bank Education Working Paper Series"
[3] NA
This works by matching the literal string Working Paper together with one or more non-comma characters on either side of it ([^,]+, where [^,] is a negated character class).
The output still contains a leading white space. To get rid of it, we can add a negative look-ahead, (?!\\s), in front of the pattern. It prevents the match from starting on a white space character, so the leading space is no longer picked up by the negated character class:
str_extract(string, '(?!\\s)[^,]+Working Paper[^,]+')
[1] "Education Working Paper No. 97."
[2] "The World Bank Education Working Paper Series"
[3] NA
Data:
string <- c(string1, string2, string3)
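The question mentions that a trailing No. 97. would be easy to remove afterwards. As a rough sketch of that clean-up step, the snippet below does the same extraction in base R (regexpr() and regmatches() instead of stringr, so it has no dependencies) and then strips the number. The clean-up pattern \\s+No\\..*$ is an assumption about how the numbering is written in the bibliography and may need adjusting for other entries.

```r
string1 <- "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97."
string2 <- "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C."
string3 <- "Something completely else, without the important part, and then some more."
string <- c(string1, string2, string3)

# Locate the comma-free stretch around "Working Paper" (-1 means no match)
m <- regexpr("[^,]+Working Paper[^,]+", string, perl = TRUE)

# Start with NA everywhere, then fill in the entries that matched
res <- rep(NA_character_, length(string))
res[m > 0] <- regmatches(string, m)

# Drop the surrounding white space
res <- trimws(res)

# Assumed clean-up: remove a trailing " No. <something>" if present
res <- sub("\\s+No\\..*$", "", res)

print(res)
```

With the three example strings this should give "Education Working Paper", "The World Bank Education Working Paper Series" and NA.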
CodePudding user response:
Hope this works for you.
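library(stringr)
library(purrr)  # provides map()
x <- c(string1, string2, string3)
# Split each entry at the commas
y <- str_split(x, ",")
# Keep only the pieces that contain "Working Paper"
detectwp <- function(parts) {
  parts[str_detect(parts, "Working Paper")]
}
map(y, detectwp)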
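For comparison, here is a minimal base R sketch of the same split-and-filter idea. It returns a plain character vector with NA for entries that have no match, rather than a list. The helper name pick_wp is made up for this illustration, and taking only the first matching piece is an assumption about what is wanted when several pieces match.

```r
string1 <- "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97."
string2 <- "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C."
string3 <- "Something completely else, without the important part, and then some more."
string <- c(string1, string2, string3)

# Split each entry at the commas and trim white space from every piece
parts <- lapply(strsplit(string, ",", fixed = TRUE), trimws)

# Hypothetical helper: first piece containing "Working Paper", or NA if none
pick_wp <- function(p) {
  hit <- p[grepl("Working Paper", p, fixed = TRUE)]
  if (length(hit) > 0) hit[1] else NA_character_
}

result <- vapply(parts, pick_wp, character(1))
print(result)
```

Because the pieces are trimmed before filtering, the leading space seen in the regex output does not appear here.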
CodePudding user response:
You may try to capture the text that starts with a comma, has "Working Paper" in it and ends either with a comma or a full stop.
library(stringr)
string <- c(string1, string2, string3)
str_extract(string, '.*,\\s*(.*Working Paper.*?)(,|\\.)', group = 1)
#[1] "Education Working Paper No"
#[2] "The World Bank Education Working Paper Series"
#[3] NA
Note that I am using stringr version 1.5.0, which has the group parameter. If you are using an older version, you may use str_match with the same regex.
str_match(string, '.*,\\s*(.*Working Paper.*?)(,|\\.)')[, 2]
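If stringr is not available at all, the same capture-group idea can be sketched in base R with regexec() and regmatches(). This is only an illustration under the assumption that PCRE (perl = TRUE) treats this pattern the same way stringr does. It also pulls out the second group, the closing delimiter, which shows why the first result stops at "No": the lazy .*? ends at the first comma or full stop after "Working Paper".

```r
string1 <- "Authors, (2000). Norway: A Governing Complex Education Systems Case Study, Education Working Paper No. 97."
string2 <- "Authors (2012), Tertiary Education Finance: A Comparative Evaluation of Funding Allocation Mechanisms, The World Bank Education Working Paper Series, No. 4, Washington, D.C."
string3 <- "Something completely else, without the important part, and then some more."
string <- c(string1, string2, string3)

pattern <- ".*,\\s*(.*Working Paper.*?)(,|\\.)"

# One list element per string: full match, group 1, group 2
# (an empty character vector when there is no match)
m <- regmatches(string, regexec(pattern, string, perl = TRUE))

# Group 1: the text containing "Working Paper"
extracted <- vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_, character(1))

# Group 2: the delimiter that ended the lazy match
delimiter <- vapply(m, function(x) if (length(x) >= 3) x[3] else NA_character_, character(1))

print(extracted)
print(delimiter)
```

For string1 the delimiter is the full stop in "No.", which is why "97" is cut off; for string2 it is the comma after "Series".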