I'm editing a messy reference list. I'd like to extract the string between the year and the next period. Original text:
[1] "Acemoglu, D., & Robinson, J. A. (2012). Why nations fail: The origins of power, prosperity, and poverty. Crown Books."
[2] "Adam, S., & Kriesi, H. (2007). The network approach. In Sabatier, P. A. (ed.), Theories of the policy process (2nd Ed.). Cambridge, MA: Westview Press."
[3] "Adams-Webber, J. R. (1969). Cognitive complexity and sociality. British Journal of Social and Clinical Psychology, 8, 211-216."
I'd like to extract the following:
[1] "Why nations fail: The origins of power, prosperity, and poverty."
[2] "The network approach."
[3] "Cognitive complexity and sociality."
I'm using the following code:
str_extract(df1$References, pattern = "(?<=\\).).*(?=\\.)")
But the extracted text does not stop at the first "."; it runs on to the last one. It returns:
[1] " Why nations fail: The origins of power, prosperity, and poverty. Crown Books"
[2] " The network approach. In Sabatier, P. A. (ed.), Theories of the policy process (2nd Ed.). Cambridge, MA: Westview Press"
[3] " Cognitive complexity and sociality. British Journal of Social and Clinical Psychology, 8, 211-216"
CodePudding user response:
Consider using a regex pattern that matches one or more characters that are not a dot ([^.]+) followed by a dot (\\.), where the match is preceded by the closing ), a . and a space (\\s), with that prefix wrapped in a regex lookbehind:
library(stringr)
library(tibble)
str_extract(df1$References, "(?<=\\)\\.\\s)[^.]+\\.")
[1] "Why nations fail: The origins of power, prosperity, and poverty." "The network approach."
[3] "Cognitive complexity and sociality."
data
df1 <- structure(list(References = c("Acemoglu, D., & Robinson, J. A. (2012). Why nations fail: The origins of power, prosperity, and poverty. Crown Books.",
"Adam, S., & Kriesi, H. (2007). The network approach. In Sabatier, P. A. (ed.), Theories of the policy process (2nd Ed.). Cambridge, MA: Westview Press.",
"Adams-Webber, J. R. (1969). Cognitive complexity and sociality. British Journal of Social and Clinical Psychology, 8, 211-216."
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))
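
To illustrate why the pattern in the question overshoots, here is a minimal sketch comparing a greedy .* with a lazy .*? after the same lookbehind. The shortened sample string ref is a hypothetical input made up for this illustration, not one of the original references.

```r
library(stringr)

# Hypothetical sample input, shortened from the first reference
ref <- "Acemoglu, D., & Robinson, J. A. (2012). Why nations fail. Crown Books."

# Greedy: .* extends as far as it can, so the match ends just before the LAST period
greedy <- str_extract(ref, "(?<=\\)\\.\\s).*(?=\\.)")
greedy
# [1] "Why nations fail. Crown Books"

# Lazy: .*? grows one character at a time and stops at the FIRST period
lazy <- str_extract(ref, "(?<=\\)\\.\\s).*?\\.")
lazy
# [1] "Why nations fail."
```

The negated class [^.]+ used in the answer and the lazy .*? give the same result on these references; the negated class simply makes it impossible for the match to cross a period.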
CodePudding user response:
Base R option with sub. Extract the text that comes after (number). up to the next full stop.
x <- c("Acemoglu, D., & Robinson, J. A. (2012). Why nations fail: The origins of power, prosperity, and poverty. Crown Books.",
"Adam, S., & Kriesi, H. (2007). The network approach. In Sabatier, P. A. (ed.), Theories of the policy process (2nd Ed.). Cambridge, MA: Westview Press.",
"Adams-Webber, J. R. (1969). Cognitive complexity and sociality. British Journal of Social and Clinical Psychology, 8, 211-216.")
sub('.*?\\(\\d+\\)\\.\\s*(.*?)\\..*', '\\1', x)
#[1] "Why nations fail: The origins of power, prosperity, and poverty"
#[2] "The network approach"
#[3] "Cognitive complexity and sociality"
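
The output above drops the trailing full stop, while the question asks to keep it. As a sketch of one possible variation (not part of the original answer), the period can be moved inside the capture group. This version assumes perl = TRUE so that the lazy quantifier follows PCRE semantics, and it uses shortened, hypothetical sample strings.

```r
# Hypothetical sample inputs, shortened from the references in the question
x <- c("Acemoglu, D., & Robinson, J. A. (2012). Why nations fail. Crown Books.",
       "Adam, S., & Kriesi, H. (2007). The network approach. In Sabatier, P. A. (ed.), Theories (2nd Ed.). Westview Press.")

# Skip everything up to "(digits). ", capture the run of non-dot characters
# plus the period that ends it, then discard the rest of the string
titles <- sub('.*?\\(\\d+\\)\\.\\s*([^.]+\\.).*', '\\1', x, perl = TRUE)
titles
# [1] "Why nations fail."     "The network approach."
```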