Extract the string between the nth occurrence of two symbols

Time:10-10

I'm editing a messy reference list. I'd like to extract the string between the year and the next period. Original text:

[1] "Acemoglu, D., & Robinson, J. A. (2012). Why nations fail: The origins of power, prosperity, and poverty. Crown Books."
[2] "Adam, S., & Kriesi, H. (2007). The network approach. In Sabatier, P. A. (ed.), Theories of the policy process (2nd Ed.). Cambridge, MA: Westview Press."
[3] "Adams-Webber, J. R. (1969). Cognitive complexity and sociality. British Journal of Social and Clinical Psychology, 8, 211-216."

I'd like to extract the following:

[1] "Why nations fail: The origins of power, prosperity, and poverty."
[2] "The network approach."
[3] "Cognitive complexity and sociality."

I'm using the following code:

str_extract(df1$References, pattern = "(?<=\\).).*(?=\\.)")

But the extracted text does not stop at the first "." after the year. It returns:

[1] " Why nations fail: The origins of power, prosperity, and poverty. Crown Books"
[2] " The network approach. In Sabatier, P. A. (ed.), Theories of the policy process (2nd Ed.). Cambridge, MA: Westview Press"
[3] " Cognitive complexity and sociality. British Journal of Social and Clinical Psychology, 8, 211-216"

CodePudding user response:

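Consider a regex pattern that matches one or more characters that are not a dot ([^.]+) followed by a literal dot (\\.). The match must come right after a closing parenthesis, a dot and a whitespace character (\\)\\.\\s), which are wrapped in a regex lookbehind so they are not part of the result.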

library(stringr)
library(tibble)
str_extract(df1$References, "(?<=\\)\\.\\s)[^.]+\\.")
[1] "Why nations fail: The origins of power, prosperity, and poverty." "The network approach."                                           
[3] "Cognitive complexity and sociality." 

data

df1 <- structure(list(References = c("Acemoglu, D., & Robinson, J. A. (2012). Why nations fail: The origins of power, prosperity, and poverty. Crown Books.", 
"Adam, S., & Kriesi, H. (2007). The network approach. In Sabatier, P. A. (ed.), Theories of the policy process (2nd Ed.). Cambridge, MA: Westview Press.", 
"Adams-Webber, J. R. (1969). Cognitive complexity and sociality. British Journal of Social and Clinical Psychology, 8, 211-216."
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L))
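
Why the original pattern over-matched: `.*` is greedy, so it runs to the end of the string and backtracks only as far as the last period. A minimal sketch of the difference, using base R (`regexpr()` with `perl = TRUE`, plus `regmatches()`) rather than stringr so that it runs without extra packages. The vector name `refs` and the two pattern variables are illustrative, not from the original post:

```r
# Sample references copied from the question.
refs <- c(
  "Acemoglu, D., & Robinson, J. A. (2012). Why nations fail: The origins of power, prosperity, and poverty. Crown Books.",
  "Adam, S., & Kriesi, H. (2007). The network approach. In Sabatier, P. A. (ed.), Theories of the policy process (2nd Ed.). Cambridge, MA: Westview Press.",
  "Adams-Webber, J. R. (1969). Cognitive complexity and sociality. British Journal of Social and Clinical Psychology, 8, 211-216."
)

# Greedy: `.*` consumes as much as possible, so the match ends at the LAST period.
greedy <- "(?<=\\)\\.\\s).*\\."
# Lazy: `.*?` consumes as little as possible, so the match ends at the FIRST period.
lazy <- "(?<=\\)\\.\\s).*?\\."

# perl = TRUE is needed because the default regex engine has no lookbehind.
greedy_out <- regmatches(refs, regexpr(greedy, refs, perl = TRUE))
lazy_out <- regmatches(refs, regexpr(lazy, refs, perl = TRUE))

print(lazy_out)
```

Note that `regmatches()` silently drops elements that do not match, so the result can be shorter than the input; `str_extract()` returns `NA` for those elements instead.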

CodePudding user response:

Base R option with sub. Capture the text that comes after the parenthesised number (the year) up to the next full stop, and replace the whole string with that capture group.

x <- c("Acemoglu, D., & Robinson, J. A. (2012). Why nations fail: The origins of power, prosperity, and poverty. Crown Books.", 
       "Adam, S., & Kriesi, H. (2007). The network approach. In Sabatier, P. A. (ed.), Theories of the policy process (2nd Ed.). Cambridge, MA: Westview Press.", 
       "Adams-Webber, J. R. (1969). Cognitive complexity and sociality. British Journal of Social and Clinical Psychology, 8, 211-216.")

sub('.*?\\(\\d+\\)\\.\\s*(.*?)\\..*', '\\1', x)

#[1] "Why nations fail: The origins of power, prosperity, and poverty"
#[2] "The network approach"                                           
#[3] "Cognitive complexity and sociality" 
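
One caveat with `sub()`: when the pattern does not match, the input string is returned unchanged, which is easy to miss in a messy reference list. A possible sketch that returns `NA` for such entries and keeps the trailing period, as in the desired output. The helper name `extract_title` and the extra test string are hypothetical, and `perl = TRUE` is a choice made here, not part of the answer above:

```r
# Hypothetical helper: text between "(<digits>). " and the next period, or NA if absent.
extract_title <- function(x) {
  # The capture group includes the closing period; the rest of the string is discarded.
  pattern <- ".*?\\(\\d+\\)\\.\\s*(.*?\\.).*"
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pattern, "\\1", x[hit], perl = TRUE)
  out
}

refs <- c(
  "Acemoglu, D., & Robinson, J. A. (2012). Why nations fail: The origins of power, prosperity, and poverty. Crown Books.",
  "Adam, S., & Kriesi, H. (2007). The network approach. In Sabatier, P. A. (ed.), Theories of the policy process (2nd Ed.). Cambridge, MA: Westview Press.",
  "Adams-Webber, J. R. (1969). Cognitive complexity and sociality. British Journal of Social and Clinical Psychology, 8, 211-216.",
  "An entry with no year in parentheses."  # made-up entry to show the NA case
)

titles <- extract_title(refs)
print(titles)
```

Like the other patterns in this thread, this stops at the first period, so a title that itself contains a period (an abbreviation, for example) would be cut short.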
Tags: r