Extracting in-text citations (character strings) from text in R

I'm trying to write a function that would allow me to paste written text, and it would return a list of the in-text citations that were used in the writing. For example, this is what I currently have:

pull_cites<- function (text){
gsub("[\\(\\)]", "", regmatches(text, gregexpr("\\(.*?\\)", text))[[1]])
    }
    
pull_cites("This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in 
    parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is 
    something I would want to be returned. I would also want multiple citations returned separately such as 
    (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015.")

And in this example, it returns

[1] "cites"                              "abbr"                               "Smith 2010"                        
[4] "Smith 2010; Jones 2001; Brown 2020" "2015"

But I would want it to return something like:

[1] "Smith 2010"
[2] "Smith 2010"                
[3] "Jones 2001"
[4] "Brown 2020"
[5] "Cooper 2015"

Any ideas on how to make this function more specific? I am using R. Thanks!

CodePudding user response：

With some not-so-difficult regex, we can do the following:

library(tidyverse)

pull_cites <- function (text) {
  str_extract_all(text, "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))|[A-Z][a-z]+ \\([12][0-9]{3}[^()]*", simplify = TRUE) %>% 
    gsub("\\(", "", .) %>% 
    str_split(., "; ") %>% 
    unlist()
}

pull_cites("This is a test. I only want to select the (cites) in parenthesis. 
            I do not want it to return words in parenthesis that do not have years attached, 
            such as abbreviations (abbr). For example, citing (Smith 2010) is something I would 
            want to be returned. I would also want multiple citations returned separately such 
            as (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned 
            as Cooper 2015, and not just 2015. other aspects of life 
            history (Nye et al. 2010; Runge et al. 2010; Lesser 2016). In the Gulf of Maine, 
            annual sea surface temperature (SST) averages have increased a total of roughly 1.6 °C 
            since 1895 (Fernandez et al. 2020)")

[1] "Smith 2010"            "Smith 2010"           
[3] "Jones 2001"            "Brown 2020"           
[5] "Cooper 2015"           "Nye et al. 2010"      
[7] "Runge et al. 2010"     "Lesser 2016"          
[9] "Fernandez et al. 2020"

Regex explanation within str_extract_all():

(?<=\\() is a lookbehind: the match must be immediately preceded by an opening parenthesis ( (the backslash is doubled, \\, inside an R string)
[A-Z][a-z][^()]* matches one capital letter, then one lowercase letter, then zero or more characters that are not parentheses ([^()]* was contributed by @WiktorStribiżew)
[12][0-9]{3} matches the year, which I assume starts with either 1 or 2 and is followed by 3 more digits
(?=\\)) is a lookahead: the match must be immediately followed by a closing parenthesis )

The second alternative handles the special case of the pattern Cooper (2015):

[A-Z][a-z]+ \\([12][0-9]{3}[^()]* matches a capital letter followed by one or more lowercase letters, then a space, then an opening parenthesis (, then the "year" that I defined above
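
To see which alternative picks up which kind of citation, the two halves of the pattern can be run separately. This is only a small sketch, assuming stringr is installed; the names txt, paren_rx and narrative_rx are made up for illustration, and the second pattern is written with [a-z]+ as the description above implies.

```r
library(stringr)

txt <- "Citing (Smith 2010) and (Smith 2010; Jones 2001), while Cooper (2015) disagrees (abbr)."

# Alternative 1: citations sitting entirely between "(" and ")"
paren_rx <- "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))"
# Alternative 2: author outside the parentheses, e.g. "Cooper (2015)"
narrative_rx <- "[A-Z][a-z]+ \\([12][0-9]{3}[^()]*"

paren_hits <- str_extract_all(txt, paren_rx)[[1]]
narrative_hits <- str_extract_all(txt, narrative_rx)[[1]]

print(paren_hits)      # "Smith 2010" "Smith 2010; Jones 2001"
print(narrative_hits)  # "Cooper (2015"
```

The second result still contains the opening parenthesis, which is why the function strips it with gsub() before splitting on "; ".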

CodePudding user response：

You can also use

x <- "This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is something I would want to be returned. I would also want multiple citations returned separately such as (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015."
rx <- "(?:\\b(\\p{Lu}\\w*(?:\\s+\\p{Lu}\\w*)*))?\\s*\\(([^()]*\\d{4})\\)"
library(stringr)
res <- str_match_all(x, rx)
result <- lapply(res, function(z) {ifelse(!is.na(z[,2]) & str_detect(z[,3],"^\\d+$"), paste(trimws(z[,2]),  trimws(z[,3])), z[,3])})    
unlist(sapply(result, function(z) strsplit(paste(z, collapse=";"), "\\s*;\\s*")))
## -> [1] "Smith 2010"  "Smith 2010"  "Jones 2001"  "Brown 2020"  "Cooper 2015"

The regex matches

(?:\b(\p{Lu}\w*(?:\s+\p{Lu}\w*)*))? - an optional sequence of
- \b - a word boundary
- (\p{Lu}\w*(?:\s+\p{Lu}\w*)*) - Group 1: an uppercase letter followed by zero or more word chars, and then zero or more sequences of one or more whitespaces and then an uppercase letter followed by zero or more word chars
\s* - zero or more whitespaces
\( - a ( char
([^()]*\d{4}) - Group 2: any zero or more chars other than ( and ) and then four digits
\) - a ) char.

The str_match_all(x, rx) function finds all matches and keeps the captured substrings: in each match matrix, column 1 is the whole match, column 2 is Group 1 and column 3 is Group 2. Then, the Group 1 and Group 2 values are concatenated if Group 1 is not NA and Group 2 is all digits; otherwise, Group 2 is used as is. Finally, the items in the result variable are joined with a ; char and split on ; (enclosed with any zero or more whitespaces).
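
The steps described above can also be wrapped in a function so they can be reused on any text. This is a sketch rather than part of the answer: it assumes stringr is installed, the name pull_cites2 is hypothetical, and it handles a single string.

```r
library(stringr)

# Hypothetical wrapper; the name pull_cites2 is illustrative only.
pull_cites2 <- function(text) {
  rx <- "(?:\\b(\\p{Lu}\\w*(?:\\s+\\p{Lu}\\w*)*))?\\s*\\(([^()]*\\d{4})\\)"
  m <- str_match_all(text, rx)[[1]]  # columns: whole match, Group 1, Group 2
  # Prepend the author only when the parentheses hold nothing but digits
  narrative <- !is.na(m[, 2]) & str_detect(m[, 3], "^\\d+$")
  cites <- ifelse(narrative, paste(m[, 2], m[, 3]), m[, 3])
  # Split multi-citation groups on ";" and trim surrounding whitespace
  trimws(unlist(strsplit(cites, ";", fixed = TRUE)))
}

cites <- pull_cites2("This holds (Smith 2010; Jones 2001). Cooper (2015) agrees, unlike (abbr).")
print(cites)  # "Smith 2010" "Jones 2001" "Cooper 2015"
```

One thing to keep in mind: any capitalized word directly before a year-only parenthesis is treated as the author, so a sentence such as "In (2015) ..." would come back as "In 2015".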