I am trying to use the str_match function from the stringr package in R to extract the title from bibliographic entries like the one below. Specifically, I need the text between the "title={" and "}," strings.
a2
[1] "@article{2020, title={Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR‐421 and E‐cadherin}, volume={9}, ISSN={2045-7634}, url={http://dx.doi.org/10.1002/cam4.3002}, DOI={10.1002/cam4.3002}, number={11}, journal={Cancer Medicine}, publisher={Wiley}, author={Ji, Yefeng and Feng, Guanying and Hou, Yunwen and Yu, Yang and Wang, Ruixia and Yuan, Hua}, year={2020}, month={Apr}, pages={3954–3963} }"
I have used approaches like the following, but I get an error message:
str_match(a2, "(?s)title={\\s*(.*?)\\s*},.")
Error in stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=(?s)title={\s*(.*?)\s*},.
)
I suspect the problem is with matching the curly braces, but I couldn't make any progress. Any pointers would be greatly appreciated.
CodePudding user response:
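Escape the curly braces (an unescaped { is read as the start of an interval quantifier such as {2,3}, hence the error) and use the following regex.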
a2 <- "@article{2020, title={Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin}, volume={9}, ISSN={2045-7634}, url={http://dx.doi.org/10.1002/cam4.3002}, DOI={10.1002/cam4.3002}, number={11}, journal={Cancer Medicine}, publisher={Wiley}, author={Ji, Yefeng and Feng, Guanying and Hou, Yunwen and Yu, Yang and Wang, Ruixia and Yuan, Hua}, year={2020}, month={Apr}, pages={3954–3963} }"
sub("^.*title=\\{([^{}]+)\\}.*$", "\\1", a2)
#> [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin"
Created on 2022-03-19 by the reprex package (v2.0.1)
Edit: an alternative using stringr.
stringr::str_match(a2, "^.*title=\\{([^{}]+)\\}.*$")[,2]
#> [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin"
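For comparison, the pattern from the question can also be kept almost as it was, as long as every literal brace is escaped. This is only a minimal sketch: the shortened entry and the variable names short_entry and m are made up for illustration and are not from the original post.

```r
library(stringr)

# Hypothetical, shortened entry used only for illustration
short_entry <- "@article{2020, title={Some title}, volume={9} }"

# Same idea as the pattern in the question, but with "{" and "}" escaped
# so the regex engine does not read them as an interval quantifier
m <- str_match(short_entry, "(?s)title=\\{\\s*(.*?)\\s*\\},")

# Column 1 holds the full match, column 2 the first capture group
m[, 2]
#> [1] "Some title"
```

The lazy (.*?) stops at the first "}," it meets, so this assumes the title itself does not contain that sequence.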
CodePudding user response:
Another possible solution, based on stringr::str_extract:
library(tidyverse)
a2 %>%
  str_extract("(?<=title\\=\\{)[^\\}]*(?=\\},)")
#> [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR‐421 and E‐cadherin"
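Because str_extract is vectorised, the same lookaround pattern can be applied to several entries at once. The sketch below assumes a character vector of entries; the two entries and the names entries and titles are invented for illustration. An entry without a title field gives NA rather than an error.

```r
library(stringr)

# Hypothetical entries, made up for illustration
entries <- c(
  "@article{a, title={First title}, year={2020} }",
  "@article{b, author={Doe, Jane}, year={2021} }"
)

# Lookbehind anchors on "title={", lookahead on "},";
# only the text in between is returned
titles <- str_extract(entries, "(?<=title=\\{)[^\\}]*(?=\\},)")

titles
#> [1] "First title" NA
```

Note that [^\\}]* stops at the first closing brace, so a title that contains nested braces would not be captured in full by this approach.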
CodePudding user response:
Since you want to parse a BibTeX file, you can use bib2df::bib2df, with reference.bib being your BibTeX file.
install.packages("bib2df")
library(bib2df)
bib2df("reference.bib")$TITLE
# [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin"