R command to extract text between two strings containing curly parentheses

Time:03-20

I am trying to use the str_match function from the stringr package to extract the title from bibliographic entries like the one below. Specifically, I need the text between the
"title={" and "}," strings.

a2
[1] "@article{2020, title={Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR‐421 and E‐cadherin}, volume={9}, ISSN={2045-7634}, url={http://dx.doi.org/10.1002/cam4.3002}, DOI={10.1002/cam4.3002}, number={11}, journal={Cancer Medicine}, publisher={Wiley}, author={Ji, Yefeng and Feng, Guanying and Hou, Yunwen and Yu, Yang and Wang, Ruixia and Yuan, Hua}, year={2020}, month={Apr}, pages={3954–3963} }"

I have used approaches like the following, but I get an error message:

str_match(a2, "(?s)title={\\s*(.*?)\\s*},.")

Error in stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=(?s)title={\s*(.*?)\s*},.)

I guess the problem is with matching the curly braces, but I couldn't make any progress. Any pointers would be greatly appreciated.

CodePudding user response:

Escape the curly braces, since an unescaped "{" starts a repetition quantifier, and capture the title in a group:

a2 <- "@article{2020, title={Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin}, volume={9}, ISSN={2045-7634}, url={http://dx.doi.org/10.1002/cam4.3002}, DOI={10.1002/cam4.3002}, number={11}, journal={Cancer Medicine}, publisher={Wiley}, author={Ji, Yefeng and Feng, Guanying and Hou, Yunwen and Yu, Yang and Wang, Ruixia and Yuan, Hua}, year={2020}, month={Apr}, pages={3954–3963} }"

sub("^.*title=\\{([^{}]+)\\}.*$", "\\1", a2)
#> [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin"

Created on 2022-03-19 by the reprex package (v2.0.1)


Edit

An alternative using stringr:

stringr::str_match(a2, "^.*title=\\{([^{}]+)\\}.*$")[,2]
#> [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin"

Created on 2022-03-19 by the reprex package (v2.0.1)
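
As for why the pattern in the question fails: in the ICU regex engine that stringr uses, an unescaped "{" opens a {min,max} repetition interval, so "title={\\s*" is read as a malformed quantifier. Below is a minimal sketch of the question's own pattern with the braces escaped. The entry bib is a shortened, made-up stand-in for a2, and the trailing "." from the original pattern is dropped because it is not needed.

```r
library(stringr)

# Shortened, made-up entry used only for illustration (stands in for a2).
bib <- "@article{2020, title={ A short example title }, volume={9}, year={2020} }"

# Escape the literal braces; (.*?) lazily captures up to the first "},".
m <- str_match(bib, "(?s)title=\\{\\s*(.*?)\\s*\\},")

# Column 1 holds the full match, column 2 the first capture group.
title <- m[, 2]
print(title)
```

The surrounding \\s* parts sit outside the capture group, so leading and trailing whitespace inside the braces is not part of the returned title.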

CodePudding user response:

Another possible solution, based on stringr::str_extract:

library(tidyverse)

a2 %>% 
  str_extract("(?<=title\\=\\{)[^\\}]*(?=\\},)")

#> [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR‐421 and E‐cadherin"
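
The same lookaround idea should also work in base R if stringr is not available, provided perl = TRUE is set so that lookbehind and lookahead are supported. A small sketch, assuming a vector of made-up entries named entries (not from the question):

```r
# Made-up entries for illustration only.
entries <- c(
  "@article{a, title={First title}, year={2020} }",
  "@article{b, title={Second title}, year={2021} }"
)

# perl = TRUE enables lookbehind/lookahead in base R regexes.
# The match is the text after "title={" and before the next "},".
pos <- regexpr("(?<=title=\\{)[^}]*(?=\\},)", entries, perl = TRUE)

# regmatches() returns only the matched substrings.
titles <- regmatches(entries, pos)
print(titles)
```

One difference to keep in mind: regmatches() silently drops elements that do not match, whereas str_extract() returns NA for them, so the stringr version keeps the result aligned with the input vector.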

CodePudding user response:

Since you want to parse a BibTeX file, you can use bib2df::bib2df, where reference.bib is your BibTeX file.

install.packages("bib2df")
library(bib2df)

bib2df("reference.bib")$TITLE..LONG
# [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin"