How to use a regex to find all curly-braces inside curly-braces?

Time:04-28

I'm using Zotero to create a BibTeX list of references from PDFs, and it uses { } around words whose case must be preserved.

title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},

However, some people in my team use Mendeley, which doesn't seem to know about this rule of BibTeX format, and the { } still appear in their titles after importing from the BibTeX file I've sent.

So I want to write a small script (in R) to remove the { } inside the main { } of the title (and other fields), so that the above line would, in the modified file, become as below.

title = {Novel breeding habitat, oviposition microhabitat, and parental care in Bokermannohyla caramaschii (Anura: Hylidae) in southeastern Brazil},

I've tried a lot, but nothing works. What is the Regex to do that?

CodePudding user response:

You can replace each match of the regular expression

(?<!^title = ){|}(?!,$)

with an empty string.


The regular expression can be broken down as follows. (I've shown spaces as character classes containing a space so that they are visible to the reader.)

(?<!            # begin a negative lookbehind
  ^             # match the start of the string 
  title[ ]=[ ]  # match 'title = '
)               # end negative lookbehind
{               # match '{'
|               # or
}               # match '}'
(?!             # begin a negative lookahead
  ,$            # match a comma at the end of the string
)               # end a negative lookahead
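Since the question asks for R, here is a minimal sketch of how the pattern above might be applied with gsub(). The helper name remove_inner_braces is hypothetical, and the sketch assumes each string holds exactly one title field written as 'title = {...},' with that exact spacing.

```r
# The pattern from the answer as an R string: backslashes are doubled, and the
# braces are escaped so that they are unambiguously literal.
pattern <- "(?<!^title = )\\{|\\}(?!,$)"

# Hypothetical helper name, used here only for illustration.
remove_inner_braces <- function(x) {
  # perl = TRUE is required: R's default regex engine has no lookarounds.
  gsub(pattern, "", x, perl = TRUE)
}

line <- "title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},"
cleaned <- remove_inner_braces(line)
print(cleaned)
```

Because the lookbehind is tied to the literal text 'title = ', other fields (or a title line with different spacing or leading indentation) would lose their outer opening brace too, so the pattern would need adapting for them.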

CodePudding user response:

Here's a parser that removes just { and }, and only when inside a complete set of { ... }. It doesn't pretend to be fast or efficient, but with reasonable-length strings, you shouldn't notice any lag.

func <- function(S) {
  spl <- strsplit(S, "")[[1]]
  out <- character(0)
  inbrace <- 0L
  for (i in seq_along(spl)) {
    ch <- spl[i]
    if (ch == "{") {
      if (inbrace < 1L) out <- c(out, ch)
      inbrace <- inbrace + 1L
    } else if (ch == "}") {
      if (inbrace == 0L) {
        stop("unmatched close brace at: ", i)
      } else if (inbrace == 1L) {
        out <- c(out, ch)
      }
      inbrace <- max(0L, inbrace - 1L)
    } else out <- c(out, ch)
  }
  if (inbrace != 0L) stop("finished missing ", inbrace, " close-brace(s)")
  paste(out, collapse = "")
}

Demo:

func('title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},')
# [1] "title = {Novel breeding habitat, oviposition microhabitat, and parental care in Bokermannohyla caramaschii (Anura: Hylidae) in southeastern Brazil},"

It tries to be very specific, failing if either an unmatched } occurs or if the input ends while a { remains unmatched.

func('title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil},')
# Error in func("title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil},") : 
#   finished missing 1 close-brace(s)

func('title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla}} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},')
# Error in func("title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla}} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},") : 
#   unmatched close brace at: 156
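The same depth-counting idea can also be written without an explicit loop. The following is only a sketch, not the answer's own code: the name strip_nested is hypothetical, and it takes a single string just as func does. It tracks the brace depth after every character with cumsum() and keeps a brace only when it belongs to the outermost pair.

```r
# Hypothetical loop-free variant of the depth-counting parser.
strip_nested <- function(s) {
  ch <- strsplit(s, "")[[1]]
  open <- ch == "{"
  close <- ch == "}"
  # Brace depth immediately after each character has been read.
  depth <- cumsum(open) - cumsum(close)
  if (any(depth < 0L)) {
    stop("unmatched close brace at: ", which(depth < 0L)[1])
  }
  if (sum(open) != sum(close)) {
    stop("finished missing ", sum(open) - sum(close), " close-brace(s)")
  }
  # Drop a "{" that opens depth 2 or deeper, and a "}" that closes back
  # to depth 1 or deeper; everything else is kept.
  keep <- !(open & depth > 1L) & !(close & depth > 0L)
  paste(ch[keep], collapse = "")
}

strip_nested("title = {Parental care in {Bokermannohyla} ({Anura})},")
```

Like func, it stops with an error when a close brace has no matching open brace or when the input ends with braces still open.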

CodePudding user response:

Here's a strategy that works if we can be sure that the "%%%" and "###" strings are not going to be present in the titles. First we change the first "{" to "%%%" and the last "}" to "###". Then remove all "{" and "}", and then put the first "{" and last "}" back in.

txt <- "title = {Novel breeding habitat, oviposition microhabitat, and parental care in {Bokermannohyla} caramaschii ({Anura}: {Hylidae}) in southeastern {Brazil}},"
txt2 <- sub("(^[^{]+)(\\{)", "\\1%%%", txt) # placeholder for first "{"
txt3 <- sub("(\\})([^}]*$)", "###\\2", txt2) #  "    "     for last "}"
txt4 <- gsub("\\{|\\}", "", txt3) # remove the rest
txt5 <- sub("%%%", "{", txt4) # put the leading and trailing ones back
txt6 <- sub("###", "}", txt5)
txt6
[1] "title = {Novel breeding habitat, oviposition microhabitat, and parental care in Bokermannohyla caramaschii (Anura: Hylidae) in southeastern Brazil},"
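If relying on placeholder strings is a concern, the first-brace/last-brace idea might also be sketched with capture groups instead. This is a variant under stated assumptions rather than the answer's method: the helper name strip_between_outer is hypothetical, each element of the input is assumed to hold one complete field, and no check is made that the braces are balanced.

```r
# Hypothetical helper: keep the first "{" and the last "}", drop every brace
# that sits between them. Works on a character vector, one field per element.
strip_between_outer <- function(x) {
  # Group 1: everything up to and including the first "{"
  # Group 2: the text between the first "{" and the last "}"
  # Group 3: the last "}" and whatever follows it
  parts <- regmatches(x, regexec("^([^{]*\\{)(.*)(\\}[^}]*)$", x))
  vapply(seq_along(x), function(i) {
    p <- parts[[i]]
    if (length(p) == 0L) return(x[i])  # no brace pair found: leave unchanged
    paste0(p[2], gsub("[{}]", "", p[3]), p[4])
  }, character(1))
}

fields <- c("title = {Parental care in {Bokermannohyla} ({Anura})},",
            "year = 2020,")
strip_between_outer(fields)
```

Elements without an outer brace pair are returned unchanged, so the helper can be applied to a mix of braced and unbraced field lines.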