Python regex: capture selected content inside curly braces, including nested braces and \n characters

The best explanation is a minimal representative example (it is a .bib file, for those who know LaTeX). Here is the raw input text:

text = """
@book{book1,
  title={tit1},
  author={aut1}
}
@article{art2,
  title={tit2},
  author={aut2}
}
@article{art3,
  title={tit3},
  author={aut3}
}
"""

And here is my (failed) attempt to extract the content inside the curly braces for the @article entries only. Note that the content contains newlines (\n), which I also want to capture.

import re

regexpresion = r'\@article\{[.*\n]+\}'
result       = re.findall(regexpresion, text)

This is what I want to obtain:

>>> result
['art2,\ntitle={tit2},\nauthor={aut2}', 'art3,\ntitle={tit3},\nauthor={aut3}']

Many thanks for your help.

CodePudding user response：

You can use a two-step approach: first match the parts that start with @article, then remove the parts that you don't want in the result.

The pattern to match all the parts:

^@article{.*(?:\n(?!@\w+{).*)+(?=\n}$)

Explanation

^ Start of a line (with re.M)
@article{.* Match @article{ and the rest of the line
(?: Non-capturing group
- \n(?!@\w+{).* Match a newline and the rest of the line, if that line does not start with @, 1+ word characters and {
)+ Close the non-capturing group and repeat it 1+ times to match all lines
(?=\n}$) Positive lookahead to assert a newline and } at the end of a line

The pattern used in the replacement matches either @article{ or (using the alternation |) one or more whitespace characters other than a newline, directly after a newline.

@article{|(?<=\n)[^\S\n]+

Example

import re

pattern = r"^@article{.*(?:\n(?!@\w+{).*)+(?=\n}$)"

s = ("@book{book1,\n"
     "  title={tit1},\n"
     "  author={aut1}\n"
     "}\n"
     "@article{art2,\n"
     "  title={tit2},\n"
     "  author={aut2}\n"
     "}\n"
     "@article{art3,\n"
     "  title={tit3},\n"
     "  author={aut3}\n"
     "}")

# Step 1: find every @article block; step 2: strip the prefix and the indentation.
res = [re.sub(r"@article{|(?<=\n)[^\S\n]+", "", m) for m in re.findall(pattern, s, re.M)]
print(res)

Output

['art2,\ntitle={tit2},\nauthor={aut2}', 'art3,\ntitle={tit3},\nauthor={aut3}']
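
The lookahead pattern above relies on the closing } sitting alone on its own line. As an alternative for deeper brace nesting, here is a minimal sketch of a scanner that counts brace depth instead of using a regex. The helper name article_bodies and its entry parameter are hypothetical, and the sketch assumes braces are balanced and never escaped (no \{ or \}):

```python
def article_bodies(text, entry="@article"):
    """Return the body of each `entry{...}` block (hypothetical helper)."""
    marker = entry + "{"
    bodies = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break
        i = start + len(marker)      # first character inside the outer braces
        body_start = i
        depth = 1                    # we are inside the outer { already
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break                    # unbalanced braces: stop rather than guess
        body = text[body_start:i - 1]
        # Strip indentation and drop empty lines, as in the desired output.
        lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
        bodies.append("\n".join(lines))
        pos = i
    return bodies


sample = (
    "@book{book1,\n  title={tit1}\n}\n"
    "@article{art2,\n  title={A {nested} title},\n  author={aut2}\n}\n"
)
print(article_bodies(sample))
# ['art2,\ntitle={A {nested} title},\nauthor={aut2}']
```

The match is case-sensitive, so entries written as @Article would need the text or the marker to be normalised first.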

CodePudding user response：

Try this:

results = re.findall(r'{(.*?)}', text)

The output is as follows:

['tit1', 'aut1', 'tit2', 'aut2', 'tit3', 'aut3']
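
That pattern returns the innermost field values of every entry, @book included. A minimal sketch of narrowing it to @article entries, assuming each entry closes with a } at the start of a line as in the sample (the names entry_re and fields are only illustrative):

```python
import re

text = """
@book{book1,
  title={tit1},
  author={aut1}
}
@article{art2,
  title={tit2},
  author={aut2}
}
@article{art3,
  title={tit3},
  author={aut3}
}
"""

# Step 1: capture the body of each @article entry, up to the closing brace
# at the start of a line (re.M for ^, re.S so that . also matches newlines).
entry_re = re.compile(r"^@article\{(.*?)^\}", re.M | re.S)

# Step 2: apply the field pattern inside each captured body only.
fields = [re.findall(r"\{(.*?)\}", body) for body in entry_re.findall(text)]
print(fields)
# [['tit2', 'aut2'], ['tit3', 'aut3']]
```

This gives the field values per entry rather than the raw block the question asks for, so it is a complement to the two-step approach, not a replacement.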

CodePudding user response：

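Here is my solution. I hope someone finds a more elegant regular expression than mine:

regexpression = r'\@article\{\s+\w+\=\{.*?\},\n\s+\w+\=\{.*?\}'

Note that it expects the first field to follow @article{ directly; when a citation key such as art2 comes first, as in the sample, \w+,\n has to be added after \{.

Explanatory breakdown of regexpression:

r'\@article\{        # catches the article field
\s+\w+\=\{.*?\},\n   # title sub-field
\s+\w+\=\{.*?\}      # author sub-field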