regex python catch selective content inside curly braces, including curly sublevels and \n chars



The best explanation is a minimal representative example (it is a .bib file, for those who know LaTeX). Here is the representative raw input text:

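text = """
@book{book1,
  title={tit1},
  author={aut1}
}
@article{art2,
  title={tit2},
  author={aut2}
}
@article{art3,
  title={tit3},
  author={aut3}
}
"""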

And here is my attempt (which failed) to extract the content inside the curly braces, only for the @article entries. Note that there are \n line breaks inside the braces that I also want to capture.

regexpresion = r'\@article\{[.*\n]+\}'
result       = re.findall(regexpresion, text)

And this is what I actually want to obtain:

>>> result
['art2,\ntitle={tit2},\nauthor={aut2}', 'art3,\ntitle={tit3},\nauthor={aut3}']

Many thanks for sharing your expertise.

CodePudding user response:

You might use a 2-step approach: first match the parts that start with @article, and then, in the second step, remove the parts that you don't want in the result.

The pattern to match all the parts:

^@article{.*(?:\n(?!@\w+{).*)+(?=\n}$)


  • ^ Start of the line (the pattern is used with re.M)
  • @article{.* Match @article{ and the rest of the line
  • (?: Non-capture group
    • \n(?!@\w+{).* Match a newline and the rest of the line if it does not start with @, 1 or more word chars and {
  • )+ Close the non-capture group and repeat it 1 or more times to match all lines
  • (?=\n}$) Positive lookahead to assert a newline and } at the end of the line

See the matches on regex101.
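As a minimal sketch of this first step on its own, the snippet below runs the pattern against a sample string that is assumed to mirror the .bib text from the question (the names `sample` and `raw_matches` are illustrative). Each raw match still carries the `@article{` prefix and the indentation, which is why a second clean-up step is needed.

```python
import re

# Sample input, assumed to mirror the .bib snippet from the question.
sample = (
    "@book{book1,\n"
    "  title={tit1},\n"
    "  author={aut1}\n"
    "}\n"
    "@article{art2,\n"
    "  title={tit2},\n"
    "  author={aut2}\n"
    "}\n"
    "@article{art3,\n"
    "  title={tit3},\n"
    "  author={aut3}\n"
    "}"
)

# re.M makes ^ and $ match at line boundaries, not only at the string ends.
pattern = r"^@article{.*(?:\n(?!@\w+{).*)+(?=\n}$)"

# Step 1 only: every match runs from "@article{" up to, but not including,
# the newline before the closing brace of that entry.
raw_matches = re.findall(pattern, sample, re.M)
for m in raw_matches:
    print(repr(m))
```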

The pattern used in the replacement matches either @article{ or (using the pipe char |) 1 or more spaces that follow a newline.



import re

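# re.M is passed below so that ^ and $ match at line boundaries
pattern = r"^@article{.*(?:\n(?!@\w+{).*)+(?=\n}$)"

s = ("@book{book1,\n"
            "  title={tit1},\n"
            "  author={aut1}\n"
            "}\n"
            "@article{art2,\n"
            "  title={tit2},\n"
            "  author={aut2}\n"
            "}\n"
            "@article{art3,\n"
            "  title={tit3},\n"
            "  author={aut3}\n"
            "}")

# Step 2: strip the "@article{" prefix and the indentation after each newline
res = [re.sub(r"@article{|(?<=\n)[^\S\n]+", "", m) for m in re.findall(pattern, s, re.M)]
print(res)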


['art2,\ntitle={tit2},\nauthor={aut2}', 'art3,\ntitle={tit3},\nauthor={aut3}']

CodePudding user response:

Try this:

results = re.findall(r'{(.*?)}', text)

The output is the following:

['tit1', 'aut1', 'tit2', 'aut2', 'tit3', 'aut3']
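Note that this output contains the inner fields of every entry, including the @book one, rather than the whole body of each @article entry. The sketch below, which assumes a sample string shaped like the .bib text in the question (`sample`, `inner_fields` and `across_lines` are illustrative names), shows why: by default `.` does not cross newlines, and with re.S the lazy match simply stops at the first closing brace of a nested field.

```python
import re

# Sample input, assumed to mirror the .bib snippet from the question.
sample = (
    "@book{book1,\n"
    "  title={tit1},\n"
    "  author={aut1}\n"
    "}\n"
    "@article{art2,\n"
    "  title={tit2},\n"
    "  author={aut2}\n"
    "}\n"
    "@article{art3,\n"
    "  title={tit3},\n"
    "  author={aut3}\n"
    "}"
)

# Default mode: "." stops at newlines, so only single-line {...} pairs match,
# and entries of every type (including @book) contribute.
inner_fields = re.findall(r"{(.*?)}", sample)

# With re.S, "." also matches newlines; the lazy match then ends at the first
# closing brace it meets, which belongs to the nested title field.
across_lines = re.findall(r"{(.*?)}", sample, re.S)

print(inner_fields)
print(across_lines)
```

Neither variant is selective for @article or aware of nested braces, so it does not produce the result requested in the question.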

CodePudding user response:

Here is my solution. I hope someone finds a more elegant regular expression than mine:

regexpression = r'\@article\{\s+\w+\=\{.*?\},\n\s+\w+\=\{.*?\}'

Explanatory breakdown of regexpression:

r'\@article\{          # catches the article field
\s+\w+\=\{.*?\},\n     # title sub-field
\s+\w+\=\{.*?\}'       # author sub-field
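As written, the pattern above expects whitespace directly after `@article{`, so it would not match entries where a key such as `art2,` follows the brace. The following is only a sketch of one possible variant, not the poster's exact code: it assumes each entry has a key followed by exactly two sub-fields, adds a capture group so the `@article{` prefix is dropped, and strips indentation afterwards. The names `sample`, `entry_pattern` and `result` are illustrative.

```python
import re

# Sample input, assumed to mirror the .bib snippet from the question.
sample = (
    "@book{book1,\n"
    "  title={tit1},\n"
    "  author={aut1}\n"
    "}\n"
    "@article{art2,\n"
    "  title={tit2},\n"
    "  author={aut2}\n"
    "}\n"
    "@article{art3,\n"
    "  title={tit3},\n"
    "  author={aut3}\n"
    "}"
)

# re.VERBOSE allows whitespace and comments inside the pattern itself.
entry_pattern = re.compile(r"""
    @article\{            # start of an @article entry
    (                     # capture only the part we want to keep
      \w+,                # entry key followed by a comma (assumed format)
      \s+\w+=\{.*?\},\n   # first sub-field (title in the sample)
      \s+\w+=\{.*?\}      # second sub-field (author in the sample)
    )
""", re.VERBOSE)

# Remove the spaces or tabs that follow each newline inside a captured entry.
result = [re.sub(r"\n[^\S\n]+", "\n", body) for body in entry_pattern.findall(sample)]
print(result)
```

This variant is tied to a fixed number of sub-fields; the line-based pattern from the first answer is more flexible when entries have a varying number of fields.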