Python regex: capture selected content inside curly braces, including nested curly levels
The best explanation is a minimal representative example (it is a .bib file, for those who know LaTeX). Here is the representative raw input text:
text = """
@book{book1,
title={tit1},
author={aut1}
}
@article{art2,
title={tit2},
author={aut2}
}
@article{art3,
title={tit3},
author={aut3}
}
"""
Here is my (failed) attempt to extract the content inside the curly braces, only for the @article entries. Note that the content contains newlines (\n), which I also want to capture.
regexpresion = r'\@article\{[.*\n]+\}'
result = re.findall(regexpresion, text)
This is what I actually want to obtain:
>>> result
['art2,\ntitle={tit2},\nauthor={aut2}', 'art3,\ntitle={tit3},\nauthor={aut3}']
Many thanks for sharing your experience.
CodePudding user response:
You might use a 2 step approach, first matching the parts that start with @article, and then in the second step remove the parts that you don't want in the result.
The pattern to match all the parts:
^@article{.*(?:\n(?!@\w+{).*)+(?=\n}$)
Explanation
^  Start of a line (the example uses re.M, so ^ and $ match at line boundaries)
@article{.*  Match @article{ and the rest of the line
(?:  Non capture group
\n(?!@\w+{).*  Match a newline and the rest of the line, if the line does not start with @, 1+ word chars and {
)+  Close the non capture group and repeat it 1+ times to match all lines
(?=\n}$)  Positive lookahead to assert a newline and } at the end of a line
See the matches on regex101.
The pattern used in the replacement matches either @article{ or (using the pipe char |) one or more whitespace characters other than a newline that directly follow a newline:
@article{|(?<=\n)[^\S\n]+
Example
import re
pattern = r"^@article{.*(?:\n(?!@\w+{).*)+(?=\n}$)"
s = ("@book{book1,\n"
" title={tit1},\n"
" author={aut1}\n"
"}\n"
"@article{art2,\n"
" title={tit2},\n"
" author={aut2}\n"
"}\n"
"@article{art3,\n"
" title={tit3},\n"
" author={aut3}\n"
"}")
res = [re.sub(r"@article{|(?<=\n)[^\S\n]+", "", m) for m in re.findall(pattern, s, re.M)]
print(res)
Output
['art2,\ntitle={tit2},\nauthor={aut2}', 'art3,\ntitle={tit3},\nauthor={aut3}']
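The same two steps can be wrapped in a small helper so that the entry type is a parameter. This is only a sketch: the function name extract_entries and its entry_type argument are made up for illustration, and it assumes, like the pattern above, that each entry starts at the beginning of a line and that its closing } sits alone on its own line.

```python
import re

def extract_entries(bib_text, entry_type="article"):
    """Return the body of every @<entry_type>{...} entry as a string,
    without the '@<entry_type>{' prefix, the closing '}' line,
    or the indentation at the start of each line."""
    kind = re.escape(entry_type)
    # Step 1: match from '@<entry_type>{' up to, but not including,
    # the newline and '}' that close the entry.
    match_pattern = r"^@" + kind + r"\{.*(?:\n(?!@\w+\{).*)+(?=\n\}$)"
    # Step 2: remove the prefix and any indentation that follows a newline.
    strip_pattern = r"@" + kind + r"\{|(?<=\n)[^\S\n]+"
    return [re.sub(strip_pattern, "", m)
            for m in re.findall(match_pattern, bib_text, re.M)]

sample = (
    "@book{book1,\n"
    "  title={tit1},\n"
    "  author={aut1}\n"
    "}\n"
    "@article{art2,\n"
    "  title={tit2},\n"
    "  author={aut2}\n"
    "}\n"
    "@article{art3,\n"
    "  title={tit3},\n"
    "  author={aut3}\n"
    "}\n"
)

articles = extract_entries(sample)
books = extract_entries(sample, entry_type="book")
print(articles)
print(books)
```

Passing the entry type through re.escape keeps the helper safe if the type ever contains a character that is special in a regex.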
CodePudding user response:
Try this:
results = re.findall(r'{(.*?)}', text)
The output is the following:
['tit1', 'aut1', 'tit2', 'aut2', 'tit3', 'aut3']
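Note that the pattern above returns the innermost fields of every entry, @book included, and with a value such as {A {Nested} title} the non-greedy match would stop at the first closing brace. Python's built-in re module has no recursive patterns, so one possible alternative for arbitrarily nested braces is to count them by hand. The following is a minimal sketch, not a full .bib parser: the name extract_braced and its marker argument are hypothetical, and it assumes that braces are balanced and never escaped.

```python
def extract_braced(text, marker="@article{"):
    """Collect the text between each `marker` and its matching closing
    brace, counting nested braces so inner {...} groups stay intact."""
    results = []
    pos = text.find(marker)
    while pos != -1:
        start = pos + len(marker)
        depth = 1  # the brace that ends `marker` is already open
        i = start
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: stop rather than guess
        # text[i - 1] is the matching closing brace, so exclude it,
        # then drop the surrounding whitespace and newlines.
        results.append(text[start:i - 1].strip())
        pos = text.find(marker, i)
    return results

nested = (
    "@book{book1,\n"
    "title={tit1},\n"
    "author={aut1}\n"
    "}\n"
    "@article{art2,\n"
    "title={A {Nested} title},\n"
    "author={aut2}\n"
    "}\n"
)

found = extract_braced(nested)
print(found)
```

Unlike the regex-based answer, this sketch leaves any indentation inside the entry untouched. It only trims whitespace at the two ends of each result.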
CodePudding user response:
Here is my solution. I hope someone finds a more elegant regular expression than mine:
regexpression = r'\@article\{\s+\w+\=\{.*?\},\n\s+\w+\=\{.*?\}'
Explanatory breakdown of the regular expression:
r'\@article\{ # catches the article field
\s+\w+\=\{.*?\},\n # title sub-field
\s+\w+\=\{.*?\} # author sub-field