Regex: capturing all text between multiple curly braces and anything following the last curly brace

I have text lines such as the following:

() \\span{figurato} di \\span{qualcuno} scream loudly

from which I need to capture the annotations "figurato" and "qualcuno", and also the string "scream loudly". In other words, I need to capture each term enclosed in curly braces (the annotations, of which there can be anywhere from 1 to N), plus one string containing whatever follows the last closing curly brace.

I have a regex that works well for the first task:

{(?P<annotation>.+?)}

I also have a regex for the second task:

[^}]+$

The current python code that works is:

import re

def _scanGloss(gloss: str) -> dict:
    return {"gloss": re.search(r"[^}]+$", gloss), "annotations": re.findall(r"{(?P<annotation>.+?)}", gloss)}

where gloss is the input line, but I can't find a way to do all of this with just one regex. Is it possible?

As a side issue, I'm not able to use parentheses to define a capture group in the second pattern, but this is less important.

Thank you

CodePudding user response：

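Explanation:

The pattern first matches the literal text \span{ (written \\span{ in the regex).
The first named group, annotation, is (?P<annotation>[^}]+): one or more characters other than }.
After the closing brace, \s+ skips the whitespace.
The second named group, gloss, is (?P<gloss>(?:\s*[\w]+)*): words separated by optional whitespace, with no space at the end.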
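The pieces listed above can also be laid out one per line using re.VERBOSE, which lets each part of the pattern carry its own comment. This is only a sketch for readability: the names PATTERN and sample are chosen here for illustration, and the braces are escaped for clarity even though Python's re module accepts them unescaped in this position.

```python
import re

PATTERN = re.compile(
    r"""
    \\span\{                  # a literal backslash, then span and an opening brace
    (?P<annotation>[^}]+)     # annotation: one or more characters other than a closing brace
    \}                        # the closing brace
    \s+                       # whitespace after the brace
    (?P<gloss>(?:\s*\w+)*)    # gloss: words separated by optional whitespace
    """,
    re.VERBOSE,
)

# Raw string, so each \span contains a single backslash.
sample = r"\span{figurato} di \span{qualcuno} scream loudly"

matches = [m.groupdict() for m in PATTERN.finditer(sample)]
print(matches)
```

Note that \s+ requires at least one whitespace character after the closing brace, so a \span{...} at the very end of a line would not match with this pattern.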

import re

# Named "text" rather than "str" to avoid shadowing the built-in type.
text = '\\span{figurato} di \\span{qualcuno} scream loudly'

regex = re.compile(r"\\span{(?P<annotation>[^}]+)}\s+(?P<gloss>(?:\s*[\w]+)*)")

print([m.groupdict() for m in regex.finditer(text)])

Output:

[{'annotation': 'figurato', 'gloss': 'di'}, {'annotation': 'qualcuno', 'gloss': 'scream loudly'}]
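To get back the shape the question asked for (a list of annotations plus the one string after the last brace), the matches can be post-processed: collect every annotation and keep only the gloss of the final match. The following is a minimal sketch under that assumption; scan_gloss and SPAN_RE are hypothetical names, not part of the original code.

```python
import re

# Same pattern as in the answer above; SPAN_RE is an illustrative name.
SPAN_RE = re.compile(r"\\span{(?P<annotation>[^}]+)}\s+(?P<gloss>(?:\s*[\w]+)*)")

def scan_gloss(line: str) -> dict:
    """Return all annotations and the text captured after the last one."""
    matches = list(SPAN_RE.finditer(line))
    return {
        "annotations": [m.group("annotation") for m in matches],
        # The gloss of the last match is the text following the last brace;
        # None signals that the line contained no \span{...} at all.
        "gloss": matches[-1].group("gloss") if matches else None,
    }

result = scan_gloss('\\span{figurato} di \\span{qualcuno} scream loudly')
print(result)
```

Because the gloss group is built from \w, it stops at the first character that is neither a word character nor whitespace, so trailing text containing punctuation would be cut short. If that matters, the group would need a broader character class.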