I have text lines such as the following:
() \\span{figurato} di \\span{qualcuno} scream loudly
from which I need to capture the annotations "figurato" and "qualcuno", and also the string "scream loudly". In other words, I need to capture each term enclosed in curly braces (the annotations, of which there may be anywhere from 1 to N), plus one string containing whatever follows the last closing curly brace.
I have a regex that works well for the first task:
{(?P<annotation>.+?)}
I also have a regex for the second task:
[^}]+$
The current python code that works is:
import re
def _scanGloss(gloss: str) -> dict:
    return {"gloss": re.search(r"[^}]+$", gloss), "annotations": re.findall(r"{(?P<annotation>.+?)}", gloss)}
where gloss is the input line, but I have not managed to find a way to do all of this in just one regex. Is it possible?
As a side issue, I'm not able to use parentheses to define a capture group in the second pattern, but this is less important.
Thank you
CodePudding user response:
- We find the literal \\span{
- First named group annotation: (?P<annotation>[^}]+) (one or more characters other than }), followed by the closing }
- Skip spaces: \s+
- Second named group gloss: (?P<gloss>(?:\s*[\w]+)*) (here we are looking for spaces plus words, with no space at the end)
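import re
# Avoid naming the variable "str", which would shadow the built-in type
text = '\\span{figurato} di \\span{qualcuno} scream loudly'
# One match per \span{...}: the annotation plus the words that follow it
regex = re.compile(r"\\span{(?P<annotation>[^}]+)}\s+(?P<gloss>(?:\s*[\w]+)*)")
[m.groupdict() for m in regex.finditer(text)]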
[ {'annotation': 'figurato', 'gloss': 'di'},
{'annotation': 'qualcuno', 'gloss': 'scream loudly'} ]
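To get back the dictionary shape asked for in the question, the per-match results above can be folded into a single function. The following is only a sketch under a few assumptions: the names SPAN_RE and scan_gloss are made up for illustration, the final gloss is taken to be the gloss group of the last match, and \s+ is relaxed to \s* so that an annotation with nothing after it is still captured.

```python
import re

# Pattern from the answer, with one assumed change: \s* instead of \s+,
# so an annotation at the very end of the line still matches.
SPAN_RE = re.compile(r"\\span{(?P<annotation>[^}]+)}\s*(?P<gloss>(?:\s*\w+)*)")


def scan_gloss(line: str) -> dict:
    """Collect every annotation and the text following the last one."""
    matches = list(SPAN_RE.finditer(line))
    return {
        # One entry per \span{...} found in the line
        "annotations": [m.group("annotation") for m in matches],
        # The words captured after the last annotation, or "" if none matched
        "gloss": matches[-1].group("gloss") if matches else "",
    }


result = scan_gloss(r"() \span{figurato} di \span{qualcuno} scream loudly")
print(result)
```

Because the gloss group only accepts word characters and whitespace, trailing text that contains punctuation would be cut at the first such character; widening that character class is left as a choice for the real data.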