Find indices of target words without the surrounding brackets

I have a set of sentences in which target words are surrounded by brackets/braces/parentheses, and some of these are overlapping or nested. I want to extract the target words (target["text"]) together with their correct indices in the sentence once the brackets/braces/parentheses are removed. So far I have managed to extract them with the delimiters still included:

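[in]: import regex as re  # third-party "regex" module; the standard re module does not accept overlapped=True
[in]: sentence = "{ia} ({fascia} antebrachii). Genom att aponeurosen fäster i armb"
[in]: pattern = r'{[^{}]+}|\[[^\[\]]+\]|\([^\(\)]+\)'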
[in]: targets = [
        {
             "start": m.start(0),
             "end": m.end(0),
             "text": sentence[m.start(0) : m.end(0)],
         }
         for m in re.finditer(pattern, sentence, overlapped=True)
         ]
[in]: targets
[out]: [{'start': 0, 'end': 4, 'text': '{ia}'},
        {'start': 5, 'end': 27, 'text': '({fascia} antebrachii)'},
        {'start': 6, 'end': 14, 'text': '{fascia}'}]

Now I want to remove the brackets/braces/parentheses from each target["text"] and find the correct indices of these targets in the sentence without the brackets/braces/parentheses. Because of the overlapping brackets, I am having trouble identifying the correct indices. The code below only works for non-overlapping brackets:

[in]: [
          {
              "start": targets[i]["start"]-2*(i) if targets[i]["start"] > 0 else 0,
              "end": targets[i]["end"]-2*(i+1),
              "text": re.sub(r"[\[{(\]})]", "", targets[i]["text"]),
          }
          for i in range(len(targets))
      ]
[out]: [{'start': 0, 'end': 2, 'text': 'ia'},
        {'start': 3, 'end': 23, 'text': 'fascia antebrachii'},
        {'start': 2, 'end': 8, 'text': 'fascia'}]

What would be the recommended approach here? Thanks!

Expected output:

[{'start': 0, 'end': 2, 'text': 'ia'},
 {'start': 3, 'end': 21, 'text': 'fascia antebrachii'},
 {'start': 3, 'end': 9, 'text': 'fascia'}]

CodePudding user response：

Given your sentence and your pattern:

sentence = "{ia} ({fascia} antebrachii). Genom att aponeurosen fäster i armb"
pattern = r'{[^{}]+}|\[[^\[\]]+\]|\([^\(\)]+\)'

and given that your delimiters are braces, brackets and parentheses.

You can do the following:

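# overlapped=True is a feature of the third-party regex module, not the standard re
import regex as re

# extract your matches from the sentence
matches = re.findall(pattern, sentence, overlapped=True)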

# clean the matches from the delimiters
words = [re.sub(r'[{}\[\]\(\)]', '', m) for m in matches]

# clean your sentence from the delimiters
clean_sent = re.sub(r'[{}\[\]\(\)]', '', sentence)

# search for each clean word in the clean sentence
# (re.escape keeps regex metacharacters in a word from being interpreted)
targets = [{
    "start": m.start(2),
    "end": m.end(2),
    "text": clean_sent[m.start(2) : m.end(2)],
} for m in map(lambda word: re.search(rf'(^|[^\w]+)({re.escape(word)})($|[^\w]+)', clean_sent), words)]

Side note on the last search pattern, (^|[^\w]+)({word})($|[^\w]+). It matches the target word ({word}) only when it is found:

after the start of the string or a run of non-word characters (^|[^\w]+)
before the end of the string or a run of non-word characters ($|[^\w]+)

match.start and match.end are called with 2 as the argument because we want the start and end index of the second group, i.e. the word itself without the surrounding characters.
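
To see why group 2 is the one to read, here is a minimal sketch that uses only the standard re module. The helper name locate is hypothetical, and re.escape is added on the assumption that a target word may contain regex metacharacters.

```python
import re

clean_sent = "ia fascia antebrachii. Genom att aponeurosen fäster i armb"

def locate(word, text):
    """Return start/end of the first whole-word occurrence of word, or None."""
    m = re.search(rf'(^|[^\w]+)({re.escape(word)})($|[^\w]+)', text)
    if m is None:
        return None
    # Group 0 also covers the neighbouring non-word characters,
    # so the indices of the word itself come from group 2.
    return {"start": m.start(2), "end": m.end(2), "text": m.group(2)}

print(locate("fascia", clean_sent))  # {'start': 3, 'end': 9, 'text': 'fascia'}
print(locate("arm", clean_sent))     # None, because "arm" only occurs inside "armb"
```

Note that re.search returns only the first occurrence, so a target word that appears several times in the sentence would always get the indices of its first appearance.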

Does this solution help you?

EDIT: How to handle the case when words are near delimiters during sentence cleaning?

You can handle that edge case by adding a space between each delimiter and the word next to it before removing the delimiters.

# separate delimiters from adjacent words, then remove the delimiters
clean_sent = re.sub(r'(\w)([\(\[{])', '\\1 \\2', sentence)
clean_sent = re.sub(r'([\)\]}])(\w)', '\\1 \\2', clean_sent)
clean_sent = re.sub(r'[{}\[\]\(\)]' , ''       , clean_sent)

The first regex matches every opening delimiter preceded by a word character and replaces the pair with the same two characters separated by a space, using backreferences.

The second regex matches every closing delimiter followed by a word character and replaces the pair with the same two characters separated by a space, again using backreferences.

The third regex is the delimiter-removal step taken directly from the answer snippet.
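
Another way to handle words that touch a delimiter, without inserting spaces, is to skip the search in the cleaned sentence and compute each index arithmetically: the cleaned start is the original start minus the number of delimiters that precede it. This is a swapped-in technique, not the approach of the answer above. The name find_targets is hypothetical, and the lookahead (?=(...)) is used on the assumption that it is an acceptable stand-in for overlapped=True when only the standard re module is available.

```python
import re

DELIMS = "{}[]()"
PATTERN = r'\{[^{}]+\}|\[[^\[\]]+\]|\([^()]+\)'

def find_targets(sentence):
    """Return targets with indices relative to the delimiter-free sentence."""
    targets = []
    # A zero-width lookahead with a capturing group yields overlapping matches.
    for m in re.finditer(rf'(?=({PATTERN}))', sentence):
        start = m.start(1)
        text = re.sub(r'[{}\[\]()]', '', m.group(1))
        # Every delimiter before the match shifts the cleaned index left by one.
        clean_start = start - sum(ch in DELIMS for ch in sentence[:start])
        targets.append({
            "start": clean_start,
            "end": clean_start + len(text),
            "text": text,
        })
    return targets

sentence = "{ia} ({fascia} antebrachii). Genom att aponeurosen fäster i armb"
print(find_targets(sentence))
# [{'start': 0, 'end': 2, 'text': 'ia'},
#  {'start': 3, 'end': 21, 'text': 'fascia antebrachii'},
#  {'start': 3, 'end': 9, 'text': 'fascia'}]
```

Because the indices come from counting rather than searching, repeated words and words glued to a delimiter (for example "arm{bone}x") are located at their actual positions in the cleaned sentence.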