Calculate match index of regexp

I have a text file f, which contains specific spans tagged with <a> and </a>.

For example:

<a>George</a> and his <a>friends</a>, came back <a>home</a>.

I match these using re's finditer, but I would like to calculate the indexes they would have if the tags were not part of the text.

For example, as if the text were: George and his friends, came back home.

What I did was

import re
text = "<a>George</a> and his <a>friends</a>, came back <a>home</a>"
tags = re.finditer(r'(?<=<a>).*?(?=</a>)', text)

to get the tags and the start and end of the span, and then

opening = re.finditer(r"<a>", text)
opening = [(i.group(0), i.start(), i.end()) for i in opening]

closing = re.finditer(r"</a>", text)
closing = [(i.group(0), i.start(), i.end()) for i in closing]

to also obtain the indexes of the opening and closing tags of the spans.

How could I go about calculating the indexes the spans would have if the tags were not part of the text? I initially thought of subtracting 3 from the start and end of each span, but that would not work for the next span, since I would also need to account for the closing and opening tags that come before it (I think).

I cannot get the spans and then look for them in a "clean" text, because their position is specific, and I do not want to match multiple occurrences of the same word.

CodePudding user response：

Count the number of tag pairs already processed and subtract their lengths accordingly: 7 characters for each earlier <a></a> pair, plus the current opening tag.

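import re
text = "<a>George</a> and his <a>friends</a>, came back <a>home</a>"
tags = re.finditer(r'(?<=<a>).*?(?=</a>)', text)

num_tags = 0  # tag pairs seen before the current span
results = []
for tag in tags:
    # each earlier <a></a> pair adds 7 characters; the current <a> adds 3
    start_idx = tag.start() - 7*num_tags - 3
    # -4 gives the index of the span's last character (inclusive end)
    end_idx = tag.end() - 7*num_tags - 4
    num_tags += 1
    results.append((tag.group(0), start_idx, end_idx))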
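
One way to sanity-check this arithmetic is to strip the tags and confirm that each computed span slices back to its own word. The sketch below is only a check, not part of the answer itself: the `clean` variable and the use of `enumerate` for the counter are my additions, and it assumes the end index is inclusive (the last character of the span).

```python
import re

text = "<a>George</a> and his <a>friends</a>, came back <a>home</a>"
# The same text with every <a> and </a> removed.
clean = re.sub(r"</?a>", "", text)

results = []
# enumerate supplies the count of tag pairs seen before each match.
for num_tags, tag in enumerate(re.finditer(r"(?<=<a>).*?(?=</a>)", text)):
    start_idx = tag.start() - 7 * num_tags - 3
    end_idx = tag.end() - 7 * num_tags - 4  # inclusive end
    results.append((tag.group(0), start_idx, end_idx))

# Each span should slice back to its own word in the tag-free text.
for word, start_idx, end_idx in results:
    assert clean[start_idx:end_idx + 1] == word

print(results)
# [('George', 0, 5), ('friends', 15, 21), ('home', 34, 37)]
```

If you prefer an exclusive end, as returned by match.end(), subtract 3 instead of 4 and slice with clean[start_idx:end_idx].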

CodePudding user response：

Use a variable to hold the running offset from the start/end positions returned by the regexp, incrementing it by 7 for each match.

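import re
text = "<a>George</a> and his <a>friends</a>, came back <a>home</a>"
offset = 0  # total length of the tag pairs before the current match
tags = []
for match in re.finditer(r'(?<=<a>).*?(?=</a>)', text):
    # 3 is len("<a>") for the current opening tag
    start = match.start() - 3 - offset
    end = match.end() - 3 - offset
    offset += 7  # len("<a>") + len("</a>")
    tags.append((match.group(0), start, end))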
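
If you would rather not hard-code 3 and 7, a possible variation is to rebuild the tag-free text piece by piece and record each span's position as you go. This is a rough sketch, not a tested general solution: the function name `strip_tags_with_spans` and its `tag` parameter are made up for illustration, and it assumes the tags are never nested and carry no attributes.

```python
import re

def strip_tags_with_spans(text, tag="a"):
    """Return (clean_text, spans); spans are (word, start, end) in clean_text."""
    # Match a whole <tag>inner</tag> pair, capturing the inner text.
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    pieces = []
    spans = []
    clean_len = 0  # length of the tag-free text built so far
    last = 0       # position in `text` just after the previous match
    for m in pattern.finditer(text):
        before = text[last:m.start()]  # untagged text preceding this pair
        pieces.append(before)
        clean_len += len(before)
        inner = m.group(1)
        # end is exclusive, like match.end()
        spans.append((inner, clean_len, clean_len + len(inner)))
        pieces.append(inner)
        clean_len += len(inner)
        last = m.end()
    pieces.append(text[last:])  # trailing untagged text
    return "".join(pieces), spans

sample = "<a>George</a> and his <a>friends</a>, came back <a>home</a>."
clean_text, found = strip_tags_with_spans(sample)
for word, start, end in found:
    assert clean_text[start:end] == word
```

Because the positions are measured against the text as it is rebuilt, repeated words are not a problem: each span keeps the position of its own occurrence.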