I have a text file f, which contains tagged spans between <a> and </a> tags.
For example:
<a>George</a> and his <a>friends</a>, came back <a>home</a>.
I match these using re's finditer, but I would like to be able to calculate their indexes as if the tags were not part of the text.
For example, as they would appear in: George and his friends, came back home.
What I did was
import re
text = "<a>George</a> and his <a>friends</a>, came back <a>home</a>"
tags = re.finditer(r'(?<=<a>).*?(?=</a>)', text)
to get the tagged spans with their start and end positions, and then
opening = re.finditer(r"<a>", text)
opening = [(i.group(0), i.start(), i.end()) for i in opening]
closing = re.finditer(r"</a>", text)
closing = [(i.group(0), i.start(), i.end()) for i in closing]
to also obtain the indexes of the opening and closing tags of the spans.
How could I go about calculating the indexes of the spans as if the tags were not part of the text? I initially thought of subtracting 3 from both the start and the end of each span, but that would not work for the next span, since I would also need to subtract the length of the tags that precede it (I think).
I cannot get the spans and then look for them in a "clean" text, because their position is specific, and I do not want to match multiple occurrences of the same word.
CodePudding user response:
Count the number of tags processed and subtract the lengths accordingly.
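import re
text = "<a>George</a> and his <a>friends</a>, came back <a>home</a>"
tags = re.finditer(r'(?<=<a>).*?(?=</a>)', text)
num_tags = 0  # number of tagged spans already processed
results = []
for tag in tags:
    # each earlier span contributed 7 tag characters: len("<a>") + len("</a>");
    # the opening tag of the current span adds 3 more
    start_idx = tag.start() - 7*num_tags - 3
    # inclusive end: index of the span's last character (use - 3 for an exclusive end)
    end_idx = tag.end() - 7*num_tags - 4
    num_tags += 1
    results.append((tag.group(0), start_idx, end_idx))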
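As a quick sanity check of the arithmetic (a sketch only, not part of the original answer), the tags can be removed with re.sub and each computed span compared against the clean text. The helper name clean_spans is made up for illustration, and the end index here is exclusive so that slicing works directly.

```python
import re

def clean_spans(text):
    """Return (word, start, end) with indexes relative to the tag-free text."""
    spans = []
    for n, m in enumerate(re.finditer(r'(?<=<a>).*?(?=</a>)', text)):
        # n earlier spans removed 7 characters each; the current "<a>" removes 3
        shift = 7 * n + 3
        spans.append((m.group(0), m.start() - shift, m.end() - shift))
    return spans

text = "<a>George</a> and his <a>friends</a>, came back <a>home</a>"
clean = re.sub(r'</?a>', '', text)  # the text with all tags removed
spans = clean_spans(text)
for word, start, end in spans:
    # every span must slice the same word out of the clean text
    assert clean[start:end] == word
```

If an assertion fails, the offsets do not line up with the clean text, which usually means the tags are not all exactly <a> and </a>.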
CodePudding user response:
Use a variable to hold the running offset from the start/end positions returned by the regexp, incrementing it by 7 for each match.
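import re
text = "<a>George</a> and his <a>friends</a>, came back <a>home</a>"
offset = 0  # total length of the tags belonging to earlier spans
tags = []
for match in re.finditer(r'(?<=<a>).*?(?=</a>)', text):
    # subtract 3 for the current "<a>" plus the running offset
    start = match.start() - 3 - offset
    end = match.end() - 3 - offset  # exclusive, like match.end()
    offset += 7  # len("<a>") + len("</a>")
    tags.append((match.group(0), start, end))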
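If hard-coding 7 feels fragile, one possible variation (my own sketch, not taken from either answer) is to build the clean text while iterating and record each span's position in it as you go. The names TAGGED and strip_and_index are hypothetical, and this assumes the tags are not nested.

```python
import re

TAGGED = re.compile(r'<a>(.*?)</a>')

def strip_and_index(text):
    """Return (clean_text, spans); spans index into clean_text, end exclusive."""
    pieces, spans = [], []
    clean_len = 0  # length of the clean text built so far
    last = 0       # position in `text` just past the previous match
    for m in TAGGED.finditer(text):
        before = text[last:m.start()]  # untagged text between matches
        pieces.append(before)
        clean_len += len(before)
        word = m.group(1)              # span content without the tags
        spans.append((word, clean_len, clean_len + len(word)))
        pieces.append(word)
        clean_len += len(word)
        last = m.end()
    pieces.append(text[last:])         # trailing untagged text
    return ''.join(pieces), spans

clean, spans = strip_and_index(
    "<a>George</a> and his <a>friends</a>, came back <a>home</a>.")
```

Because the positions come from the length of the clean text itself, nothing depends on the tag length, so the same loop should keep working if the pattern is changed to a different tag name.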