I need to match and capture the information between the pairs of tags. There are 2 pairs of tags per line. A pair of tags is like this:
<a> </a> <b>hello hello 123</b> stuff to ignore here <i>123412bhje</i> <a>what???</a> stuff to ignore here <b>asd13asf</b> <i>who! Hooooo!</i> stuff to ignore here <i>df7887a</i>
The expected output is:
hello hello 123 123412bhje
what??? asd13asf
who! Hooooo! df7887a
I need to specifically use the format:
M = re.findall("", linein)
CodePudding user response:
In order to ignore the first <a> </a> tag, the regex assumes that the first character inside a tag is not a space, although spaces are allowed after it.
Here are the other assumptions made:
- Tag letters are lowercase, e.g. <b> </b> <i> </i>
- Information between tag pairs can only contain uppercase letters, lowercase letters, numbers, and the symbols ! and ?. If there are other symbols within the tags, then it may not match accurately.
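To make the character-set assumption concrete, here is a small sketch. The two sample strings are made up for illustration and are not from the question; the pattern follows the assumptions listed above.

```python
import re

# Pattern built from the assumptions above: lowercase tag names, contents
# limited to letters, digits, ! and ?, and whitespace only after the first character.
pattern = r'<[a-z]+>([A-Za-z0-9?!][A-Za-z0-9?!\s]*)</[a-z]+>'

# Hypothetical inputs, chosen only to illustrate the limits of the pattern.
allowed = re.findall(pattern, '<a> </a> <b>plain text 1</b>')
rejected = re.findall(pattern, '<b>semi-colon; here</b>')

print(allowed)   # ['plain text 1']  (the <a> </a> pair is skipped)
print(rejected)  # []  ('-' and ';' are outside the allowed set)
```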
Here is a working version based on your example:
import re
linein = '<a> </a> <b>hello hello 123</b> stuff to ignore here <i>123412bhje</i> <a>what???</a> stuff to ignore here <b>asd13asf</b> <i>who! Hooooo!</i> stuff to ignore here <i>df7887a</i>'
# Group 1 captures the tag contents; its first character must not be whitespace, which skips <a> </a>
M = re.findall(r'<[a-z]+>([A-Za-z0-9?!][A-Za-z0-9?!\s]*)</[a-z]+>', linein)
# Print the matches two at a time
for i in range(0, len(M), 2):
    print(M[i], M[i + 1])
OUTPUT:
hello hello 123 123412bhje
what??? asd13asf
who! Hooooo! df7887a
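
If the tag contents might include other symbols, one possible variation (an assumption on my part, not something the question asked for) is to accept any character except `<` instead of listing the allowed characters. This sketch still uses a single `re.findall` call; the name `pairs` is only for illustration.

```python
import re

linein = ('<a> </a> <b>hello hello 123</b> stuff to ignore here <i>123412bhje</i> '
          '<a>what???</a> stuff to ignore here <b>asd13asf</b> '
          '<i>who! Hooooo!</i> stuff to ignore here <i>df7887a</i>')

# [^<\s] : the first character is anything except '<' or whitespace (skips <a> </a>)
# [^<]*  : followed by anything up to the next '<'
M = re.findall(r'<[a-z]+>([^<\s][^<]*)</[a-z]+>', linein)

# Pair consecutive matches: (M[0], M[1]), (M[2], M[3]), ...
pairs = list(zip(M[::2], M[1::2]))
for first, second in pairs:
    print(first, second)
```

Note that this version does not check that the closing tag matches the opening one, and `zip` silently drops a trailing unpaired match, so it is only a starting point.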