Pattern Matching Tags with RegEx and Python (re.findall)

I need to match and capture the information between the pairs of tags. There are 2 pairs of tags per line. A pair of tags is like this:

<a> </a> <b>hello hello 123</b> stuff to ignore here <i>123412bhje</i> <a>what???</a> stuff to ignore here <b>asd13asf</b> <i>who! Hooooo!</i> stuff to ignore here <i>df7887a</i>

The expected output is:

hello hello 123 123412bhje 
what??? asd13asf 
who! Hooooo! df7887a

I need to specifically use the format:

M = re.findall("", linein)

CodePudding user response:

To ignore the first <a> </a> tag, the regex assumes that the first character inside a tag is not a space, although spaces are allowed after that.

Here are the other assumptions made:

Tag names consist of lowercase letters, e.g. <b> </b> <i> </i>
The information between a pair of tags can only contain uppercase letters, lowercase letters, digits, whitespace, and the symbols ! and ?. If any other symbol appears between the tags, that pair will not be matched.
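
As a rough illustration of these assumptions, the sketch below checks only the content part of the pattern against a few made-up strings. The names `content`, `samples`, and `results` are hypothetical and not part of the original answer.

```python
import re

# Content part only: the first character must not be whitespace,
# later characters may also be whitespace.
content = re.compile(r'[A-Za-z0-9?!][A-Za-z0-9?!\s]*')

# Made-up sample strings for illustration.
samples = [' ', 'hello hello 123', 'who! Hooooo!', 'a-b', 'x.y']

# True if the whole string fits the assumptions, False otherwise.
results = {s: content.fullmatch(s) is not None for s in samples}

for s, ok in results.items():
    print(repr(s), ok)
```

The lone space and the strings containing `-` or `.` are rejected, which is why the empty <a> </a> pair is skipped and why other symbols inside a tag would break the match.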

Here is a working version based on your example:

import re

linein = '<a> </a> <b>hello hello 123</b> stuff to ignore here <i>123412bhje</i> <a>what???</a> stuff to ignore here <b>asd13asf</b> <i>who! Hooooo!</i> stuff to ignore here <i>df7887a</i>'
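# Opening tag, content whose first character is not whitespace, closing tag.
M = re.findall(r'<[a-z]+>([A-Za-z0-9?!][A-Za-z0-9?!\s]*)</[a-z]+>', linein)

# Print the captured strings two per line.
for i in range(0, len(M), 2):
    print(M[i], M[i + 1])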

OUTPUT:

hello hello 123 123412bhje
what??? asd13asf
who! Hooooo! df7887a
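
If the assumptions above turn out to be too strict, one possible variation (not part of the original answer) is to accept any character except `<` between the tags and to require the closing tag to repeat the opening tag's name through a backreference. This is a minimal sketch, assuming tags are never nested and the number of non-empty tag pairs is even; the names `texts` and `pairs` are made up for illustration.

```python
import re

linein = ('<a> </a> <b>hello hello 123</b> stuff to ignore here <i>123412bhje</i> '
          '<a>what???</a> stuff to ignore here <b>asd13asf</b> '
          '<i>who! Hooooo!</i> stuff to ignore here <i>df7887a</i>')

# Group 1 captures the tag name and \1 requires the same name in the closing tag.
# Group 2 captures the content: first character is neither whitespace nor '<',
# the rest is anything except '<'.
M = re.findall(r'<([a-z]+)>([^<\s][^<]*)</\1>', linein)

# With two groups, findall returns (tag, content) tuples; keep only the content.
texts = [text for _tag, text in M]

# Pair consecutive items: (texts[0], texts[1]), (texts[2], texts[3]), ...
pairs = list(zip(texts[0::2], texts[1::2]))

for first, second in pairs:
    print(first, second)
```

Because `findall` returns tuples when a pattern has more than one group, the content has to be picked out of each tuple before pairing. Note that `zip` silently drops a trailing unpaired item, whereas indexing with M[i + 1] would raise an IndexError on an odd count.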