I have a string
s="<response>blabla
<head> blabla
<t> EXTRACT 1</t>
<t>EXTRACT 2</t>
</head>
<body> blabla
<t>BODY 1</t>
<t>BODY 2</t>
</response>"
I need to extract the text between the tags, but only if it is in the head part. I tried
regex="(?:<t>([\w.,_]*)*)</t>"
re.findall(regex,s)
but it fetches the body part too. I understand that I need to tell it to stop at the closing head tag, but I couldn't come up with a way to do that.
PS: The string is on a single line; I split it for better readability. I want to do this using regex, not an XML parser.
CodePudding user response:
You can find the header first:
import re
s = "<response>blabla <head> blabla <t> EXTRACT 1</t> <t>EXTRACT 2</t> </head> <body> blabla <t>BODY 1</t> <t>BODY 2</t> </response>"
pattern_head = "<head>(.*)</head>"
header = re.findall(pattern_head, s)
print(header)
This gives: [' blabla <t> EXTRACT 1</t> <t>EXTRACT 2</t> ']
Then get what you want from the head:
pattern = "<t>(.*?)</t>"
substring = re.findall(pattern,header[0])
print(substring)
>>> [' EXTRACT 1', 'EXTRACT 2']
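The two steps can also be wrapped in one small function. This is only a sketch: the name extract_head_items is made up for illustration, and it assumes the string is on a single line with at most one <head>...</head> pair. It uses re.search with a non-greedy group, so it returns an empty list instead of raising an IndexError when there is no head.

```python
import re

def extract_head_items(text):
    """Return the text of every <t>...</t> found inside <head>...</head>."""
    # Step 1: isolate the contents of the head section (non-greedy match).
    head = re.search(r"<head>(.*?)</head>", text)
    if head is None:
        return []  # no head section, so there is nothing to extract
    # Step 2: search for <t>...</t> only inside the captured head text.
    return re.findall(r"<t>(.*?)</t>", head.group(1))

s = "<response>blabla <head> blabla <t> EXTRACT 1</t> <t>EXTRACT 2</t> </head> <body> blabla <t>BODY 1</t> <t>BODY 2</t> </response>"
print(extract_head_items(s))  # [' EXTRACT 1', 'EXTRACT 2']
```

Call .strip() on each item if the leading space in ' EXTRACT 1' is not wanted.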
CodePudding user response:
I got the solution from @oriberu:
regex = r"<t>(\w+)</t>(?=.*?</head>)"
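For reference, here is a sketch of the lookahead idea run against the sample string from the question. The lookahead (?=.*?</head>) only lets a <t>...</t> match count if a closing </head> still appears somewhere after it. Two things here are my own assumptions and not part of the original suggestion: the capture group is swapped to [^<]* so that values containing spaces (like "EXTRACT 1") are captured, and the input is assumed to be a single line with one head section and no <t> tags before <head>.

```python
import re

s = "<response>blabla <head> blabla <t> EXTRACT 1</t> <t>EXTRACT 2</t> </head> <body> blabla <t>BODY 1</t> <t>BODY 2</t> </response>"

# Assumed variant of the pattern: [^<]* captures any text up to the next tag.
# The lookahead requires a </head> somewhere later in the string, which
# rules out the <t> tags that sit in the body after the head has closed.
regex = r"<t>([^<]*)</t>(?=.*?</head>)"

matches = re.findall(regex, s)
print(matches)  # [' EXTRACT 1', 'EXTRACT 2']
```

Note that a <t> placed before <head> would also satisfy the lookahead, so this only works when every earlier <t> really belongs to the head.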