I want to parse many HTML pages and remove a div that contains the text "Message", using Python and BeautifulSoup with html.parser.
The div has no name or id, so I cannot select it directly. I am able to do this for one HTML page. In the code below, you will see .parent six times.
This is because, in this HTML page, there are 5 tags (p, i, b, span, a) between the div tag and the text "Message", and the 6th tag is the div itself. The code below works fine for that one HTML page.
soup = BeautifulSoup(html_page,"html.parser")
scores = soup.find_all(text=re.compile('Message'))
divs = [score.parent.parent.parent.parent.parent.parent for score in scores]
divs.decompose()
The problem is that the number of tags between the div and "Message" is not always 6. In some HTML pages it is 3, and in others it is 7.
So, is there a way to dynamically find the number of tags (n) between the text "Message" and the nearest div to its left, and then apply .parent n + 1 times to score (in the code above), using Python and BeautifulSoup?
CodePudding user response:
Assuming, as described in your question, that there is no other <div> in between, you could use .find_parent():
soup.find(text=re.compile('Message')).find_parent('div').decompose()
Be aware that if you use find_all(), you have to iterate over the ResultSet and call .find_parent() on each element:
for r in soup.find_all(text=re.compile('Message')):
    r.find_parent('div').decompose()
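The same applies to divs.decompose() in your example: divs is a list, which has no .decompose() method, so you have to iterate over it and call .decompose() on each element.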
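To illustrate the point about the list, here is a minimal sketch of the question's hard-coded approach with only the iteration fixed. The sample html_page is made up for illustration and has just two tags (p, i) between each div and the text, so three .parent hops are used instead of six. It uses string=, which is the newer name for the text= argument in recent BeautifulSoup releases.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page: two tags (p, i) sit between each div and "Message".
html_page = (
    "<div><p><i>Message</i></p></div>"
    "<div><p><i>Message</i></p></div>"
    "<p>kept</p>"
)

soup = BeautifulSoup(html_page, "html.parser")
scores = soup.find_all(string=re.compile("Message"))

# Three hops: i, then p, then the div (hard-coded, valid for this sample only).
divs = [score.parent.parent.parent for score in scores]

# divs is a plain Python list, so decompose each element individually.
for div in divs:
    div.decompose()

print(soup)
```

This still breaks as soon as the nesting depth changes, which is why .find_parent('div') is the better fit.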
Example
from bs4 import BeautifulSoup
import re
html='''
<div>
<span>
<i>
<x>Message</x>
</i>
</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
soup.find(text=re.compile('Message')).find_parent('div')
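Putting this together for many pages with different nesting depths, one possible sketch is shown below. The function name remove_message_divs and the sample pages are hypothetical, chosen only for illustration. It assumes the nearest enclosing div is always the one to remove, collects all target divs before deleting anything, and skips text that has no div ancestor, so that .find_parent() returning None does not raise an error.

```python
import re
from bs4 import BeautifulSoup


def remove_message_divs(html_page, pattern="Message"):
    """Remove the nearest enclosing <div> of every text node matching pattern."""
    soup = BeautifulSoup(html_page, "html.parser")

    # Collect target divs first so the tree is not modified while searching.
    targets = []
    for node in soup.find_all(string=re.compile(pattern)):
        div = node.find_parent("div")  # None if the text is not inside a div
        # Compare by identity: bs4 tags with equal content compare as equal.
        if div is not None and not any(div is t for t in targets):
            targets.append(div)

    # Drop targets nested inside another target; removing the outer div
    # already removes them.
    outermost = [
        d for d in targets
        if not any(p is t for p in d.parents for t in targets)
    ]
    for div in outermost:
        div.decompose()
    return str(soup)


# Hypothetical pages with 1 and 5 tags between the div and the text.
shallow = "<div><p>Message</p></div><p>keep</p>"
deep = "<div><span><a><b><i><p>Message</p></i></b></a></span></div><p>keep</p>"

for page in (shallow, deep):
    print(remove_message_divs(page))
```

Because .find_parent('div') walks up until it reaches a div, the number of intermediate tags never has to be counted.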