html = """
<html>
<h2>Top Single Name</h2>
<table>
<tr>
<p>hello</p>
</tr>
</table>
<div>
<div>
<h2>Price Return</h2>
</div>
</div>
</html>
"""
When I use the code below:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'html.parser')
soup.find_all(['p', 'li', 'dl', 'tr', 'div', re.compile("^h[1-6]$")])
I get this output:
[<h2>Top Single Name</h2>,
<tr><p>hello</p></tr>,
<p>hello</p>,
<div>
<div>
<h2>Price Return</h2>
</div>
</div>,
<div>
<h2>Price Return</h2>
</div>,
<h2>Price Return</h2>]
But what I need is only these three elements:
[<h2>Top Single Name</h2>,
<tr><p>hello</p></tr>,
<div>
<div>
<h2>Price Return</h2>
</div>
</div>
]
Basically, I don't want to extract a specific tag if it is inside another tag. Is there any way I can define a mapping like the one below and use it in the code, so that a tag is not extracted when the key is inside the value?
{'re.compile("^h[1-6]$")': 'div', 'div':'div', 'p': 'tr'}
CodePudding user response:
"Basically I don't want to extract a specific tag if it is inside another tag"
I think the simplest way might be to use find_all just as you are now, and then filter out the nested tags by checking whether any of their ancestors/parents are also in the list:
sel = soup.find_all(['p', 'li', 'dl', 'tr', 'div', re.compile("^h[1-6]$")])
sel = [s for s in sel if not [p for p in sel if p in s.parents]]
This gives the same results as selecting tags whose name is in a list, as long as none of their parents has one of the listed names:
selTags = ['p', 'li', 'dl', 'tr', 'div'] + [f'h{i}' for i in range(1,7)]
sel = soup.find_all(lambda t: t.name in selTags and not t.find_parent(selTags))
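As a quick check, here is a minimal self-contained sketch that runs both variants on a compacted copy of the HTML from the question and prints the tag names each one keeps. The variable names (names, top_by_filter, top_by_lambda) are only illustrative, and the sketch assumes the html.parser builder, as in the question.

```python
from bs4 import BeautifulSoup

# Compacted copy of the HTML from the question.
html = """<html>
<h2>Top Single Name</h2>
<table><tr><p>hello</p></tr></table>
<div><div><h2>Price Return</h2></div></div>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Plain tag names instead of a regex, so the same list works with find_parent.
names = ["p", "li", "dl", "tr", "div"] + [f"h{i}" for i in range(1, 7)]

# Variant 1: select everything, then drop tags that have a selected ancestor.
sel = soup.find_all(names)
top_by_filter = [s for s in sel if not any(a in sel for a in s.parents)]

# Variant 2: do the ancestor check inside the find_all filter itself.
top_by_lambda = soup.find_all(
    lambda t: t.name in names and not t.find_parent(names)
)

print([t.name for t in top_by_filter])
print([t.name for t in top_by_lambda])
```

Both lists should contain the first h2, the tr, and the outer div, which are the three elements asked for.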
But if you want to filter by a map, as in
"is there any way i can have some mapping like below and use in the code don't extract when the key is inside value"
you could use:
parentMap = {'div':'div', 'p': 'tr'}
for i in range(1,7): parentMap[f'h{i}'] = 'div'
# parentMap = {'div': 'div', 'p': 'tr', 'h1': 'div', 'h2': 'div', 'h3': 'div', 'h4': 'div', 'h5': 'div', 'h6': 'div'}
sel = soup.find_all(
    lambda t: t.name in
    ['p', 'li', 'dl', 'tr', 'div'] + [f'h{i}' for i in range(1,7)]
    and not (
        # skip the tag only if it is nested inside the parent mapped to its name
        t.name in parentMap and
        t.find_parent(parentMap[t.name]) is not None
    )
)
In this case, you should get the same results either way, but if your html contained
<p><tr>I am a row in a paragraph</tr></p>
then the first two methods will return only the outer <p> tag, whereas the last method will return both the <p> tag and the inner <tr> tag [unless you add 'tr': 'p' to parentMap].
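To make that difference concrete, here is a small sketch comparing the "any listed ancestor" filter with the map-based filter on that snippet. The helper not_nested and the names parent_map, any_ancestor, mapped, and mapped_with_tr are hypothetical, introduced only for this illustration, and the sketch assumes html.parser, which keeps the <tr> nested inside the <p> as written.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><tr>I am a row in a paragraph</tr></p>", "html.parser")

names = ["p", "li", "dl", "tr", "div"] + [f"h{i}" for i in range(1, 7)]
parent_map = {"div": "div", "p": "tr", **{f"h{i}": "div" for i in range(1, 7)}}


def not_nested(tag, mapping):
    """Return True unless the tag sits inside the parent mapped to its name."""
    blocker = mapping.get(tag.name)
    return blocker is None or tag.find_parent(blocker) is None


# Skip a tag if ANY listed name appears among its ancestors.
any_ancestor = soup.find_all(
    lambda t: t.name in names and not t.find_parent(names)
)

# Skip a tag only if its mapped parent appears among its ancestors.
mapped = soup.find_all(
    lambda t: t.name in names and not_nested(t, parent_map)
)

# Same map with the extra 'tr': 'p' entry added.
mapped_with_tr = soup.find_all(
    lambda t: t.name in names and not_nested(t, {**parent_map, "tr": "p"})
)

print([t.name for t in any_ancestor])
print([t.name for t in mapped])
print([t.name for t in mapped_with_tr])
```

The map-based version is the more selective of the two: a tag is only skipped for the specific parent you list for it, so every unwanted nesting has to be spelled out in the map.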