Retrieving all tags with multiple conditions without knowing the names of the respective attributes

Time:11-03

I want to find all tags that have attribute values equal to "ATTR1" and "ATTR2" without knowing the corresponding attribute names.

Let's assume I have the following:

page_content = '''<a href="ATTR1">text1</a>
<div  type="ATTR2">text2</div>
<script class="ATTR1" id="ATTR2">text3</script>
<span  id="ATTR2">text5</span>'''

I would like to have a script that retrieves only the third element, which has an attribute equal to "ATTR1" AND an attribute equal to "ATTR2". That is, I need the following:

<script class="ATTR1" id="ATTR2">text3</script>

I know I can pass a function as an argument to find_all(), but I need help writing a function that returns True when these conditions are met.

CodePudding user response:

If you know the attribute names, simply chain your conditions, e.g. with a CSS selector (here id="ATTR2" and class "ATTR1"):

soup.select('#ATTR2.ATTR1')
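To make the selector syntax concrete, here is a minimal sketch of the shorthand form next to the equivalent bracket form. The sample markup is an assumption: the class="ATTR1" on the script tag is inferred from the selector, and the variable names by_shorthand, by_brackets and by_type are purely illustrative.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: class="ATTR1" on <script> is inferred from the
# selector, not copied verbatim from the question.
html = '''
<a href="ATTR1">text1</a>
<div type="ATTR2">text2</div>
<script class="ATTR1" id="ATTR2">text3</script>
<span id="ATTR2">text5</span>
'''

soup = BeautifulSoup(html, "html.parser")

# Shorthand: "#" tests the id attribute, "." tests one of the class values.
by_shorthand = soup.select('#ATTR2.ATTR1')

# The same two conditions spelled out as attribute selectors.
by_brackets = soup.select('[id="ATTR2"][class~="ATTR1"]')

# Attributes other than id and class need the bracket form.
by_type = soup.select('[type="ATTR2"]')

print(by_shorthand)
print(by_brackets)
print(by_type)
```

Both forms only work when the attribute names are known, since every condition in a selector names the attribute it tests.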

Or, without knowing the attribute names, check each tag's attribute values against the values you are looking for:

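for e in soup():  # soup() is shorthand for soup.find_all()
    # flatten all attribute values; multi-valued attributes such as class are lists
    attr_list = [v for i in e.attrs.values() for v in (i if isinstance(i, list) else [i])]
    if all(x in attr_list for x in ['ATTR1', 'ATTR2']):
        print(e)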
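Since the question asks specifically about passing a function to find_all(), the same check can be wrapped in a predicate. This is a minimal sketch, not the only way to do it: the names WANTED and has_all_values are hypothetical, and the sample markup assumes that only the script tag carries both values (class="ATTR1" and id="ATTR2").

```python
from bs4 import BeautifulSoup

# Assumed sample markup: only <script> has both values.
html = '''
<a href="ATTR1">text1</a>
<div type="ATTR2">text2</div>
<script class="ATTR1" id="ATTR2">text3</script>
<span id="ATTR2">text5</span>
'''

# Hypothetical name: the attribute values every matching tag must carry.
WANTED = {"ATTR1", "ATTR2"}


def has_all_values(tag):
    """Return True if every wanted value appears among the tag's attribute values."""
    values = set()
    for value in tag.attrs.values():
        # Multi-valued attributes such as class are stored as lists.
        if isinstance(value, list):
            values.update(value)
        else:
            values.add(value)
    # Subset test: all wanted values must be present, under any attribute name.
    return WANTED <= values


soup = BeautifulSoup(html, "html.parser")

# find_all() calls the function once per tag and keeps the tags
# for which it returns True.
matches = soup.find_all(has_all_values)
print(matches)
```

Using a set makes the "all values present" condition a single subset comparison, and the predicate never refers to an attribute name.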

Example

from bs4 import BeautifulSoup

html = '''
<a href="ATTR1">text1</a>
<div  type="ATTR2">text2</div>
<script class="ATTR1" id="ATTR2">text3</script>
<span  id="ATTR2">text5</span>'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.select('#ATTR2.ATTR1'))

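for e in soup():  # soup() is shorthand for soup.find_all()
    # flatten all attribute values; multi-valued attributes such as class are lists
    attr_list = [v for i in e.attrs.values() for v in (i if isinstance(i, list) else [i])]
    if all(x in attr_list for x in ['ATTR1', 'ATTR2']):
        print(e)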

Output

[<script class="ATTR1" id="ATTR2">text3</script>]

<script class="ATTR1" id="ATTR2">text3</script>