problem with selecting a tag in BeautifulSoup

Time:12-02

I have a tag like the one below that I want to select with Beautiful Soup:

<td align="right" class="simcal" valign="top"> Title:<br/></td>

When I try to select this tag with either of the following snippets, everything works.

# sample 1:
my_tag = soup.find(
    'td',
    attrs={"align": "right", "class": "header2", "valign": "top"},
)
# sample 2:
my_tag = soup.find(
    text=" Title:",
    attrs={"align": "right", "class": "header2", "valign": "top"},
)

But when I try to combine the two, Beautiful Soup cannot find the element I want.

# This will fail
my_tag = soup.find(
    'td',
    text=" Title:",
    attrs={"align": "right", "class": "header2", "valign": "top"},
)

Can someone explain what is happening here?

CodePudding user response:

First, there's a typo in this. Your code looks for class "header2" when in your HTML it's "simcal".
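As a quick check of the class mismatch, here is a minimal sketch that assumes the HTML from the question and the built-in 'html.parser'. The variable names wrong_class and right_class are only illustrative:

```python
from bs4 import BeautifulSoup

# HTML copied from the question; note that the class is "simcal".
html = '<td align="right" class="simcal" valign="top"> Title:<br/></td>'
soup = BeautifulSoup(html, "html.parser")

# "header2" does not occur in this HTML, so nothing is found.
wrong_class = soup.find(
    "td",
    attrs={"align": "right", "class": "header2", "valign": "top"},
)

# With the class that is actually in the HTML, the same call finds the cell.
right_class = soup.find(
    "td",
    attrs={"align": "right", "class": "simcal", "valign": "top"},
)

print(wrong_class)
print(right_class)
```

If the real page does use "header2", the attribute filter itself is fine and only the pasted HTML sample is off.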

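Secondly (this is just my understanding, I can't say for certain), the text " Title:" is not the only thing inside the <td>: the cell also contains a <br/> tag with no attributes. When a tag name and text are passed together, Beautiful Soup compares the text against the tag's .string, and .string is None for a tag with more than one child, so it's correct that nothing is returned even though align="right" and valign="top" do belong to the <td> tag. What is tricky here is that in HTML a <br> tag doesn't need a closing tag, so it is easy to overlook as a separate child, which I think is why BeautifulSoup appears to get tripped up here.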
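To see why the combined call returns nothing, it can help to inspect the cell's children and its .string. The sketch below assumes the question's HTML (with class "simcal") and the built-in 'html.parser'. It uses string=, the newer name for the text= argument. The helper is_title_cell and both workarounds are illustrations rather than part of the original answer:

```python
from bs4 import BeautifulSoup

html = '<td align="right" class="simcal" valign="top"> Title:<br/></td>'
soup = BeautifulSoup(html, "html.parser")

td = soup.find("td", attrs={"class": "simcal"})

# The <td> has two children (a text node and <br/>), so .string is None.
children = list(td.children)
td_string = td.string

# A tag name plus a string filter is compared against td.string,
# so the combined lookup finds nothing.
combined = soup.find("td", string=" Title:", attrs={"class": "simcal"})

# Possible workaround 1: find the text node first, then climb to its <td>.
text_node = soup.find(string=" Title:")
via_parent = text_node.find_parent("td", attrs={"class": "simcal"})


# Possible workaround 2: a function filter that compares the tag's full text.
def is_title_cell(tag):
    return (
        tag.name == "td"
        and "simcal" in tag.get("class", [])
        and tag.get_text(strip=True) == "Title:"
    )


via_function = soup.find(is_title_cell)

print(len(children), td_string, combined)
print(via_parent is td, via_function is td)
```

Both workarounds should return the same <td>. The first relies on the text node matching exactly, leading space included, while the second strips whitespace before comparing.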

Notice that if we remove the </br> tag, it works:

from bs4 import BeautifulSoup

# Same cell as in the question's code, but without the </br> tag
html = '''<td align="right" class="header2" valign="top"> Title:</td>'''

soup = BeautifulSoup(html, 'html.parser')
my_tag = soup.find(
    'td',
    text=" Title:",
    attrs={"align": "right", "class": "header2", "valign": "top"},
)

print(my_tag)

Output:

<td align="right" class="header2" valign="top"> Title:</td>

To fix this in your case without having to remove the closing </br> tags, and with help from this solution, we see that the 'lxml' parser, used instead of 'html.parser', can deal with it.

from bs4 import BeautifulSoup

html = '''<td align="right" class="header2" valign="top"> Title:</br></td>'''

soup = BeautifulSoup(html, 'lxml')

# sample 1: tag name and attributes
my_tag1 = soup.find(
    'td',
    attrs={"align": "right", "class": "header2", "valign": "top"},
)
# sample 2: text and attributes
my_tag2 = soup.find(
    text=" Title:",
    attrs={"align": "right", "class": "header2", "valign": "top"},
)
# combined: tag name, text and attributes
my_tag3 = soup.find(
    'td',
    text=" Title:",
    attrs={"align": "right", "class": "header2", "valign": "top"},
)

print(my_tag1)
print(my_tag2)
print(my_tag3)

Output:

<td align="right" class="header2" valign="top"> Title:</td>
<td align="right" class="header2" valign="top"> Title:</td>
<td align="right" class="header2" valign="top"> Title:</td>