I have a tag like the one below that I want to select with Beautiful Soup:
<td align="right" class="simcal" valign="top"> Title:<br/></td>
When I select this tag with either of the following snippets, everything works.
# sample 1 :
my_tag = soup.find(
'td',
attrs={"align": "right", "class": "header2", "valign": 'top'},
)
# sample 2 :
my_tag = soup.find(
text=" Title:",
attrs={"align": "right", "class": "header2", "valign": 'top'},
)
But when I combine the two, Beautiful Soup cannot find the element I want.
# This will fail
my_tag = soup.find(
'td',
text=" Title:",
attrs={"align": "right", "class": "header2", "valign": 'top'},
)
Can someone explain what is happening here?
CodePudding user response:
First, there's a typo in this. You have it look for "class": "header2" when in your html it's class="simcal".
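Secondly (this is just my understanding, I can't say for certain), the problem is the <br> tag. The text " Title:" is not the only thing inside the <td>: the <br> comes right after it, so the <td> has two children. When you pass both a tag name and text, BeautifulSoup compares the text with the tag's .string, and .string is None for a tag with more than one child. So it's correct that nothing is returned, even though the attributes align="right" valign="top" do belong to the <td> tag. What is tricky here is that <br> is a void element in HTML that needs neither content nor a closing tag, which I think is why BeautifulSoup is getting tripped up here.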
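One way to check this is to look at the <td>'s children and its .string directly. This is only a minimal sketch, assuming the exact markup from the question and the 'html.parser' parser; the variable names are mine, and string= is the newer name for the text= argument (Beautiful Soup 4.4.0 and later).

```python
from bs4 import BeautifulSoup

# The markup from the question, unchanged.
html = '<td align="right" class="simcal" valign="top"> Title:<br/></td>'
soup = BeautifulSoup(html, "html.parser")

td = soup.find("td")
children = list(td.children)  # the text node and the <br/> tag
td_string = td.string         # None when a tag has more than one child

# Combining a tag name with a string compares the string to td.string.
combined = soup.find("td", string=" Title:")

print(children)
print(td_string)
print(combined)
```

If td_string prints None, the combined search has nothing to compare " Title:" against, which matches what the question describes.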
Notice that if we remove the </br> tag, it works:
from bs4 import BeautifulSoup
html = '''<td align="right" class="header2" valign="top"> Title:</td>'''
soup = BeautifulSoup(html, 'html.parser')
my_tag = soup.find(
'td',
text=" Title:",
attrs={"align": "right", "class": "header2", "valign": 'top'},
)
print(my_tag)
Output:
<td align="right" class="header2" valign="top"> Title:</td>
To fix this in your case without having to remove the closing </br> tags, and with help from this solution, we can use the 'lxml' parser instead of 'html.parser', which can deal with it.
from bs4 import BeautifulSoup
html = '''<td align="right" class="header2" valign="top"> Title:</br></td>'''
soup = BeautifulSoup(html, 'lxml')
# sample 1 :
my_tag1 = soup.find(
'td',
attrs={"align": "right", "class": "header2", "valign": 'top'},
)
# sample 2 :
my_tag2 = soup.find(
text=" Title:",
attrs={"align": "right", "class": "header2", "valign": 'top'},
)
my_tag3 = soup.find(
'td',
text=" Title:",
attrs={"align": "right", "class": "header2", "valign": 'top'},
)
print(my_tag1)
print(my_tag2)
print(my_tag3)
Output:
<td align="right" class="header2" valign="top"> Title:</td>
<td align="right" class="header2" valign="top"> Title:</td>
<td align="right" class="header2" valign="top"> Title:</td>
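If switching parsers is not an option, another approach is to pass find() a function that checks the attributes and the tag's stripped text itself, so the extra <br/> child no longer matters. This is a rough sketch under my own assumptions, not something from the question: is_title_cell is a hypothetical helper, and it uses the class "simcal" from the question's HTML rather than "header2".

```python
from bs4 import BeautifulSoup

# The markup from the question, including the <br/> tag.
html = '<td align="right" class="simcal" valign="top"> Title:<br/></td>'
soup = BeautifulSoup(html, "html.parser")


def is_title_cell(tag):
    """Hypothetical helper: match the <td> by attributes and stripped text."""
    return (
        tag.name == "td"
        and tag.get("align") == "right"
        and tag.get("valign") == "top"
        and "simcal" in tag.get("class", [])  # class is parsed as a list
        and tag.get_text(strip=True) == "Title:"
    )


# find() calls the function on each tag and returns the first match.
my_tag = soup.find(is_title_cell)
print(my_tag)
```

Because get_text(strip=True) joins the text of all descendants, it does not depend on the <td> having a single child the way .string does.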