I need to get the text from the first <td> element of each <tr>. But not all of the text, only the text inside <a> tags and the text outside of any other tag. In the example below, the text I need is marked as "yyy"/"y" and the text I do not need is marked as "zzz".
<table>
<tbody>
<tr>
<td>
<b>zzz</b>
<a href="#">yyy</a>
"y"
<a href="#">yyy</a>
<sup>zzz</sup>
<a href="#">yyy</a>
<a href="#">yyy</a>
"y"
</td>
<td>
zzzzz
</td>
</tr>
</tbody>
</table>
Here is what I have at the moment:
words = []
for tableRows in soup.select("table > tbody > tr"):
    tableData = tableRows.find("td").text
    text = [word.strip() for word in tableData.split(' ')]
    words.append(text)
print(words)
But this code parses all of the text from the <td>: ["zzz", "yyyy", "yyyy", "zzz", "yyyy"].
CodePudding user response:
Try:
from bs4 import BeautifulSoup, Tag, NavigableString
html_doc = """\
<table>
<tbody>
<tr>
<td>
<b>zzz</b>
<a href="#">yyy</a>
"y"
<a href="#">yyy</a>
<sup>zzz</sup>
<a href="#">yyy</a>
<a href="#">yyy</a>
"y"
</td>
<td>
zzzzz
</td>
</tr>
</tbody>
</table>"""
soup = BeautifulSoup(html_doc, "html.parser")
for td in soup.select("td:nth-of-type(1)"):
    for c in td.contents:
        # keep the text of direct <a> children
        if isinstance(c, Tag) and c.name == "a":
            print(c.text.strip())
        # keep bare text nodes, skipping whitespace-only ones
        elif isinstance(c, NavigableString):
            c = c.strip()
            if c:
                print(c)
Prints:
yyy
"y"
yyy
yyy
yyy
"y"
- soup.select("td:nth-of-type(1)") selects just the first <td> of each row
- then we iterate over the .contents of this <td>
- if isinstance(c, Tag) and c.name == "a" checks whether the content is a Tag and the name of the Tag is <a>
- if isinstance(c, NavigableString) checks whether the content is a plain string.
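The same steps can be wrapped in a function that keeps the results grouped per row. This is only a sketch: the name first_cell_texts and the two-row sample are made up for illustration, and it assumes the first <td> of a row is the one found by tr.find("td").

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def first_cell_texts(html):
    """Return, per row, the <a> text and bare text of the first <td>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tr"):
        td = tr.find("td")
        if td is None:  # e.g. a header row that has only <th> cells
            continue
        parts = []
        for c in td.contents:
            if isinstance(c, Tag) and c.name == "a":
                text = c.get_text(strip=True)
            elif isinstance(c, NavigableString):
                text = c.strip()
            else:
                text = ""  # any other tag (<b>, <sup>, ...) is skipped
            if text:
                parts.append(text)
        rows.append(parts)
    return rows


# Hypothetical sample with two rows, modelled on the question's markup.
sample = """
<table><tbody>
<tr><td><b>zzz</b><a href="#">yyy</a> "y" <sup>zzz</sup><a href="#">yyy</a></td><td>zzzzz</td></tr>
<tr><td><a href="#">yyy</a> "y"</td><td>zzzzz</td></tr>
</tbody></table>
"""
print(first_cell_texts(sample))
```

Returning one list per row, instead of printing, makes it easier to tell which text came from which <tr>.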
CodePudding user response:
Based on your example, use the children of the td tag.
Then check whether the child's name is a or None (None means it is a plain text node).
Then, if the child has any text, append it.
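words = []
for item in soup.select("table > tbody > tr"):
    # item.td is the first <td> in the row
    for child in item.td.children:
        # <a> tags have name "a"; plain text nodes have name None
        if child.name == 'a' or child.name is None:
            if child.text.strip():
                words.append(child.text.strip())
print(words)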
Output:
['yyy', '"y"', 'yyy', 'yyy', 'yyy', '"y"']
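The name check works because text nodes have no tag name. A small sketch to see this, using a shortened, made-up <td> rather than the full table:

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical cell: one <b>, one <a>, one bare text node.
td = BeautifulSoup('<td><b>zzz</b><a href="#">yyy</a>"y"</td>', "html.parser").td

# Pair each direct child's .name with whether it is a plain text node.
names = [(child.name, isinstance(child, NavigableString)) for child in td.children]
print(names)
```

Tags report their own name, while the text node is expected to report None, which is what the condition above relies on.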