Basically title, my html looks like this:
<th data-stat='foo'> 10 </th>
<th data-stat='bar'> 20 </th>
<th data-stat='DUMMY'> </th>
and I tried using
x = [td.getText() for td in rows[i].findAll('td') and not rows[i].findAll(attrs={"data-stat":"DUMMY"})]
but that did not work, obviously. My desired output would only get the text from data-stat="foo" and data-stat="bar", which would look like:
x=["10","20"]
CodePudding user response:
You can find this easily in the documentation: look up each cell by its data-stat attribute with find().
from bs4 import BeautifulSoup
table = """<th data-stat='foo'> 10 </th>
<th data-stat='bar'> 20 </th>
<th data-stat='DUMMY'> </th>"""
soup = BeautifulSoup(table, "lxml")
value_list = []
value_list.append(soup.find("th", {"data-stat": "foo"}).text.strip())
value_list.append(soup.find("th", {"data-stat": "bar"}).text.strip())
print(value_list)
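The lookups above name each data-stat value explicitly. If the values are not known in advance and only DUMMY should be skipped, a minimal sketch (assuming the same three cells, and using the built-in "html.parser" instead of "lxml" so nothing extra needs installing) could pass a function as the attribute filter to find_all():

```python
from bs4 import BeautifulSoup

html = """<th data-stat='foo'> 10 </th>
<th data-stat='bar'> 20 </th>
<th data-stat='DUMMY'> </th>"""

soup = BeautifulSoup(html, "html.parser")

# A function used as an attribute filter receives the attribute's value
# (or None when the tag has no such attribute) and returns True to keep the tag.
cells = soup.find_all("th", attrs={"data-stat": lambda value: value != "DUMMY"})

# strip=True removes the surrounding whitespace from each cell's text.
value_list = [th.get_text(strip=True) for th in cells]
print(value_list)
```

Note that this filter also keeps any th that has no data-stat attribute at all, since the function then receives None.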
CodePudding user response:
Use a CSS selector with the pseudo-class :not() to select your elements:
soup.select('th:not([data-stat="DUMMY"])')
Note: In your question you try to find td while there is only th in your example.
Example
from bs4 import BeautifulSoup
html ='''
<th data-stat='foo'> 10 </th>
<th data-stat='bar'> 20 </th>
<th data-stat='DUMMY'> </th>
'''
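# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning.
soup = BeautifulSoup(html, "html.parser")
print([e.get_text(strip=True) for e in soup.select('th:not([data-stat="DUMMY"])')])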
Output
['10', '20']
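Since the question loops over rows[i], the same selector can probably be applied per row as well. This is only a sketch: the surrounding table and tr markup and the second row of values are assumptions made for illustration, not taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical table: the <table>/<tr> wrapper and the second row are assumed.
html = """
<table>
  <tr><th data-stat='foo'> 10 </th><th data-stat='bar'> 20 </th><th data-stat='DUMMY'> </th></tr>
  <tr><th data-stat='foo'> 30 </th><th data-stat='bar'> 40 </th><th data-stat='DUMMY'> </th></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# For every row, collect the text of each cell that is not marked DUMMY.
data = [
    [cell.get_text(strip=True) for cell in row.select('th:not([data-stat="DUMMY"])')]
    for row in rows
]
print(data)
```

If the real rows contain td cells instead, replacing th with td in the selector should give the same behaviour.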