Beautiful soup not identifying children of an element

Time:12-22

I am trying to scrape this webpage. I am interested in scraping the text under DIV CLASS="example".

This is the snippet of the page source I am interested in (Stack Overflow automatically banned my post when I tried to post the code, lol): snapshot of the source code

I tried using the find function from beautifulsoup. The code I used was:

import urllib.request
from bs4 import BeautifulSoup as soup  # alias assumed from the soup(...) call below

testurl = "https://www.snopes.com/fact-check/dark-profits/"
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
HEADERS = {'User-agent': user_agent}
req = urllib.request.Request(testurl, headers=HEADERS)  # visit disguised as browser
pagehtml = urllib.request.urlopen(req).read()  # read the website
pagesoup = soup(pagehtml, 'html.parser')

potentials = pagesoup.findAll("div", {"class": "example"})
potentials[0]
potentials[0].find_children

potentials[0].find_children was not able to find anything. I have also tried potentials[0].findChildren(), and it was not able to find anything either. Why is find_children not picking up the children of the div tag?

CodePudding user response:

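Try changing the parser from html.parser to html5lib, which parses pages the way a web browser does and is more lenient with malformed markup (it has to be installed separately, e.g. pip install html5lib):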

import requests
from bs4 import BeautifulSoup

url = "https://www.snopes.com/fact-check/dark-profits/"

soup = BeautifulSoup(requests.get(url).content, "html5lib")
print(soup.select_one(".example").get_text(strip=True, separator="\n"))

Prints:

Welcome to the site www.darkprofits.com, it's us again, now we extended our offerings, here is a list:

...and so on.
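
A side note on the question's code: potentials[0].find_children (no parentheses) is not a method call. Beautiful Soup treats unknown attribute access on a tag as a lookup for a child tag with that name, so it returns None no matter which parser is used. Below is a minimal sketch of how one might check what a div actually contains and whether different parsers agree. The HTML string is a hypothetical stand-in for the real page, and describe_children and compare_parsers are illustrative names, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the real page; the markup is deliberately small.
HTML = '<div class="example"><p>first</p><p>second <b>bold</b></p></div>'


def describe_children(html, parser="html.parser"):
    """Return the names of the direct child tags of the first div.example."""
    soup = BeautifulSoup(html, parser)
    div = soup.find("div", class_="example")
    if div is None:
        return []
    # recursive=False restricts the search to direct children only.
    return [child.name for child in div.find_all(recursive=False)]


def compare_parsers(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser to the children it sees; skip missing ones."""
    results = {}
    for parser in parsers:
        try:
            results[parser] = describe_children(html, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            continue
    return results


soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div", class_="example")

# Attribute access without parentheses is a tag lookup, not a method call:
# bs4 searches for a child tag literally named <find_children>.
attribute_lookup = div.find_children

# find_all() with no arguments returns every descendant tag
# (findChildren is an older alias for the same method).
descendants = div.find_all()

print(attribute_lookup)
print([tag.name for tag in descendants])
print(compare_parsers(HTML))
```

If compare_parsers reports different children for the same markup, the page is probably malformed in a way the parsers repair differently, which is the situation where switching to html5lib can help.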