How do you use BeautifulSoup and Selenium to scrape HTML inside a shadow DOM?


I'm trying to write an automation program to scrape part of a website. The site is rendered with JavaScript, and the part I want to scrape is inside a shadow DOM.

So I figured out that I should use Selenium to load the site, with this helper to access elements inside the shadow DOM:

def expand_shadow_element(element):
    # Ask the browser for the element's shadow root via JavaScript
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root

and use

driver.page_source

to get the HTML of the page. But page_source doesn't include the elements inside the shadow DOM.

I've tried combining the two:

root1 = driver.find_element(By.CSS_SELECTOR, "path1")
shadow_root = expand_shadow_element(root1)
html = shadow_root.page_source

but I got

AttributeError: 'ShadowRoot' object has no attribute 'page_source'

in response. So I think I need to use BeautifulSoup to scrape the data from that page, but I can't figure out how to combine BeautifulSoup and Selenium to scrape data from inside a shadow DOM.


P.S. If the part I want to scrape is

<h3>apple</h3>
<p>1$</p>
<p>red</p>

I want to scrape that code exactly, not

apple
1$
red

CodePudding user response:

You would use BeautifulSoup here as follows:

soup = BeautifulSoup(driver.page_source, 'lxml')
my_parts = soup.select('h3')   # for example
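
Note that driver.page_source only returns the regular (light) DOM, so the snippet above won't reach content inside a shadow root. A common workaround, sketched below, is to pull the shadow root's innerHTML through execute_script and parse that fragment with BeautifulSoup instead. This works for open shadow roots only, and "path1" stands in for the asker's placeholder selector for the shadow host:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

# Locate the shadow host, then read the raw markup inside its shadow root
host = driver.find_element(By.CSS_SELECTOR, "path1")
inner_html = driver.execute_script('return arguments[0].shadowRoot.innerHTML', host)

soup = BeautifulSoup(inner_html, 'lxml')
print(soup.select_one('h3'))   # e.g. <h3>apple</h3>

Because soup is built from the shadow root's own markup, printing a tag (rather than tag.text) gives the HTML exactly, e.g. <h3>apple</h3> instead of just apple, which is what the P.S. asks for.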

CodePudding user response:

Most likely you need to wait for the element to appear, so set an implicit or explicit wait; once the element has loaded, you can soup the page for the HTML result.

driver.implicitly_wait(15)  # in seconds
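
An explicit wait targets a specific element instead. A minimal sketch, assuming the shadow host matches the same placeholder selector "path1" used above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the shadow host to be present in the DOM
host = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "path1"))
)

Once the host is present, you can expand its shadow root and parse the markup as in the first answer.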
