How to scrape text from a hidden element?

I am trying to scrape the text of the Swiss constitution from Link and convert it to Markdown. However, the page source differs from what I see in the inspector: the source contains only noscript warnings in various languages, with the element "app-root" hidden.

The inspector shows a .html file served from here, and with that file I am able to get the desired result. However, using this file directly would not let me scrape subsequent revisions of the law automatically. Is there a way to extract the page source with the element "app-root" displayed?

This code returns "None" but works with the URL set to the .html file:

from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver import FirefoxOptions
from bs4 import BeautifulSoup
from markdownify import markdownify

url = "https://www.fedlex.admin.ch/eli/cc/1999/404/en"

opts = FirefoxOptions()
opts.add_argument("--headless")
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=opts)
driver.get(url)

html = driver.page_source

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "lawcontent"})

content = markdownify(str(div))

print(content[:200])

Any help is much appreciated.

CodePudding user response:

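Your code does not give the driver any time to render the content, so page_source is read while the page is still incomplete, most likely before its scripts have populated "app-root". The div with id "lawcontent" is therefore missing, soup.find() returns None, and markdownify(str(None)) produces the text "None".

Explicit waits let you pause until a required element is present or visible. The code below waits up to 20 seconds for the "lawcontent" div to become visible and only then reads the page source.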

Code snippet:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

# Reuses driver, BeautifulSoup and markdownify from the question's code.
url = "https://www.fedlex.admin.ch/eli/cc/1999/404/en"
delay = 20  # maximum time to wait, in seconds

driver.get(url)
try:
    # Block until the div with id "lawcontent" is rendered and visible.
    WebDriverWait(driver, delay).until(
        EC.visibility_of_element_located((By.ID, "lawcontent"))
    )
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": "lawcontent"})
    content = markdownify(str(div))
    print(content[:200])
except TimeoutException:
    # Raised if the element is not visible within the delay.
    print("Timeout!!!")
finally:
    driver.quit()  # always close the headless browser
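For intuition, an explicit wait can be pictured as a simple polling loop: check a condition, sleep briefly, and give up after a deadline. The sketch below is not Selenium's implementation; wait_until and FakePage are hypothetical names made up for illustration, and FakePage merely stands in for a page whose content appears after a few checks. It uses only the standard library, so it runs without a browser.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    (Hypothetical helper, not part of Selenium.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for a page whose content only exists after a few checks."""

    def __init__(self, ready_after):
        self.checks = 0
        self.ready_after = ready_after

    def find_lawcontent(self):
        # Returns None until the "page" has been checked often enough.
        self.checks += 1
        if self.checks >= self.ready_after:
            return "<div id='lawcontent'>...</div>"
        return None


page = FakePage(ready_after=3)
element = wait_until(page.find_lawcontent, timeout=2.0, poll_interval=0.01)
print(element, "found after", page.checks, "checks")
```

Reading the page once, as the question's code does, corresponds to calling page.find_lawcontent() a single time and getting None. Selenium's WebDriverWait accepts a similar poll_frequency argument if the default polling interval needs tuning.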