l am really struggling with this case and have been trying all day. Please l need your help.I am trying to scrape this webpage: https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or= l want to get all 137 href-s (137 documents). The code l used:
def test(self):
final_url = 'https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or='
self.driver.get(final_url)
soup = BeautifulSoup(self.driver.page_source, 'html.parser')
iframes = soup.find('iframe')
src = iframes['src']
base = 'https://decisions.scc-csc.ca/'
main_url = urljoin(base, src)
self.driver.get((main_url))
browser = self.driver
elem = browser.find_element_by_tag_name("body")
no_of_pagedowns = 20
while no_of_pagedowns:
elem.send_keys(Keys.PAGE_DOWN)
time.sleep(0.2)
no_of_pagedowns -= 1
The problem is that it loads only 25 first documents (href) and don't know how to do that.
CodePudding user response:
This code scrolls down until all elements are visible, then save the urls of the pdfs in the list pdfs
. Notice that all the work is done with selenium, without using BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(options=options, service=Service(your_chromedriver_path))
driver.get('https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or=')
# wait for the iframe to be loaded and then switch to it
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "decisia-iframe")))
# in this case number_of_results = 137
number_of_results = int(driver.find_element(By.XPATH, "//h2[contains(., 'result')]").text.split()[0])
pdfs = []
while len(pdfs) < number_of_results:
pdfs = driver.find_elements(By.CSS_SELECTOR, 'a[title="Download the PDF version"]')
# scroll down to the last visible row
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', pdfs[-1])
time.sleep(1)
pdfs = [pdf.get_attribute('href') for pdf in pdfs]