selenium: scraping a page till all the products loaded

Time:08-21

I am new to Selenium and am working on a project where I need to scrape the URLs from a page.

The source is:- https://www.autofurnish.com/audi-car-accessories

I want to scrape the URLs of all the products on this page. I can extract the URLs, but I am having trouble with the scrolling part. This is a huge page with a lot of results.

What I tried:-

1.

 driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

I tried this code, but it just scrolls down to the end of the page and not all of the products load.

2.

data = driver.find_elements(By.XPATH,"//h2[@class='product-title']//a")
for i in data:
    driver.execute_script("arguments[0].scrollIntoView();", i)

3.

items = []
last_height = driver.execute_script("return document.body.scrollHeight")
item_targetcount = 1000
while item_targetcount > len(items):
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
    time.sleep(2)  # giving time to website to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

I tried to take help from these questions: "How to scroll down in Python Selenium step by step" and "Scrolling to element using webdriver?". I also watched a few YouTube videos, but I am still unable to fix this.

My main code to scrape other details is:-

prod_details = []
for model in models:
    driver.find_element(By.XPATH,"//span[@aria-labelledby='select2-brand-container']").click()
    time.sleep(2)
    driver.find_element(By.XPATH,"//input[@class='select2-search__field']").send_keys(model)
    driver.find_element(By.XPATH,"//input[@class='select2-search__field']").send_keys(Keys.ENTER)
    driver.find_element(By.XPATH,"//div[@class='btnred sbv-link sbv-inactive']").click()
    time.sleep(3)
    prod = driver.find_elements(By.XPATH,"//h2[@class='product-title']//a")
    # use a separate name for the inner loop so it does not shadow the model
    for link in prod:
        prod_details.append(link.get_attribute("href"))
    driver.get('https://www.autofurnish.com/')
    time.sleep(2)

Still unable to load the page completely and get all the outputs.

CodePudding user response:

To extract the value of the href attribute from each product link, you can use a list comprehension with either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.get('https://www.autofurnish.com/audi-car-accessories#/pageSize=32&viewMode=grid&orderBy=0')
    print([my_elem.get_attribute("href") for my_elem in driver.find_elements(By.CSS_SELECTOR, "h2.product-title a")])
    driver.quit()
    
  • Using XPATH:

    driver.get('https://www.autofurnish.com/audi-car-accessories#/pageSize=32&viewMode=grid&orderBy=0')
    print([my_elem.get_attribute("href") for my_elem in driver.find_elements(By.XPATH, "//h2[@class='product-title']//a")])
    driver.quit()
    
  • Console Output:

    ['https://www.autofurnish.com/combo-of-7d-premium-car-pillow-neck-rest-hecta-6841-back-cushion-hecta-6851-each-set-of-two-beige', 'https://www.autofurnish.com/combo-of-7d-premium-car-pillow-neck-rest-hecta-6840-back-cushion-hecta-6850-each-set-of-two-black', 'https://www.autofurnish.com/combo-of-7d-premium-car-pillow-neck-rest-hecta-6843-back-cushion-hecta-6853-each-set-of-two-coffee', 'https://www.autofurnish.com/combo-of-7d-premium-car-pillow-neck-rest-hecta-6842-back-cushion-hecta-6852-each-set-of-two-tan', 'https://www.autofurnish.com/universal-2d-premium-leather-car-foot-mats-for-2-rows-beige', 'https://www.autofurnish.com/universal-2d-premium-leather-car-foot-mats-for-2-rows-black', 'https://www.autofurnish.com/universal-2d-premium-leather-car-foot-mats-for-2-rows-coffee', 'https://www.autofurnish.com/universal-2d-premium-leather-car-foot-mats-for-2-rows-tan', 'https://www.autofurnish.com/autofurnish-3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-beige', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-black', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-coffee', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-set-of-two-beige', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-set-of-two-black', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-set-of-two-brown', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-set-of-two', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-tan', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-beige', 
'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-black', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-coffee', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-set-of-2-beige', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-set-of-2-black', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-set-of-2-coffee', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-set-of-2-tan', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-tan', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a4-2021-beige', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a4-2021-black', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a4-2021-coffee', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a4-2021-tan', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a6-2020-beige', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a6-2020-black', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a6-2020-coffee', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a6-2020-tan']
    

CodePudding user response:

This is a pretty tricky one... I ran into several unexpected issues trying to get this to work.

The main issue is waiting for the loading spinner and keeping it on the screen. I originally tried scrolling to the bottom of the page as you did, but that puts the page into an infinite loop of loading new sections of products: the page footer is so large that the loading spinner ends up above the visible part of the page (at least for me). I fixed that by scrolling to the last visible product instead, which was enough to trigger the next section to load but not so low that the page went into infinite loading mode.

In most cases when there is a loading spinner involved, you want to wait for it to become visible and then invisible. This prevents bad timing situations and is the most reliable way to wait for the new products to load.

The basic flow is

  1. Load the page
  2. Start a loop
    1. Grab all product A tags
    2. Using JS, scroll the page down to the last A tag
    3. Wait for the loading spinner to become visible and then invisible
    4. If no more products loaded or some maximum product count is reached, exit the loop
  3. Write the total product count
  4. Write the product URLs

The code

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

...

# may need to adjust the timeout based on your experience... the site is really slow for me
wait = WebDriverWait(driver, 60)
new_count = 0
old_count = 0
while True:
    old_count = new_count
    products = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h2.product-title > a")))
    new_count = len(products)

    # scroll down to last product to trigger the loading spinner
    driver.execute_script("arguments[0].scrollIntoView();", products[-1])

    # wait for loading spinner to appear and then disappear
    try:
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.infinite-scroll-loader")))
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, "div.infinite-scroll-loader")))
    except TimeoutException:
        # wait.until() raises on timeout; here that is assumed to mean nothing more will load
        break

    # if the count didn't change, we've loaded all products on the page
    # I put a max of 50 products to load as a demo. You can adjust higher as needed,
    # but you should put something reasonably sized here to prevent the script
    # from running for an hour
    if new_count == old_count or new_count > 50:
        break

# print results
print(len(products))
for product in products:
    print(product.get_attribute("href"))

CodePudding user response:

Try working according to the following algorithm:

  1. Scrape the first page of products.
  2. Start scrolling:
    2.1) After each scroll, wait for the new elements to load
    2.2) Scrape the new elements
    2.3) Scroll the page
  3. Repeat until all the products have been scrolled through.
  4. Keep the scraped data in a set to remove duplicates, since after scrolling the currently visible elements often overlap with the previous or next batch.
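
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: `collect_product_urls` and its parameters are hypothetical names, the selector `h2.product-title a` is taken from the other answers, and "new elements loaded" is detected by the number of matching links growing rather than by watching the loading spinner. The cap `max_products` is only checked between batches, so the result can contain slightly more URLs than the cap.

```python
import time

# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR;
# spelled out here only so the sketch has no import-time dependency.
CSS_SELECTOR = "css selector"


def collect_product_urls(driver, selector="h2.product-title a",
                         max_products=200, load_timeout=30.0, poll=0.5):
    """Scrape, scroll, wait, and de-duplicate (hypothetical helper)."""
    seen = {}  # dict keys act as an insertion-ordered set
    while len(seen) < max_products:
        links = driver.find_elements(CSS_SELECTOR, selector)
        if not links:
            break
        # Scrape what is currently loaded; repeated hrefs are ignored.
        for link in links:
            href = link.get_attribute("href")
            if href:
                seen.setdefault(href, None)
        # Scroll to the last product to trigger the next batch.
        driver.execute_script("arguments[0].scrollIntoView();", links[-1])
        # Wait until more links exist than before, or give up.
        deadline = time.monotonic() + load_timeout
        grew = False
        while time.monotonic() < deadline:
            if len(driver.find_elements(CSS_SELECTOR, selector)) > len(links):
                grew = True
                break
            time.sleep(poll)
        if not grew:
            break  # nothing new appeared: assume everything is loaded
    return list(seen)
```

With a real driver this would be called as `urls = collect_product_urls(driver)` after `driver.get(...)`. The timeout and polling interval are guesses and would need tuning for how slowly the site loads.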