Data not being fetched properly from a website

I want to get the URLs of the products from a particular website and then collect more data from each one, but I'm currently stuck at just getting the URLs. Here is what I tried.

Here are the modules I used:

import bs4
import pandas as pd
import numpy as np
import random
import requests
from lxml import etree
import time
from tqdm.notebook import tqdm

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotInteractableException
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager

I tried a number of workarounds. Here are some of them; I thought they were correct, but none of them work and I don't know why.

**1.**

driver = webdriver.Chrome(ChromeDriverManager().install())
urls = []

for page in tqdm(range(2, 10)):
    driver.get("https://www.sephora.com/shop/skincare?currentPage=" + str(page))
    for order in tqdm(range(0,70)):
        skincare = driver.find_elements(By.XPATH, "//main[@class='css-1owb2na']//div[@data-comp='ProductGrid ']//div[@class='css-1322gsb']//div[@class='css-1qe8tjm'][@style='order:" + str(order) + " ;']//a[@class='css-klx76']")
        for _skincare in skincare:
            urls.append({"url":_skincare.get_attribute('href')})
driver.quit()

I get nothing at all; "urls" is empty.

driver = webdriver.Chrome(ChromeDriverManager().install())
urls = []

for page in tqdm(range(2, 10)):
    driver.get("https://www.sephora.com/shop/skincare?currentPage=" + str(page))
    for order in tqdm(range(0,70)):
        skincare = driver.find_elements(By.XPATH, "//div[@style='order:" + str(order) + " ;']//a[@class='css-klx76']")
        for _skincare in skincare:
            urls.append({"url":_skincare.get_attribute('href')})
driver.quit()

I get the same result: nothing. It is probably because of the XPath, since "skincare" is also empty when I inspect it.

driver = webdriver.Chrome(ChromeDriverManager().install())
urls = []
for page in tqdm(range(2, 50)):
    driver.get("https://www.sephora.com/shop/skincare?currentPage=" + str(page))
    skincare = driver.find_elements(By.CLASS_NAME, 'css-klx76')
    for _skincare in skincare:
        urls.append({"url":_skincare.get_attribute('href')})
driver.quit()

With this one, using "By.CLASS_NAME", I only get a few of the many URLs I need per page, so it is still not right.

I'm apparently doing something wrong with the path, but I can't find what. Any comments? Thanks!

I tried to scrape data from a website and expected to get the URLs/links, but didn't get them.

CodePudding user response:

Try this code. It navigates to each page by page number, scrolls down to the bottom, then fetches all the product URLs.

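# Assumes "driver" and "urls" are already set up as in the question.
for page in range(1, 10):
    driver.get("https://www.sephora.com/shop/skincare?currentPage=" + str(page))
    # Page height before scrolling (assumed initialization; the loop needs a starting value)
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down in steps so that more products load
        driver.execute_script("window.scrollBy(0, 800);")
        sleep(1)
        new_height = driver.execute_script("return document.body.scrollHeight")
        # Stop once the page height no longer changes
        if new_height == last_height:
            break
        last_height = new_height

    # Give the last batch of products time to render
    sleep(3)

    skincare = driver.find_elements(By.XPATH, ".//a[@class='css-klx76']")
    print("Total URLs:", len(skincare))
    i = 1
    for _skincare in skincare:
        urls.append({f"url-{i}": _skincare.get_attribute('href')})
        i += 1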
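
One caveat with comparing page heights alone: if the height happens not to change after a single 800-pixel step, the loop can stop before the bottom of the page is reached. Below is a minimal sketch of the same scrolling idea wrapped in a helper that also checks the scroll position. The helper name `scroll_to_bottom` and its parameters `step`, `pause`, and `max_rounds` are made up for illustration and are not part of Selenium; the only Selenium call used is `driver.execute_script`. It has not been tested against the site in the question.

```python
import time


def scroll_to_bottom(driver, step=800, pause=1.0, max_rounds=200):
    """Scroll down in fixed steps until the page stops growing
    and the viewport has reached the bottom.

    `driver` is any object with an execute_script(script) method,
    such as a Selenium WebDriver. Returns the final page height.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):  # hard cap so the loop cannot run forever
        driver.execute_script(f"window.scrollBy(0, {step});")
        time.sleep(pause)  # give newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.scrollY"
            " >= document.body.scrollHeight - 1"
        )
        # Stop only when the height is stable AND we are at the bottom
        if new_height == last_height and at_bottom:
            break
        last_height = new_height
    return last_height


# Hypothetical usage inside the page loop from the answer:
#     driver.get("https://www.sephora.com/shop/skincare?currentPage=" + str(page))
#     scroll_to_bottom(driver)
#     skincare = driver.find_elements(By.XPATH, ".//a[@class='css-klx76']")
```

The `max_rounds` cap is a design choice to keep the loop from running forever on pages that keep loading content; the default of 200 is arbitrary and may need adjusting.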