Parsing URLs with Selenium when the XPaths differ in the middle part

I'm practicing web scraping with Selenium and trying to scrape all the product links from Lululemon's Women's page. When I tried to locate the product URLs by XPath and loop through them, I found that the part of the XPath that differs between products is in the middle of the path, so I cannot build the paths the way I expected.

For example, the XPaths of three products are:
/html/body/div[1]/div/main/div/section/div/div[3]/div[2]/div[2]/div/div[133]/div/div/div[2]/h3/a
/html/body/div[1]/div/main/div/section/div/div[3]/div[2]/div[2]/div/div[134]/div/div/div[2]/h3/a
/html/body/div[1]/div/main/div/section/div/div[3]/div[2]/div[2]/div/div[1]/div/div/div[2]/h3/a

As you can see, the XPaths differ only in the indices 133, 134, and 1, which appear to identify each product on this page.

So how can I build a list of all the products (if XPath is the right tool) that I can loop through to get every single product's link? Can anyone help me? I have pasted my current code and attached a screenshot for reference. Thank you so much!

# This is how I load the web page
from time import sleep
from selenium import webdriver

driver_path = 'D:/Python/Selenium/chromedriver'
url = "https://shop.lululemon.com/c/womens-leggings/_/N-8s6"
max_pass = 5

# Start Chrome and keep the window open after the script ends
option1 = webdriver.ChromeOptions()
option1.add_experimental_option('detach', True)
driver = webdriver.Chrome(chrome_options=option1, executable_path=driver_path)
driver.get(url)
sleep(2)

for i in range(max_pass):
    sleep(3)
    try:
        driver.find_element_by_xpath('/html/body/div[1]/div/main/div/section/div/div[4]/div/button/span').click()
    except:
        pass
    try:
        driver.find_element_by_xpath('/html/body/div[1]/div/main/div/section/div/div[2]/div/button/span').click()
    except:
        pass
sleep(3)
driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

# The next step should be to find the pattern for where each URL is located (this should give a list), then loop through that list to get the "href" of every single product.

# By the way, I also tried to locate the links by the class name "link lll-font-weight-medium", but I don't know why Python reports "Message: chrome not reachable (Session info: chrome=95.0.4638.69)".

[p.get_attribute('href') for p in driver.find_elements_by_class_name('link lll-font-weight-medium')] #this doesn't work
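
A note on the class-name attempt above: "link lll-font-weight-medium" is two class names separated by a space, while a class-name locator expects a single class name. One common workaround is to join the classes into a CSS selector. The helper below is a minimal sketch of that conversion; `class_attr_to_css` is a hypothetical name, and the assumption that the links are `a` elements carrying both classes comes from the question, not from checking the page. The "chrome not reachable" message is most likely a separate problem (it generally means the driver lost contact with the browser window), so this sketch only addresses the selector.

```python
def class_attr_to_css(class_attr: str, tag: str = "") -> str:
    """Turn a space-separated class attribute into a CSS selector.

    Every class is prefixed with "." and the parts are joined without
    spaces, so the selector requires all classes on the same element.
    """
    classes = class_attr.split()
    if not classes:
        raise ValueError("class_attr must contain at least one class name")
    return tag + "".join("." + name for name in classes)


# Assumption: the product links are <a> elements with both classes.
selector = class_attr_to_css("link lll-font-weight-medium", tag="a")
print(selector)  # a.link.lll-font-weight-medium
```

The resulting string can then be passed to a CSS-selector lookup such as driver.find_elements(By.CSS_SELECTOR, selector).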

CodePudding user response：

To print the href attributes, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

Using CSS_SELECTOR:

driver.get("https://shop.lululemon.com/c/womens-leggings/_/N-8s6")
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h3.product-tile__product-name > a")))])

Using XPATH:

driver.get("https://shop.lululemon.com/c/womens-leggings/_/N-8s6")
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h3[contains(@class, 'product-tile__product-name')]/a")))])

Console Output:

['https://shop.lululemon.com/p/womens-leggings/Invigorate-HR-Tight-25/_/prod9750552?color=52445', 'https://shop.lululemon.com/p/womens-leggings/Wunder-Train-HR-Tight-25/_/prod9750562?color=47184', 'https://shop.lululemon.com/p/womens-leggings/Instill-High-Rise-Tight-25/_/prod10641675?color=30210', 'https://shop.lululemon.com/p/womens-leggings/Base-Pace-High-Rise-Tight-25/_/prod10641591?color=51039', 'https://shop.lululemon.com/p/womens-leggings/Align-Crop-21-Shine/_/prod10850236?color=51756', 'https://shop.lululemon.com/p/women-pants/Fast-And-Free-Tight-II-NR/_/prod8960003?color=28948', 'https://shop.lululemon.com/p/women-pants/Align-Pant-Full-Length-28/_/prod8780551?color=46741', 'https://shop.lululemon.com/p/women-pants/Align-Pant-2/_/prod2020012?color=26950', 'https://shop.lululemon.com/p/women-pants/Align-Pant-Super-Hi-Rise-28/_/prod9200552?color=26083']

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
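
Why a relative XPath sidesteps the changing index (133, 134, 1) can be seen without a browser. The following is only an illustration under stated assumptions: it uses Python's standard xml.etree.ElementTree on a small made-up snippet instead of Selenium and the real page, and because ElementTree supports only a subset of XPath (no contains()), it matches the class attribute exactly.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the product grid (not the real page).
SAMPLE = """
<section>
  <div><div><h3 class="product-tile__product-name"><a href="/p/tight-a">A</a></h3></div></div>
  <div><div><h3 class="product-tile__product-name"><a href="/p/tight-b">B</a></h3></div></div>
  <div><p>banner without a product link</p></div>
  <div><div><h3 class="product-tile__product-name"><a href="/p/tight-c">C</a></h3></div></div>
</section>
""".strip()

root = ET.fromstring(SAMPLE)

# Index-based path: pinned to the second <div>, so it finds one tile only.
one = root.findall("./div[2]/div/h3/a")

# Relative path anchored on the heading: finds every tile, whatever its index.
every = root.findall(".//h3[@class='product-tile__product-name']/a")

hrefs = [a.get("href") for a in every]
print(len(one))  # 1
print(hrefs)     # ['/p/tight-a', '/p/tight-b', '/p/tight-c']
```

The same idea is what the XPATH locator in this answer relies on: it describes what a product link looks like rather than where it sits in the page.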

CodePudding user response：

To get all links of the displayed products you can use XPath, but in my opinion CSS selectors are quite a bit more comfortable:

for a in driver.find_elements(By.CSS_SELECTOR, '[data-testid="product-list"] h3 a'):
    print(a.get_attribute('href'))

Instead of printing inside the loop, you can also append the links to a list or process each product page directly.
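
A minimal sketch of the list variant, assuming you also want to drop empty and duplicate links. `unique_hrefs` and `FakeLink` are hypothetical names; `FakeLink` only stands in for a Selenium element so the snippet runs without a browser.

```python
def unique_hrefs(elements):
    """Collect each element's href, skipping empty values and duplicates.

    The original page order is kept.
    """
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


class FakeLink:
    """Hypothetical stand-in for a Selenium WebElement."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


links = [FakeLink("/p/a"), FakeLink("/p/b"), FakeLink("/p/a"), FakeLink(None)]
print(unique_hrefs(links))  # ['/p/a', '/p/b']
```

With a real driver, the argument would be the result of driver.find_elements(By.CSS_SELECTOR, '[data-testid="product-list"] h3 a').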

Example (Selenium 4):

...
driver.get(url)
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

for a in driver.find_elements(By.CSS_SELECTOR, '[data-testid="product-list"] h3 a'):
    print(a.get_attribute('href'))
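
The scroll loop above runs until the page height stops changing, which could loop for a long time on a page that keeps growing. One possible variant, sketched here as an assumption and not as part of the original answer, wraps the loop in a function with an upper bound on the number of scrolls. `scroll_until_stable`, `max_rounds`, and `FakeDriver` are hypothetical names; `FakeDriver` only imitates execute_script so the snippet runs without a browser.

```python
import time


def scroll_until_stable(driver, pause=0.5, max_rounds=50):
    """Scroll to the bottom until the height stops growing or max_rounds is hit.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded products time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Hypothetical stand-in: the page grows twice, then keeps its height."""

    def __init__(self):
        self._heights = [1000, 2000, 3000, 3000]
        self._reads = 0

    def execute_script(self, script):
        if script.startswith("return"):
            index = min(self._reads, len(self._heights) - 1)
            self._reads += 1
            return self._heights[index]
        return None  # scrolling itself returns nothing


rounds = scroll_until_stable(FakeDriver(), pause=0)
print(rounds)  # 3
```

With a real driver, calling scroll_until_stable(driver) would replace the while loop before the links are collected.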