Using Selenium, Python, and XPath to grab image URLs from a website doesn't work


None of this seems to work: the browser just closes, or it just prints "None". Any idea whether my XPaths are wrong, or what else is going on?

Thanks a lot in advance.

Here's the HTML containing the image:

```html

<a data-altimg="" data-prdcount="" href="/product/prd-5178/levis-505-regular-jeans-men.jsp?prdPV=5" rel="/product/prd-5178/levis-505-regular-jeans-men.jsp?prdPV=5">
              <img alt="Men's Levi's® 505™ Regular Jeans"  title="Men's Levi's® 505™ Regular Jeans" width="120px" data-herosrc="https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=240&amp;hei=240&amp;op_sharpen=1" loading="lazy" srcset="https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=240&amp;hei=240&amp;op_sharpen=1 240w, https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=152&amp;hei=152&amp;op_sharpen=1 152w" sizes="(max-width: 728px) 20px" src="https://media.kohlsimg.com/is/image/kohls/5178_Light_Blue?wid=240&amp;hei=240&amp;op_sharpen=1">
            </a>

```

Here's my script:

```python

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#from selenium.webdriver.common.action_chains import ActionChains
import time


# Start a webdriver instance
browser = webdriver.Firefox()

# Navigate to the page you want to scrape
browser.get('https://www.kohls.com/catalog/mens-clothing.jsp?CN=Gender:Mens Department:Clothing&cc=mens-TN2.0-S-mensclothing')
time.sleep(12)

#images = browser.find_elements(By.XPATH, "//img[@class='pmp-hero-img']")
#images = browser.find_elements(By.CLASS_NAME, 'pmp-hero-img')
images = browser.find_elements(By.XPATH, "/html/body/div[2]/div[2]/div[2]/div[2]/div[1]/div/div/div[3]/div/div[4]/ul/li[*]/div[1]/div[2]/a/img")
#images = browser.find_elements(By.XPATH, "//*[@id='root_panel4124695']/div[4]/ul/li[5]/div[1]/div[2]/a/img")

 
for image in images:
    prod_img = (image.get_attribute("src"))
    print(prod_img)


# Close the webdriver instance
browser.close()

```

I tried to get the URLs, but wasn't successful.

CodePudding user response:

First, do not use very long XPath strings. They are hard to read and work with.

You can find your images like this:

```python
images = browser.find_elements(By.CSS_SELECTOR, 'img[data-herosrc]')
```
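If you prefer XPath, a short relative expression keyed on the same attribute works too. A sketch, assuming the `data-herosrc` attribute from the posted HTML:

```python
images = browser.find_elements(By.XPATH, "//img[@data-herosrc]")
```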

Now read the attribute you actually want:

```python
for image in images:
    prod_img = image.get_attribute("data-herosrc")
    print(prod_img)
```
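Putting it together, here is a minimal runnable sketch. It follows the same idea but swaps the fixed `time.sleep(12)` for an explicit wait, which is generally more reliable; the 20-second timeout is an arbitrary choice:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
try:
    browser.get("https://www.kohls.com/catalog/mens-clothing.jsp"
                "?CN=Gender:Mens Department:Clothing&cc=mens-TN2.0-S-mensclothing")
    # Wait until at least one matching <img> is present instead of sleeping.
    images = WebDriverWait(browser, 20).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "img[data-herosrc]"))
    )
    for image in images:
        print(image.get_attribute("data-herosrc"))
finally:
    browser.quit()  # quit() also ends the driver process, unlike close()
```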

CodePudding user response:

As I said in my comment, I suggest always trying a request-only approach first. There are only a few limited use cases where browser-based web automation is really needed.

First, I would like to give you step-by-step instructions on how I would do such a job.

  1. Go to the website and look for the data you want scraped.
  2. Open the browser dev tools and go to the Network tab.
  3. Hard-reload the page and look for the backend API calls that return the data you are looking for.
  4. If the site is server-side rendered (SSR), e.g. with PHP, you need to extract the data from the raw HTML. Most sites today, however, are client-side rendered (CSR) and receive their content dynamically (a sketch of both cases follows this list).
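To make step 4 concrete, here is a minimal sketch of both cases. The URLs are placeholders, and the SSR branch assumes BeautifulSoup (`pip install beautifulsoup4`) is available:

```python
import requests
from bs4 import BeautifulSoup

# CSR case: the backend answers with JSON you can consume directly.
data = requests.get("https://example.com/api/products").json()

# SSR case: you receive raw HTML and have to parse it yourself.
html = requests.get("https://example.com/products").text
soup = BeautifulSoup(html, "html.parser")
image_urls = [img["src"] for img in soup.find_all("img", src=True)]
```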

The biggest advantage of this approach is that you can extract far more content per request. Most APIs deliver their data in JSON format, which you can use directly. Now let's look at your example:

While inspecting the Network tab, this request came to my attention:

https://api-bd.kohls.com/v1/ede/experiences?cid=WebStore&pgid=PMP&plids=Horizontal1|15

Further inspection shows that this API call returns all the products and the corresponding information, such as image URLs. Now all you need to do is check whether you can manipulate the call to return more products, and then save the URLs.

When we inspect the API call with Postman, we can see that one of its parameters is the following:

Horizontal1|15

It seems that the 15 at the end corresponds to the number of products returned by the backend. Let's test it with 100:

https://api-bd.kohls.com/v1/ede/experiences?cid=WebStore&pgid=PMP&plids=Horizontal1|100

I was right: changing this URL parameter gets us more products. Let's see what the upper boundary is by setting the parameter to the maximum number of products.

I tested it, and it did not work. The upper boundary is 155, so you can scrape 155 products per request. Not too shabby. But how do we retrieve the rest? Let's investigate that URL further.

Hmm... it seems we can't get the data for the following pages with the same URL; the site uses a different URL for subsequent pages. That's a bummer.

Here is the code for the first page:

```python
import requests

url = "https://api-bd.kohls.com/v1/ede/experiences?cid=WebStore&pgid=PMP&plids=Horizontal1|100"

payload = "{\"departmentName\":\"Clothing\",\"gender\":\"Mens\",\"mcmId\":\"39824086452562678713609272051249254622\"}"
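# NOTE: the API key, mcmId, and cookie were captured from a live browser
# session; they are session-specific and will need to be refreshed.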
headers = {
  'x-app-api_key': 'NQeOQ7owHHPkdkMkKuH5tPpGu0AvIIOu',
  'Content-Type': 'text/plain',
  'Cookie': '_abck=90C88A9A2DEE673DCDF0BE9C3126D29B~-1~YAAQnTD2wapYufCEAQAA /cLUAmeLRA xZuD/BVImKI dwMVjQ/jXiYnVjPyi/kRL3dKnruzGKHvwWDd8jBDcwGHHKgJbJ0Hyg60cWzpwLDLr7QtA969asl8ENsgUF6Qu37tVpmds9K7H/k4zr2xufBDD/QcUejcrvj3VGnWbgLCb6MDhUJ35QPh41dOVUzJehpnZDOs/fucNOil1CGeqhyKazN9f16STd4T8mBVhzh3c6NVRrgPV1a 5itJfP NryOPkUj4L1C9X5DacYEUJauOgaKhoTPoHxXXvKyjmWwYJJJ sdU05zJSWvM5kYuor15QibXx714mO3aBuYIAHY3k3vtOaDs2DqGbpS/dnjJAiRQ8dmC5ft9 PvPtaeFFxflv8Ldo KTViHuYAqTNWntvQrinZxAif8pJnMzd00ipxmrj2NrLgxIKQOu/s1VNsaXrLiAIxADl7nMm7lAEr5rxKa27/RjCxA SLuaz0w9PnYdxUdyfqKeWuTLy5EmRCUCYgzyhO3i8dUTSSgDLglilJMM9~0~-1~1672088271; _dyid_server=7331860388304706917; ak_bmsc=B222629176612AB1EBF71F96EAB74FA1~000000000000000000000000000000~YAAQnTD2wXhfufCEAQAAxuAOUBKVYljdVEM6mA086TVhGvlLmbRxihQ 5L1dtLAKrX5oFG1qC dg6EbPflwDVA7cwPkft84vUGj0bJkllZnVb0FZKSuVwD728oW1 rCdos7GLBUTkq3XFzCXh/qMr8oagYPHKMIEzXb839 BKmOjGlNvBQsP/eJm BmxnSlYq03uLpBZVRdmPX7mDAq2KyPq9kCnB 6o D eVfzchurFxbpvmWb XCG0oAD V5PgW3nsSey99M27WSy4LMyFFljUqLPkSdTRFQGrm8Wfwci6rWuoGgVpF00JAVBpdO2eIVjxQdBVXS7q5CmNYRifMU3I1GpLUr6EH kKoeMiDQNhvU95KXg/e8lrTkvaaJLOs5BZjeC3ueLY; bm_sv=CF184EA45C8052AF231029FD15170EBD~YAAQnTD2wSxgufCEAQAARkkPUBKJBEwgLsWkuV8MSzWmw5svZT0N7tUML8V5x3su83RK3/7zJr0STY4BrULhET6zGrHeEo1xoSz0qvgRHB3NGYVE6QFAhRZQ4qnqNoLBxM/EhIXl2wBere10BrAtmc8lcIYSGkPr8emEekEQ9bBLUL9UqXyJWSoaDjlY7Z2NdEQVQfO5Z8NxQv5usQXOBCqW/ukgxbuM3C5S2byDmjLtU7f2S5VjdimJ3kNSzD80~1; 019846f7bdaacd7a765789e7946e59ec=52e83be20b371394f002256644836518; akacd_EDE_GCP=2177452799~rv=5~id=a309910885f7706f566b983652ca61e9'
}

response = requests.request("POST", url, headers=headers, data=payload)

data = response.json()
print(data)
for product in data["payload"]["experiences"][0]["expPayload"]["products"]:
    print(product["image"]["url"])
```

Do something similar for the following pages and you will be set.
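To wrap up, here is a hedged sketch that bundles the same call into a reusable function with a product-count parameter and defensive JSON access. It deliberately omits the session-specific cookie and mcmId, so whether the endpoint accepts the stripped-down request is untested:

```python
import requests

API_KEY = "NQeOQ7owHHPkdkMkKuH5tPpGu0AvIIOu"  # taken from the captured request above

def fetch_image_urls(count=100):
    """Fetch up to `count` (max 155, per the experiments above) product image URLs."""
    url = ("https://api-bd.kohls.com/v1/ede/experiences"
           f"?cid=WebStore&pgid=PMP&plids=Horizontal1|{count}")
    payload = '{"departmentName":"Clothing","gender":"Mens"}'
    headers = {"x-app-api_key": API_KEY, "Content-Type": "text/plain"}

    response = requests.post(url, headers=headers, data=payload, timeout=30)
    response.raise_for_status()
    data = response.json()

    # Walk the structure defensively in case a key is missing or empty.
    experiences = data.get("payload", {}).get("experiences") or [{}]
    products = experiences[0].get("expPayload", {}).get("products", [])
    return [p.get("image", {}).get("url") for p in products]

if __name__ == "__main__":
    for image_url in fetch_image_urls(155):
        print(image_url)
```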
