Python / Selenium. How to find relationship between data?

I'm creating an Amazon web scraper which returns the name and price of every product in the search results. It will loop through a dictionary of strings (products) and collect the titles and prices for all results. I'm doing this to calculate the average (mean) price of a product, and also to find the highest and lowest prices for that product on Amazon.

So making the scraper was easy enough. Here's a snippet so you understand the code I am using.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.amazon.co.uk/s?k=nike shoes&crid=25W2RSXZBGPX3&sprefix=nike shoe,aps,105&ref=nb_sb_noss_1")

# retrieving item titles
shoes = driver.find_elements(By.XPATH, '//span[@]')
shoes_list = [s.text for s in shoes]

# retrieving prices
price = driver.find_elements(By.XPATH, '//span[@]')
price_list = [p.text for p in price]

# prices are returned with a newline instead of a decimal point
# example: £9\n99 instead of £9.99
# so fixing that
price_list = [p.replace("\n", ".") for p in price_list]

So here's the issue. Almost without fail, Amazon shows a handful of the products with no price. This really messes things up, because once I've put the data into a dataframe:

from pandas import DataFrame

title_and_price = list(zip(shoes_list, price_list))
df = DataFrame(title_and_price, columns=['Product', 'Price'])

At some point the data gets mixed up and a price ends up next to the wrong product. I have left screenshots below for you to see.

[Screenshot: Missing prices on Amazon site] [Screenshot: Incorrect data]

Unfortunately, when a product has no price, the price query does not return a 'blank' entry for it. If it did, I wouldn't need to ask for help: I could just display a blank price next to the item and everything would remain in order.

Is there any way to alter the code so that it detects a non-displayed price and keeps all the data in order? The data stays in order right up until there's a product with no price, and every Amazon search I have tried contains at least one. I'd really appreciate any insight on this.
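The misalignment described above can be reproduced without a browser. The sketch below uses made-up titles and prices (they are not taken from Amazon) to show what happens when two independently collected lists are paired by position and one of them is missing an entry:

```python
# Made-up sample data: "Shoe B" has no price on the page,
# so the price list is one entry shorter than the title list.
titles = ["Shoe A", "Shoe B", "Shoe C"]
prices = ["£10.00", "£30.00"]  # the price that belongs to "Shoe C" is now at index 1

# zip() pairs items purely by position and stops at the shorter list.
pairs = list(zip(titles, prices))

for title, price in pairs:
    print(title, price)
```

"Shoe B" is paired with the price that belongs to "Shoe C", and "Shoe C" is dropped entirely, because nothing in the two flat lists records which price came from which product.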

CodePudding user response：

To make sure each price stays married to its shoe name, locate the parent element that contains both the name and the price, then read both from that parent and add them as a tuple to a list (which will become the dataframe), as in the example below:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")


webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

df_list = []
url = 'https://www.amazon.co.uk/s?k=nike shoes&crid=25W2RSXZBGPX3&sprefix=nike shoe,aps,105&ref=nb_sb_noss_1'
browser.get(url)

shoes = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".s-result-item")))
for shoe in shoes:
    # print(shoe.get_attribute('outerHTML'))
    try:
        # the title and the price are both looked up inside the same result container
        shoe_title = shoe.find_element(By.CSS_SELECTOR, ".a-text-normal")
    except Exception:
        continue  # no title found: probably not a product, skip it
    try:
        shoe_price = shoe.find_element(By.CSS_SELECTOR, 'span[]')
    except Exception:
        continue  # no price displayed: skip this result
    df_list.append((shoe_title.text.strip(), shoe_price.text.strip()))
df = pd.DataFrame(df_list, columns = ['Shoe', 'Price'])
print(df)

This would return (depending on Amazon's appetite for serving ads in HTML tags similar to those used for products):

Shoe    Price
0   Nike NIKE AIR MAX MOTION 2, Men's Running Shoe...   £79\n99
1   Nike Air Max 270 React Se GS Running Trainers ...   £69\n99
2   NIKE Women's Air Max Low Sneakers   £69\n99
3   NIKE Men's React Miler Running Shoes    £109\n99
4   NIKE Men's Revolution 5 Flyease Running Shoe    £38\n70
5   NIKE Women's W Revolution 6 Nn Running Shoe £48\n00
6   NIKE Men's Downshifter 10 Running Shoe  £54\n99
7   NIKE Women's Court Vision Low Better Basketbal...   £30\n00
8   NIKE Team Hustle D 10 Gs Gymnastics Shoe, Blac...   £20\n72
9   NIKE Men's Air Max Wright Gs Running Shoe   £68\n51
10  NIKE Men's Air Max Sc Trainers  £54\n99
11  NIKE Pegasus Trail 3 Gore-TEX Men's Waterproof...   £134\n95
12  NIKE Women's W Superrep Go 2 Sneakers   £54\n00
13  NIKE Boys Tanjun Running Shoes  £35\n53
14  NIKE Women's Air Max Bella Tr 4 Gymnastics Sho...   £28\n00
15  NIKE Men's Defy All Day Gymnastics Shoe £54\n95
16  NIKE Men's Venture Runner Sneaker   £45\n90
17  Nike Nike Court Borough Low 2 (gs), Boy's Bask...   £24\n00
18  NIKE Men's Court Royale 2 Better Essential Tra...   £25\n81
19  NIKE Men's Quest 4 Running Shoe £38\n00
20  Women Trainers Running Shoes - Air Cushion Sne...   £35\n69
21  Men Women Walking Trainers Light Running Breat...   £42\n99
22  JSLEAP Mens Running Shoes Fashion Non Slip Ath...   £44\n99
[...]
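The prices in the output above still contain a newline where the decimal point should be. Below is a minimal sketch of turning such strings into numbers and computing the mean, lowest and highest price the question is after. The helper name `parse_price` is hypothetical, and the handling of the `£` sign and of thousands separators is an assumption about how prices are formatted on the page:

```python
from statistics import mean


def parse_price(raw):
    """Convert a scraped string such as '£79\\n99' to the float 79.99.

    Returns None for blank or unparseable values (hypothetical helper).
    """
    if not raw:
        return None
    # Assumption: the newline stands for the decimal point, and any commas
    # are thousands separators that can be dropped.
    cleaned = raw.replace("\n", ".").replace("£", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


# Three price strings from the output above, plus one blank entry.
scraped = ["£79\n99", "£69\n99", "", "£109\n99"]
prices = [parse_price(p) for p in scraped]

# Leave out missing prices before computing the statistics.
valid = [p for p in prices if p is not None]
summary = {"mean": round(mean(valid), 2), "min": min(valid), "max": max(valid)}
print(summary)
```

Returning None for a blank price, rather than 0, keeps missing values from dragging the average down.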

You should pay attention to a couple of things:

I wait for the elements to load in the page before trying to locate them; see the imports (WebDriverWait etc.).
Your results may vary, depending on your advertising profile.
You can select more details for each item and use different CSS/XPath/etc. selectors; this is only meant to give you a head start.
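Note that the loop above skips any result that has no price. If you would rather keep those rows with a blank price, as the question asks, one option is to use `find_elements` (plural), which returns an empty list instead of raising when nothing matches. The sketch below is only an illustration: `first_text`, `collect_rows`, the `FakeElement` stand-in and the `.title` / `.price` selectors are all hypothetical, used so the example runs without a browser. With Selenium you would pass the real result containers, `By.CSS_SELECTOR` and your own selectors.

```python
from dataclasses import dataclass, field


def first_text(container, by, selector):
    """Return the stripped text of the first match inside `container`, or None."""
    # find_elements returns an empty list (no exception) when nothing matches.
    matches = container.find_elements(by, selector)
    return matches[0].text.strip() if matches else None


def collect_rows(containers, by, title_selector, price_selector):
    """Build (title, price) tuples; the price is None when it is not displayed."""
    rows = []
    for container in containers:
        title = first_text(container, by, title_selector)
        if title is None:
            continue  # no title: not treated as a product
        rows.append((title, first_text(container, by, price_selector)))
    return rows


@dataclass
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement, for demonstration only."""
    text: str = ""
    children: dict = field(default_factory=dict)

    def find_elements(self, by, selector):
        return self.children.get(selector, [])


# Made-up result cards; the second one has no price.
cards = [
    FakeElement(children={".title": [FakeElement("Shoe A")], ".price": [FakeElement("£10\n00")]}),
    FakeElement(children={".title": [FakeElement("Shoe B")]}),
    FakeElement(children={".title": [FakeElement("Shoe C")], ".price": [FakeElement("£30\n00")]}),
]

# "css selector" is the string value of Selenium's By.CSS_SELECTOR.
rows = collect_rows(cards, "css selector", ".title", ".price")
print(rows)
```

Passing `rows` to `pd.DataFrame(rows, columns=['Shoe', 'Price'])` then gives one row per product, with a missing value in the Price column wherever no price was shown.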