Python / Selenium. How to find relationship between data?

Time:07-31

I'm creating an Amazon web scraper that returns the name and price of every product in the search results. It will loop through a dictionary of strings (products) and collect the titles and prices for all results. I'm doing this to calculate the average (mean) price of a product and to find the highest and lowest prices for that product on Amazon.

So making the scraper was easy enough. Here's a snippet so you understand the code I am using.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.amazon.co.uk/s?k=nike shoes&crid=25W2RSXZBGPX3&sprefix=nike shoe,aps,105&ref=nb_sb_noss_1")

# retrieving item titles
shoes = driver.find_elements(By.XPATH, '//span[@]')
shoes_list = [s.text for s in shoes]

# retrieving prices
price = driver.find_elements(By.XPATH, '//span[@]')
price_list = [p.text for p in price]

# prices are returned with a newline instead of a decimal point
# example: £9\n99 instead of £9.99
# so fixing that

temp_price_list = []
for price in price_list:
    price = price.replace("\n", ".")
    temp_price_list.append(price)
price_list = temp_price_list

So here's the issue. Almost without fail, a handful of the products on Amazon have no price. This really messes things up, because once I've put the data into a DataFrame:

from pandas import DataFrame

title_and_price = list(zip(shoes_list, price_list))
df = DataFrame(title_and_price, columns=['Product','Price'])

At some point the data gets misaligned and a price ends up next to the wrong product. I have left screenshots below for you to see.

[Screenshots: "Missing prices on Amazon site" and "Incorrect data"]

Unfortunately, when a product has no price, the scraper does not return a blank entry for it. If it did, I wouldn't need to ask for help: I could just display a blank price next to the item and everything would stay in order.

Is there any way to alter the code so that it detects a missing price and keeps all the data in order? The data stays in order right up until there's a product with no price, which happens in every single Amazon search. I'd really appreciate any insight on this.

CodePudding user response:

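Because the titles and prices are collected into two separate lists and then zipped by position, a single missing price shifts every later price onto the wrong product. To make sure each price stays married to its shoe name, locate the parent element that contains both the name and the price, and add them as a tuple to a list (which will become a dataframe), as in the example below: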

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")


webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

df_list = []
url = 'https://www.amazon.co.uk/s?k=nike shoes&crid=25W2RSXZBGPX3&sprefix=nike shoe,aps,105&ref=nb_sb_noss_1'
browser.get(url)

shoes = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".s-result-item")))
for shoe in shoes:
#     print(shoe.get_attribute('outerHTML'))
    try:
        shoe_title = shoe.find_element(By.CSS_SELECTOR, ".a-text-normal")
    except Exception as e:
        continue
    try:
        shoe_price = shoe.find_element(By.CSS_SELECTOR, 'span[]')
    except Exception as e:
        continue
    df_list.append((shoe_title.text.strip(), shoe_price.text.strip()))
df = pd.DataFrame(df_list, columns = ['Shoe', 'Price'])
print(df)

This would return (depending on Amazon's appetite for serving ads in html tags similar to products):

Shoe    Price
0   Nike NIKE AIR MAX MOTION 2, Men's Running Shoe...   £79\n99
1   Nike Air Max 270 React Se GS Running Trainers ...   £69\n99
2   NIKE Women's Air Max Low Sneakers   £69\n99
3   NIKE Men's React Miler Running Shoes    £109\n99
4   NIKE Men's Revolution 5 Flyease Running Shoe    £38\n70
5   NIKE Women's W Revolution 6 Nn Running Shoe £48\n00
6   NIKE Men's Downshifter 10 Running Shoe  £54\n99
7   NIKE Women's Court Vision Low Better Basketbal...   £30\n00
8   NIKE Team Hustle D 10 Gs Gymnastics Shoe, Blac...   £20\n72
9   NIKE Men's Air Max Wright Gs Running Shoe   £68\n51
10  NIKE Men's Air Max Sc Trainers  £54\n99
11  NIKE Pegasus Trail 3 Gore-TEX Men's Waterproof...   £134\n95
12  NIKE Women's W Superrep Go 2 Sneakers   £54\n00
13  NIKE Boys Tanjun Running Shoes  £35\n53
14  NIKE Women's Air Max Bella Tr 4 Gymnastics Sho...   £28\n00
15  NIKE Men's Defy All Day Gymnastics Shoe £54\n95
16  NIKE Men's Venture Runner Sneaker   £45\n90
17  Nike Nike Court Borough Low 2 (gs), Boy's Bask...   £24\n00
18  NIKE Men's Court Royale 2 Better Essential Tra...   £25\n81
19  NIKE Men's Quest 4 Running Shoe £38\n00
20  Women Trainers Running Shoes - Air Cushion Sne...   £35\n69
21  Men Women Walking Trainers Light Running Breat...   £42\n99
22  JSLEAP Mens Running Shoes Fashion Non Slip Ath...   £44\n99
[...]
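
Note that the loop above skips any product that has no price. If you would rather keep those products with a blank price, as the question asks, a minimal sketch of a variant is below. The function name collect_rows and its two locator parameters are hypothetical, introduced here for illustration; each locator is a (by, selector) pair like the ones passed to find_element above. It catches a broad Exception, as the code above does; with Selenium you could narrow that to NoSuchElementException.

```python
def collect_rows(cards, title_locator, price_locator):
    """Return (title, price) tuples, one per card that has a title.

    A card without a price is kept, with None in the price slot,
    so titles and prices can never drift out of step.
    """
    rows = []
    for card in cards:
        try:
            title = card.find_element(*title_locator).text.strip()
        except Exception:
            # No title: most likely an ad or a layout block, so skip it.
            continue
        try:
            price = card.find_element(*price_locator).text.strip()
        except Exception:
            # No price shown: keep the product and record the gap.
            price = None
        rows.append((title, price))
    return rows
```

With the code above this would be called as collect_rows(shoes, (By.CSS_SELECTOR, ".a-text-normal"), price_locator), where price_locator is whichever (By, selector) pair matches the price on your page, and the result can be passed to pd.DataFrame(rows, columns=['Shoe', 'Price']) as before.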

You should pay attention to a couple of things:

  • I wait for the elements to load in the page before trying to locate them; see the imports (WebDriverWait etc.)
  • Your results may vary, depending on your advertising profile
  • You can select more details for each item, or use different CSS/XPath/etc. selectors; this is meant to give you a head start only
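
Since the goal in the question is the average, highest and lowest price, the scraped strings still need to become numbers first. The following is only a sketch using the standard library; the names parse_price and summarise are hypothetical, and it assumes prices look like the ones in the output above (a currency symbol, then pounds and pence separated by a newline), with an empty or missing price treated as "no data".

```python
import re
from statistics import mean


def parse_price(text):
    # Convert a scraped price string (pounds and pence split by a newline)
    # into a float. Returns None when there is no usable price.
    if not text:
        return None
    # Rejoin pounds and pence with a dot and drop thousands separators.
    cleaned = text.strip().replace("\n", ".").replace(",", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else None


def summarise(prices):
    # Ignore products without a price instead of letting them skew the stats.
    values = [p for p in (parse_price(t) for t in prices) if p is not None]
    if not values:
        return None
    return {"mean": mean(values), "min": min(values), "max": max(values)}
```

For example, summarise(df['Price']) would work on the Price column of the dataframe, since it only needs an iterable of strings. Leaving unpriced products out of the statistics is a design choice; counting them as zero would drag the average down.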