I'm creating an Amazon web scraper that returns the name and price of every product in the search results. It will loop through a dictionary of strings (products) and collect the titles and prices of all results. I'm doing this to calculate the mean price of a product and to find the highest and lowest prices for that product on Amazon.
Making the scraper was easy enough. Here's a snippet of the code I'm using:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get("https://www.amazon.co.uk/s?k=nike shoes&crid=25W2RSXZBGPX3&sprefix=nike shoe,aps,105&ref=nb_sb_noss_1")
# retrieving item titles
shoes = driver.find_elements(By.XPATH, '//span[@]')
shoes_list = []
for s in range(len(shoes)):
    shoes_list.append(shoes[s].text)
# retrieving prices
price = driver.find_elements(By.XPATH, '//span[@]')
price_list = []
for p in range(len(price)):
    price_list.append(price[p].text)
# prices are returned with a newline instead of a decimal point
# example: £9\n99 instead of £9.99
# so fixing that
temp_price_list = []
for price in price_list:
    price = price.replace("\n", ".")
    temp_price_list.append(price)
price_list = temp_price_list
So here's the issue. Almost without fail, Amazon lists a handful of products with no price, which really messes things up. Once I've sorted the data into a dataframe:
from pandas import DataFrame
title_and_price = list(zip(shoes_list, price_list))
df = DataFrame(title_and_price, columns=['Product', 'Price'])
At some point the data gets mixed up and the price will be sorted next to the wrong product. I have left screenshots below for you to see.
[Screenshot: missing prices on the Amazon site] [Screenshot: incorrect data]
Unfortunately, when a price is missing, the scraper does not return a blank entry for it. If it did, I wouldn't need to ask for help: I could just display a blank price next to the item and everything would stay in order.
Is there any way to alter the code so that it detects a non-displayed price and keeps all the data in order? The data stays in order right up until there's a product with no price, which happens in every single Amazon search. I'd really appreciate any insight on this.
User response:
To make sure each price stays paired with its shoe name, locate the parent element that contains both the name and the price, and add the two as a tuple to a list (which will become a dataframe), as in the example below. Note that the loop skips any result that has no title or no price:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
df_list = []
url = 'https://www.amazon.co.uk/s?k=nike shoes&crid=25W2RSXZBGPX3&sprefix=nike shoe,aps,105&ref=nb_sb_noss_1'
browser.get(url)
shoes = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".s-result-item")))
for shoe in shoes:
    # print(shoe.get_attribute('outerHTML'))
    try:
        shoe_title = shoe.find_element(By.CSS_SELECTOR, ".a-text-normal")
    except Exception as e:
        continue
    try:
        shoe_price = shoe.find_element(By.CSS_SELECTOR, 'span[]')
    except Exception as e:
        continue
    df_list.append((shoe_title.text.strip(), shoe_price.text.strip()))
df = pd.DataFrame(df_list, columns = ['Shoe', 'Price'])
print(df)
This would return (depending on Amazon's appetite for serving ads in html tags similar to products):
Shoe Price
0 Nike NIKE AIR MAX MOTION 2, Men's Running Shoe... £79\n99
1 Nike Air Max 270 React Se GS Running Trainers ... £69\n99
2 NIKE Women's Air Max Low Sneakers £69\n99
3 NIKE Men's React Miler Running Shoes £109\n99
4 NIKE Men's Revolution 5 Flyease Running Shoe £38\n70
5 NIKE Women's W Revolution 6 Nn Running Shoe £48\n00
6 NIKE Men's Downshifter 10 Running Shoe £54\n99
7 NIKE Women's Court Vision Low Better Basketbal... £30\n00
8 NIKE Team Hustle D 10 Gs Gymnastics Shoe, Blac... £20\n72
9 NIKE Men's Air Max Wright Gs Running Shoe £68\n51
10 NIKE Men's Air Max Sc Trainers £54\n99
11 NIKE Pegasus Trail 3 Gore-TEX Men's Waterproof... £134\n95
12 NIKE Women's W Superrep Go 2 Sneakers £54\n00
13 NIKE Boys Tanjun Running Shoes £35\n53
14 NIKE Women's Air Max Bella Tr 4 Gymnastics Sho... £28\n00
15 NIKE Men's Defy All Day Gymnastics Shoe £54\n95
16 NIKE Men's Venture Runner Sneaker £45\n90
17 Nike Nike Court Borough Low 2 (gs), Boy's Bask... £24\n00
18 NIKE Men's Court Royale 2 Better Essential Tra... £25\n81
19 NIKE Men's Quest 4 Running Shoe £38\n00
20 Women Trainers Running Shoes - Air Cushion Sne... £35\n69
21 Men Women Walking Trainers Light Running Breat... £42\n99
22 JSLEAP Mens Running Shoes Fashion Non Slip Ath... £44\n99
[...]
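Since the stated goal is the mean, highest and lowest price, the Price strings above have to become numbers first. Here is a minimal sketch of that step, assuming the prices look like the £79\n99 values in the output. The PriceGBP column name and the row without a price are made up for illustration:

```python
import pandas as pd

# Two rows copied from the output above, plus one hypothetical row with no price.
df = pd.DataFrame(
    [
        ("NIKE Men's React Miler Running Shoes", "£109\n99"),
        ("NIKE Men's Revolution 5 Flyease Running Shoe", "£38\n70"),
        ("Hypothetical product without a price", None),
    ],
    columns=["Shoe", "Price"],
)

# Turn "£109\n99" into "109.99": newline -> decimal point, drop the currency
# symbol and any thousands separator.
cleaned = (
    df["Price"]
    .str.replace("\n", ".", regex=False)
    .str.replace("£", "", regex=False)
    .str.replace(",", "", regex=False)
)

# Anything that cannot be parsed (including a missing price) becomes NaN.
df["PriceGBP"] = pd.to_numeric(cleaned, errors="coerce")

# mean/min/max ignore NaN by default, so unpriced rows do not skew the result.
summary = df["PriceGBP"].agg(["mean", "min", "max"])
print(summary)
```

Keeping the numeric values in a separate column leaves the original strings available for checking against the page.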
You should pay attention to a couple of things:
- The code waits for the result elements to be present in the page before trying to locate them; see the imports (WebDriverWait, expected_conditions).
- Your results may vary, depending on your advertising profile.
- You can select more details for each item and use different CSS/XPath selectors; this is only meant to give you a head start.
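If you would rather keep products that show no price, with a blank next to them, than skip them, one option is find_elements, which returns an empty list instead of raising when nothing matches. This is only a rough sketch: extract_row is a hypothetical helper, card is assumed to be one of the .s-result-item elements, and both locators are passed in as (By, selector) tuples because the right price selector depends on Amazon's current markup:

```python
from typing import Optional, Tuple

Locator = Tuple[str, str]  # e.g. (By.CSS_SELECTOR, ".a-text-normal")


def extract_row(
    card, title_locator: Locator, price_locator: Locator
) -> Optional[Tuple[str, Optional[str]]]:
    """Return (title, price) for one result card.

    Returns None when the card has no title (assumed not to be a product).
    The price is None when the card shows no price, so the title is still
    kept and every price stays next to its own product.
    """
    titles = card.find_elements(*title_locator)
    if not titles:
        return None

    prices = card.find_elements(*price_locator)
    # The element text uses a newline where the decimal point should be.
    price = prices[0].text.strip().replace("\n", ".") if prices else None

    return titles[0].text.strip(), price
```

In the loop above you would call it once per shoe and append the result to df_list whenever it is not None; rows with a None price then show up as missing values in the dataframe.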