I have been attempting to use Selenium to scrape product information, and have successfully scraped the information I need from a single category page containing 40 products.
I plan to find a way to automatically collect the hrefs for all category pages and then scrape each one. I understand that Scrapy-Playwright is probably a better solution, but I have investigated that option and there is a reason I cannot use it.
Before I figure out a way of scraping multiple categories automatically, I am trying to build a function that loops through multiple category pages that I provide in a csv file, to test that it works.
When attempting to do so I repeatedly receive the error:
Traceback (most recent call last):
  File "h:\Python\aldi\aldi\selenium_test2.py", line 76, in <module>
    scraper(cat_urls)
  File "h:\Python\aldi\aldi\selenium_test2.py", line 71, in scraper
    temp_prod_info = [ {'product_id': id_list[i], 'product_name': name_list[i], 'price': price_list[i] } for i in range(len(names)) ]
  File "h:\Python\aldi\aldi\selenium_test2.py", line 71, in <listcomp>
    temp_prod_info = [ {'product_id': id_list[i], 'product_name': name_list[i], 'price': price_list[i] } for i in range(len(names)) ]
IndexError: list index out of range
The csv file 'url_list.csv' contains the following two lines:
https://groceries.aldi.co.uk/en-GB/bakery
https://groceries.aldi.co.uk/en-GB/fresh-food
Here is my code:
# from weakref import proxy
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from selenium.webdriver.common.by import By
import time
from fp.fp import FreeProxy
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-logging"])
options.add_argument('--disable-blink-features=AutomationControlled')
with open('url_list.csv') as f:
    cat_urls = [line.strip() for line in f]
name_list=[]
price_list=[]
href_list=[]
id_list= []
prod_info= []
proxy = FreeProxy(rand=True, country_id=['GB']).get()
def scraper(cat_urls):
    for cat_url in cat_urls:
        options.add_argument('--proxy-server=%s' % proxy)
        driver = webdriver.Chrome(options=options)
        driver.get(cat_url)
        driver.find_element('xpath','//*[@id="onetrust-accept-btn-handler"]').click()
        time.sleep(5)
        names = driver.find_elements(By.CLASS_NAME, 'product-tile-text.text-center.px-3.mb-3')
        def name_lister(names, name_list):
            for name in names:
                text_name = name.text
                name_list.append(text_name)
            return name_list
        name_lister(names, name_list)
        prices = driver.find_elements(By.CSS_SELECTOR, 'span.h4')
        def price_lister(prices, price_list):
            for price in prices:
                text_price = price.text.strip('£')
                price_list.append(text_price)
            return price_list
        price_lister(prices, price_list)
        hrefs = driver.find_elements(By.CLASS_NAME, 'p.text-default-font')
        def href_lister(hrefs, href_list):
            for href in hrefs:
                text_id = href.get_attribute('href')
                href_list.append(text_id)
            return href_list
        href_lister(hrefs, href_list)
        def id_splitter(href_list, id_list):
            for href in href_list:
                if href is not None:
                    id = href[-13:]
                else:
                    id = ""
                id_list.append(id)
            return id_list
        id_splitter(href_list, id_list)
        # prod_info = [ {'product_id': id_list[i], 'product_name': name_list[i], 'price': price_list[i] } for i in range(len(name_list)) ]
        temp_prod_info = [ {'product_id': id_list[i], 'product_name': name_list[i], 'price': price_list[i] } for i in range(len(names)) ]
        prod_info.append(temp_prod_info)
    return prod_info
scraper(cat_urls)
df = pd.DataFrame.from_dict(prod_info, orient='columns')
df.to_csv("scrape_data.csv")
I have commented out my original code for building the prod_info dictionary, and I have also tried a few other versions of the prod_info section at the bottom of the function, but nothing seems to work.
I would appreciate it if someone could offer advice on how to correct this, as I cannot see a way forward. Thanks a lot in advance.
CodePudding user response:
The logic looks sound; however, one or more of the names, prices and/or hrefs cannot be found on the page for every link you are trying to scrape. I would recommend printing these, as KunduK suggested, and looking for the list that is too short.
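For instance, a quick length check after each category page (a rough sketch reusing the variable names from your code) will show which list is coming up short:

print('names:', len(name_list), 'prices:', len(price_list), 'hrefs:', len(href_list), 'ids:', len(id_list))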
Also, I would advise implementing some try: ... except: logic here before running this in a large-scale, automated way.
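Something along these lines (an untested sketch wrapped around your existing per-product assembly) would keep one incomplete product from aborting the whole page:

temp_prod_info = []
for i in range(len(names)):
    try:
        temp_prod_info.append({
            'product_id': id_list[i],
            'product_name': name_list[i],
            'price': price_list[i],
        })
    except IndexError:
        # one of the lists is shorter than names for this page; skip this product and log it
        print(f'skipping product {i}: incomplete data')
prod_info.append(temp_prod_info)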