I have been attempting to use Selenium to scrape product information, and have successfully scraped the information I need from a single category page containing 40 products.
I plan to find a way to automatically collect the hrefs for all category pages and then scrape each one. I understand that Scrapy-Playwright is probably a better solution, but I have investigated that option and there is a reason I cannot use it.
Before I figure out a way of scraping multiple categories automatically, I am trying to build a function that loops through multiple category pages that I provide in a csv file, to test that it works.
When attempting to do so I repeatedly receive the error:
Traceback (most recent call last):
  File "h:\Python\aldi\aldi\selenium_test2.py", line 76, in <module>
    scraper(cat_urls)
  File "h:\Python\aldi\aldi\selenium_test2.py", line 71, in scraper
    temp_prod_info = [ {'product_id': id_list[i], 'product_name': name_list[i], 'price': price_list[i] } for i in range(len(names)) ]
  File "h:\Python\aldi\aldi\selenium_test2.py", line 71, in <listcomp>
    temp_prod_info = [ {'product_id': id_list[i], 'product_name': name_list[i], 'price': price_list[i] } for i in range(len(names)) ]
IndexError: list index out of range
The csv file 'url_list.csv' contains the following two lines:
https://groceries.aldi.co.uk/en-GB/bakery
https://groceries.aldi.co.uk/en-GB/fresh-food
Here is my code:
# from weakref import proxy
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from selenium.webdriver.common.by import By
import time
from fp.fp import FreeProxy
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-logging"])
options.add_argument('--disable-blink-features=AutomationControlled')
with open('url_list.csv') as f:
    cat_urls = [line.strip() for line in f]
name_list=[]
price_list=[]
href_list=[]
id_list= []
prod_info= []
proxy = FreeProxy(rand=True, country_id=['GB']).get()
def scraper(cat_urls):
    for cat_url in cat_urls:
        options.add_argument('--proxy-server=%s' % proxy)
        driver = webdriver.Chrome(options=options)
        driver.get(cat_url)
        driver.find_element('xpath','//*[@id="onetrust-accept-btn-handler"]').click()
        time.sleep(5)
        names = driver.find_elements(By.CLASS_NAME, 'product-tile-text.text-center.px-3.mb-3')
        def name_lister(names, name_list):
            for name in names:
                text_name = name.text
                name_list.append(text_name)
            return name_list
        name_lister(names, name_list)
        prices = driver.find_elements(By.CSS_SELECTOR, 'span.h4')
        def price_lister(prices, price_list):
            for price in prices:
                text_price = price.text.strip('£')
                price_list.append(text_price)
            return price_list
        price_lister(prices, price_list)
        hrefs = driver.find_elements(By.CLASS_NAME, 'p.text-default-font')
        def href_lister(hrefs, href_list):
            for href in hrefs:
                text_id = href.get_attribute('href')
                href_list.append(text_id)
            return href_list
        href_lister(hrefs, href_list)
        def id_splitter(href_list, id_list):
            for href in href_list:
                if href is not None:
                    id = href[-13:]
                else:
                    id = ""
                id_list.append(id)
            return id_list
        id_splitter(href_list, id_list)
        # prod_info = [ {'product_id': id_list[i], 'product_name': name_list[i], 'price': price_list[i] } for i in range(len(name_list)) ]
        temp_prod_info = [ {'product_id': id_list[i], 'product_name': name_list[i], 'price': price_list[i] } for i in range(len(names)) ]
        prod_info.append(temp_prod_info)
    return prod_info
scraper(cat_urls)
df = pd.DataFrame.from_dict(prod_info, orient='columns')
df.to_csv("scrape_data.csv")
I have commented out my original code for building the prod_info dictionary, and I have also tried a few other versions of the prod_info section at the bottom of the function, but nothing seems to work.
I would appreciate it if someone could offer advice on how to correct this, as I cannot see a way forward. Thanks a lot in advance.
CodePudding user response:
The logic looks sound; however, one or more of the names, prices and/or hrefs cannot be found on the page for every link you are trying to scrape. I would recommend printing these, as KunduK suggested, and looking for the list that is too short.
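For instance, a quick length check after each category page (a rough sketch reusing the variable names from your code) will show which list is coming up short:

print('names:', len(name_list), 'prices:', len(price_list), 'hrefs:', len(href_list), 'ids:', len(id_list))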
Also, I would advise implementing some try: ... except: logic here before running this in a large-scale, automated way.
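Something along these lines (an untested sketch wrapped around your existing per-product assembly) would keep one incomplete product from aborting the whole page:

temp_prod_info = []
for i in range(len(names)):
    try:
        temp_prod_info.append({
            'product_id': id_list[i],
            'product_name': name_list[i],
            'price': price_list[i],
        })
    except IndexError:
        # one of the lists is shorter than names for this page; skip this product and log it
        print(f'skipping product {i}: incomplete data')
prod_info.append(temp_prod_info)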