Trying to work with selenium and scrapy both

Time:03-04

I am trying to scrape a dynamic website, but Selenium gives me this error: 'chromedriver' executable needs to be in PATH. How can I solve this problem?

from scrapy import Spider
from scrapy.http import Request
from scrapy.utils.project import get_project_settings
from selenium import webdriver


class AuthorSpider(Spider):
    name = 'pushpa'


    def start_requests(self):
        self.driver = webdriver.Chrome(executable_path='C:/Program Files (x86)/chromedriver')
        driver = webdriver.Chrome(driver_path, options=options)
        driver.get('https://www.lazada.com.ph/shop-laptops/')
        link_elements = driver.find_elements_by_xpath(
            '//*[@data-qa-locator="product-item"]//a[text()]')

        for link in link_elements:
            yield{
                'url':link
            }

CodePudding user response:

executable_path should be set to the absolute path of the chromedriver.exe file itself, not to the folder containing it.
So, if your chromedriver.exe is inside the 'C:/Program Files (x86)/chromedriver' folder, it should be:

self.driver = webdriver.Chrome(executable_path='C:/Program Files (x86)/chromedriver/chromedriver.exe')
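
To guard against passing the folder instead of the file, a small helper along these lines could normalize the path first. This is only a sketch: resolve_chromedriver is a hypothetical name, not part of Selenium, and the default file name chromedriver.exe assumes Windows.

```python
from pathlib import Path


def resolve_chromedriver(path, exe_name="chromedriver.exe"):
    """Return a path to the driver executable.

    Hypothetical helper (not a Selenium API): if `path` is an existing
    folder, append `exe_name`; otherwise return `path` unchanged.
    """
    p = Path(path)
    if p.is_dir():
        p = p / exe_name
    return str(p)


# Possible usage (assumes the folder layout from the question):
# driver_path = resolve_chromedriver('C:/Program Files (x86)/chromedriver')
# self.driver = webdriver.Chrome(executable_path=driver_path)
```

The helper does not check that the resulting file exists, so a wrong folder would still fail when the driver starts.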

Also, I don't understand why you are defining and initializing two driver objects:

self.driver = webdriver.Chrome(executable_path='C:/Program Files (x86)/chromedriver')
driver = webdriver.Chrome(driver_path, options=options)

CodePudding user response:

A cleaner solution is SeleniumRequest from the scrapy-selenium package. It has to be used inside a Scrapy project, because the middleware is enabled in the project's settings.py.

Script:

import scrapy
from scrapy_selenium import SeleniumRequest

class AuthorSpider(scrapy.Spider):
    name = 'pushpa'
    def start_requests(self):
        url='https://www.lazada.com.ph/shop-laptops/'
        yield SeleniumRequest(
                url=url,
                wait_time=5,
                callback=self.parse
                )

    def parse(self, response):
        link_elements = response.xpath('//*[@data-qa-locator="product-item"]//a[text()]/@href').getall()

        for link in link_elements:
            link = f'https:{link}'
            yield {'url': link}
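
The script above prepends 'https:' to each href, which suggests the page returns protocol-relative links (starting with '//'). If some links were instead relative or already absolute, that prefix would produce a broken URL. A slightly more defensive sketch uses urljoin from the standard library; inside a Scrapy callback, response.urljoin(link) does the same job against the response URL. The absolutize name and the example hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# Page URL taken from the spider above.
BASE_URL = 'https://www.lazada.com.ph/shop-laptops/'


def absolutize(href, base_url=BASE_URL):
    """Resolve a protocol-relative, root-relative, or absolute href against base_url."""
    return urljoin(base_url, href)


# Illustrative hrefs, not real product links.
print(absolutize('//www.lazada.com.ph/products/example.html'))
print(absolutize('/products/example.html'))
print(absolutize('https://www.lazada.com.ph/products/example.html'))
```

All three calls print https://www.lazada.com.ph/products/example.html.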

Output:

{'url': 'https://www.lazada.com.ph/products/coreldraw-graphics-suite-x6-dvd-pc-installer-i1733548522-s7464446610.html?search=1'}
2022-03-04 02:47:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.com.ph/shop-laptops/>
{'url': 'https://www.lazada.com.ph/products/laptop-hp-probook-4545s-amd-a4-4300m-4gb-ram-ddr3-250gb-hdd-radeon-hd-graphics-i1208954033-s13141803102.html?search=1'}
2022-03-04 02:47:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.com.ph/shop-laptops/>
{'url': 'https://www.lazada.com.ph/products/gift-monitor-17inlaptop-for-sale-brand-new-9470m9480m-i-laptop-i5-i-light-and-portable-i-14in-i-fourth-generation-processor-i-core-intel-i5-i-16gb-ram-i-480gb-ssd-i-built-in-camera-hdmi-hd-interface-i-suitable-for-online-courses-learni-i2732325355-s13083117290.html?search=1&freeshipping=1'}
2022-03-04 02:47:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.com.ph/shop-laptops/>
{'url': 'https://www.lazada.com.ph/products/acer-predator-helios-300-70bf-ph315-54-70bf-gaming-laptop-144hz-ips-panel-intel-core-i7-11800h-8-cores-rtx-3050ti-16gb-ram-512gb-ssd-pc-central-i2590081219-s12159342297.html?search=1'}
2022-03-04 02:47:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.com.ph/shop-laptops/>
{'url': 'https://www.lazada.com.ph/products/free-air-fryerlaptop-i-l460-i-14in-i-6th-generation-processor-i-core-i5-i-4gb8gb16gb-memory-i-256gb-ssd480gb-ssd-i-compatible-with-windows10-suitable-for-learning-work-online-i2388508967-s10876939835.html?search=1&freeshipping=1'}

... and so on

settings.py file:

Add the following to the project's settings.py file:

# Middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

# Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# '--headless' if using chrome instead of firefox
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
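
Note that which('chromedriver') returns None when chromedriver is not on PATH, which would most likely lead back to the original error. A minimal sketch for failing early with a clearer message is shown below; locate_driver is a hypothetical helper name, not part of scrapy-selenium.

```python
from shutil import which


def locate_driver(name='chromedriver'):
    """Return the full path of `name` found on PATH, or raise if it is missing.

    Hypothetical helper: shutil.which returns None when nothing is found,
    so the failure is turned into an explicit error here.
    """
    path = which(name)
    if path is None:
        raise FileNotFoundError(
            f"'{name}' was not found on PATH; install it or add its folder to PATH"
        )
    return path


# Possible usage in settings.py:
# SELENIUM_DRIVER_EXECUTABLE_PATH = locate_driver('chromedriver')
```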