I want to create a web crawler using scrapy-selenium. It scrapes the first page, but when it moves to the second page it throws KeyError: 'driver'. Is there any solution for this?
This is the page link: https://barreau-montpellier.com/annuaire-professionnel/?cn-s My code looks like this:
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class TestSpider(scrapy.Spider):
    name = 'test'
    page_number = 1

    def start_requests(self):
        yield SeleniumRequest(url='https://barreau-montpellier.com/annuaire-professionnel/?cn-s=', callback=self.parse)

    def parse(self, response):
        driver = response.meta['driver']
        r = Selector(text=driver.page_source)
        details = r.xpath("//div[@class='cn-entry cn-background-gradient']")
        for detail in details:
            email = detail.xpath(".//span[@class='email cn-email-address']//a//@href").get()
            try:
                email = email.replace("mailto:", "")
            except:
                email = ''
            n1 = detail.xpath(".//span[@class='given-name']//text()").get()
            n2 = detail.xpath(".//span[@class='family-name']//text()").get()
            name = n1 + n2
            telephone = detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workphone']//a//text()").get()
            fax = detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workfax']//a//text()").get()
            street = detail.xpath(".//span[@class='adr cn-address']//span[@class='street-address notranslate']//text()").get()
            locality = detail.xpath(".//span[@class='adr cn-address']//span[@class='locality']//text()").get()
            code = detail.xpath(".//span[@class='adr cn-address']//span[@class='postal-code']//text()").get()
            address = street + locality + code
            yield {
                'name': name,
                'mail': email,
                'telephone': telephone,
                'Fax': fax,
                'address': address
            }
        next_page = 'https://barreau-montpellier.com/annuaire-professionnel/pg/' + str(TestSpider.page_number) + '/?cn-s'
        if TestSpider.page_number <= 155:
            TestSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)
In settings.py I have added this:
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('C:\Program Files (x86)\chromedriver.exe')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
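A note on those settings: shutil.which() returns that full path only if chromedriver.exe really exists there and is executable; otherwise SELENIUM_DRIVER_EXECUTABLE_PATH silently becomes None. A short sketch of two equivalent ways to set it (assuming chromedriver sits at that location or is on PATH):

from shutil import which

# Option 1: point at the executable directly (a raw string avoids backslash-escape warnings)
SELENIUM_DRIVER_EXECUTABLE_PATH = r'C:\Program Files (x86)\chromedriver.exe'
# Option 2: let shutil.which() resolve it from PATH
# SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')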
CodePudding user response:
Why are you actually getting KeyError: 'driver'? After running your code a few times I'm fairly sure about the cause. Have you ever tested the spider without the pagination part? I also got KeyError: 'driver', but as soon as I removed the pagination the error disappeared. The problem is the next-page request: response.follow() yields a plain scrapy.Request, which SeleniumMiddleware ignores, so response.meta never gets a 'driver' key on the second page. I've made the pagination in def start_requests(self) using the range() function instead, yielding one SeleniumRequest per page, and it works fine without any issues; in my run it was also roughly twice as fast as the page-by-page approach, since all pages are scheduled up front.
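If you would rather keep your counter-based pagination inside parse(), the key change is to yield a SeleniumRequest for the next page instead of response.follow(), so the middleware injects the driver again. A minimal sketch of that alternative (untested here, reusing your existing spider attributes; the item extraction stays exactly as in your code):

    def parse(self, response):
        driver = response.meta['driver']
        # ... item extraction exactly as in your original parse() ...
        if TestSpider.page_number <= 155:
            TestSpider.page_number += 1  # page 1 is already covered by the start URL
            next_page = ('https://barreau-montpellier.com/annuaire-professionnel/pg/'
                         + str(TestSpider.page_number) + '/?cn-s')
            # SeleniumRequest (not response.follow) so SeleniumMiddleware adds 'driver' to meta
            yield SeleniumRequest(url=next_page, callback=self.parse, wait_time=3)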
Full working code:
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class TestSpider(scrapy.Spider):
    name = 'test'
    page_number = 1

    def start_requests(self):
        urls = ['https://barreau-montpellier.com/annuaire-professionnel/pg/' + str(x) + '/?cn-s' for x in range(1, 156)]
        for url in urls:
            yield SeleniumRequest(
                url=url,
                callback=self.parse,
                wait_time=3)

    def parse(self, response):
        driver = response.meta['driver']
        r = Selector(text=driver.page_source)
        details = r.xpath(
            "//div[@class='cn-entry cn-background-gradient']")
        for detail in details:
            email = detail.xpath(
                ".//span[@class='email cn-email-address']//a//@href").get()
            try:
                email = email.replace("mailto:", "")
            except:
                email = ''
            n1 = detail.xpath(".//span[@class='given-name']//text()").get()
            n2 = detail.xpath(
                ".//span[@class='family-name']//text()").get()
            name = n1 + n2
            telephone = detail.xpath(
                ".//span[@class='tel cn-phone-number cn-phone-number-type-workphone']//a//text()").get()
            fax = detail.xpath(
                ".//span[@class='tel cn-phone-number cn-phone-number-type-workfax']//a//text()").get()
            street = detail.xpath(
                ".//span[@class='adr cn-address']//span[@class='street-address notranslate']//text()").get()
            locality = detail.xpath(
                ".//span[@class='adr cn-address']//span[@class='locality']//text()").get()
            code = detail.xpath(
                ".//span[@class='adr cn-address']//span[@class='postal-code']//text()").get()
            address = street + locality + code
            yield {
                'name': name,
                'mail': email,
                'telephone': telephone,
                'Fax': fax,
                'address': address
            }
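Assuming the spider lives in a standard Scrapy project, it can be run and its items exported with something like:

scrapy crawl test -o results.csv

(the file name is arbitrary; -o appends to an existing file).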
Output:
{'name': 'CharlesZWILLER', 'mail': '[email protected]', 'telephone': '04 67 60 24 56', 'Fax': '04
67 60 00 58', 'address': '24 Bd du Jeu de PaumeMONTPELLIER34000'}
2022-08-15 11:56:31 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-15 11:56:31 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://localhost:51142/session/da80a3907e6e6e78f9356f20bf4103be
2022-08-15 11:56:31 [selenium.webdriver.remote.remote_connection] DEBUG: Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2022-08-15 11:56:31 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2022-08-15 11:56:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 29687144,
'downloader/response_count': 155,
'downloader/response_status_count/200': 155,
'elapsed_time_seconds': 2230.899805,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 8, 15, 18, 56, 31, 850294),
'item_scraped_count': 1219,
'log_count/DEBUG': 3864,
'log_count/INFO': 37,
'response_received_count': 155,
'scheduler/dequeued': 155,