KeyError: 'driver' scrapy and selenium together

Time:08-16

The spider scrapes the first page, but when it moves to the second page it raises KeyError: 'driver'. Is there any solution for this? I want to create a web crawler using scrapy-selenium. This is the page link: https://barreau-montpellier.com/annuaire-professionnel/?cn-s My code looks like this:

    import scrapy
    from scrapy import Selector
    from scrapy_selenium import SeleniumRequest
    
    class TestSpider(scrapy.Spider):
        name = 'test'
        page_number=1
        
        def start_requests(self):
          yield SeleniumRequest(url='https://barreau-montpellier.com/annuaire-professionnel/?cn-s=',callback=self.parse)
        
    
    
        def parse(self, response):
            driver=response.meta['driver']
            r = Selector(text=driver.page_source)
        
            details=r.xpath("//div[@class='cn-entry cn-background-gradient']")
            for detail in details:
                email=detail.xpath(".//span[@class='email cn-email-address']//a//@href").get()
                try:
                    email=email.replace("mailto:","")
                except:
                    email=''
                
                n1=detail.xpath(".//span[@class='given-name']//text()").get()
                n2=detail.xpath(".//span[@class='family-name']//text()").get()
                name=n1 + n2
                
                
                telephone=detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workphone']//a//text()").get()
                
                
                fax=detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workfax']//a//text()").get()
    
                
                street=detail.xpath(".//span[@class='adr cn-address']//span[@class='street-address notranslate']//text()").get()
                locality=detail.xpath(".//span[@class='adr cn-address']//span[@class='locality']//text()").get()
                code=detail.xpath(".//span[@class='adr cn-address']//span[@class='postal-code']//text()").get()
                address=street + locality + code
                
                yield{
                    'name':name,
                    'mail':email,
                    'telephone':telephone,
                    'Fax':fax,
                    'address':address
                }
                next_page = 'https://barreau-montpellier.com/annuaire-professionnel/pg/' + str(TestSpider.page_number) + '/?cn-s'
                if TestSpider.page_number<=155:
                    TestSpider.page_number += 1
                    yield response.follow(next_page, callback = self.parse,)

In settings.py I have added the following:

from shutil import which
  
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which(r'C:\Program Files (x86)\chromedriver.exe')  # raw string so the backslashes are taken literally
SELENIUM_DRIVER_ARGUMENTS=['--headless']  
  
DOWNLOADER_MIDDLEWARES = {
     'scrapy_selenium.SeleniumMiddleware': 800
     }

CodePudding user response:

Why are you getting KeyError: 'driver'? After testing your code several times, I'm fairly sure about the cause. Have you tried running your code without the pagination part? I also got KeyError: 'driver', but once I removed the pagination part the error disappeared. So the error comes from the way the next-page requests are made. I moved the pagination into start_requests() using range(), and it works without any issues. In my testing this kind of pagination was also about twice as fast.
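A likely explanation, as far as I can tell from how scrapy-selenium works: its middleware only handles SeleniumRequest objects, while response.follow() creates a plain scrapy Request, so the response for the next page never gets a 'driver' entry in meta. The sketch below is a toy model of that behaviour in plain Python. PlainRequest, BrowserRequest, browser_middleware and the example.com URLs are made-up stand-ins for illustration, not the real Scrapy or scrapy-selenium classes.

```python
class PlainRequest:
    """Stand-in for an ordinary request: carries a URL and a meta dict."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


class BrowserRequest(PlainRequest):
    """Stand-in for a request that should be rendered by the browser."""


def browser_middleware(request):
    """Toy middleware: only browser requests get a 'driver' in meta."""
    if not isinstance(request, BrowserRequest):
        return request  # passed through untouched, so no 'driver' key
    request.meta["driver"] = "<webdriver instance>"
    return request


def parse(request):
    """Mimics the spider's parse(): it assumes meta['driver'] exists."""
    return request.meta["driver"]


# First page: requested as a browser request, so parse() works.
first = browser_middleware(BrowserRequest("https://example.com/?page=1"))
print(parse(first))

# Next page: requested as a plain request, so parse() fails.
follow_up = browser_middleware(PlainRequest("https://example.com/?page=2"))
try:
    parse(follow_up)
except KeyError as exc:
    print("KeyError:", exc)
```

If that is what happens here, yielding SeleniumRequest(url=next_page, callback=self.parse) from parse() instead of response.follow(...) should also avoid the error, although I have only tested the start_requests() variant.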

Full working code:

import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest

class TestSpider(scrapy.Spider):
    name = 'test'
    page_number = 1

    def start_requests(self):
        urls = ['https://barreau-montpellier.com/annuaire-professionnel/pg/' + str(x) + '/?cn-s' for x in range(1,156)]
        for url in urls:
            yield SeleniumRequest(
                url= url,
                callback=self.parse,
                wait_time=3)

    def parse(self, response):

        driver = response.meta['driver']
        r = Selector(text=driver.page_source)

        details = r.xpath(
                "//div[@class='cn-entry cn-background-gradient']")
        for detail in details:
            email = detail.xpath(
                    ".//span[@class='email cn-email-address']//a//@href").get()
            try:
                email = email.replace("mailto:", "")
            except AttributeError:
                # .get() returned None: this entry has no email link
                email = ''

            n1 = detail.xpath(".//span[@class='given-name']//text()").get()
            n2 = detail.xpath(
                    ".//span[@class='family-name']//text()").get()
            name = n1 + n2

            telephone = detail.xpath(
                    ".//span[@class='tel cn-phone-number cn-phone-number-type-workphone']//a//text()").get()

            fax = detail.xpath(
                    ".//span[@class='tel cn-phone-number cn-phone-number-type-workfax']//a//text()").get()

            street = detail.xpath(
                    ".//span[@class='adr cn-address']//span[@class='street-address notranslate']//text()").get()
            locality = detail.xpath(
                    ".//span[@class='adr cn-address']//span[@class='locality']//text()").get()
            code = detail.xpath(
                    ".//span[@class='adr cn-address']//span[@class='postal-code']//text()").get()
            address = street + locality + code

            yield {
                'name': name,
                'mail': email,
                'telephone': telephone,
                'Fax': fax,
                'address': address
            }
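One thing to watch in the spider above: .get() returns None when an XPath matches nothing, so joining the name or address parts directly fails with a TypeError for any entry that lacks one of those fields. Below is a minimal sketch of None-safe helpers. join_fields and strip_mailto are hypothetical names I made up for illustration, and the sample values are placeholders rather than scraped data.

```python
def join_fields(*parts, sep=" "):
    """Join the non-empty parts, skipping None and blank strings."""
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return sep.join(cleaned)


def strip_mailto(href):
    """Return the address from a 'mailto:' href, or '' if there is none."""
    if not href:
        return ""
    return href.replace("mailto:", "", 1)


print(join_fields("Charles", "ZWILLER"))
print(join_fields("24 Bd du Jeu de Paume", None, "34000"))
print(repr(strip_mailto(None)))
print(strip_mailto("mailto:someone@example.com"))
```

Passing sep="" would reproduce the run-together values seen in the output (for example 'CharlesZWILLER'), and the default separator puts a space between the parts.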

   

Output:

{'name': 'CharlesZWILLER', 'mail': '[email protected]', 'telephone': '04 67 60 24 56', 'Fax': '04 67 60 00 58', 'address': '24 Bd du Jeu de PaumeMONTPELLIER34000'}
2022-08-15 11:56:31 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-15 11:56:31 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://localhost:51142/session/da80a3907e6e6e78f9356f20bf4103be HTTP/1.1" 200 14
2022-08-15 11:56:31 [selenium.webdriver.remote.remote_connection] DEBUG: Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2022-08-15 11:56:31 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2022-08-15 11:56:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 29687144,
 'downloader/response_count': 155,
 'downloader/response_status_count/200': 155,
 'elapsed_time_seconds': 2230.899805,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 8, 15, 18, 56, 31, 850294),
 'item_scraped_count': 1219,
 'log_count/DEBUG': 3864,
 'log_count/INFO': 37,
 'response_received_count': 155,
 'scheduler/dequeued': 155,