Trying to scrape email

I am trying to scrape the email address from this page, but my spider returns None. Page link: https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry

I checked the page's HTML in the browser's Network tab, and the email address does not exist in the HTML source:

<div ><p>Contacter par email : <span id="cloak65106">Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.</span><script type='text/javascript'>

Code:

import scrapy
from scrapy.http import Request

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry']
    page_number = 1

    def parse(self, response):
        mail = response.xpath("//span//a[starts-with(@href, 'mailto')]/@href").get()
        yield {
            'email': mail
        }

Answer:

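The page is static except for the email address, which JavaScript inserts into the cloak span after the page loads. Scrapy does not execute JavaScript, so the mailto link is missing from the response and your XPath returns None. To grab the email, render the page in a browser by using Scrapy with SeleniumRequest from the scrapy-selenium package: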

import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest

class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # SeleniumRequest loads the page in a real browser, so the cloak script runs
        yield SeleniumRequest(url='https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry', callback=self.parse)

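    def parse(self, response):
        # scrapy-selenium exposes the browser instance in response.meta
        driver = response.meta['driver']
        r = Selector(text=driver.page_source)
        # Assumption: the email link sits inside the cloak span shown in the question
        link = r.xpath("//span[starts-with(@id, 'cloak')]/a[starts-with(@href, 'mailto:')]")
        yield {
            'mail_link': link.xpath('@href').get(),
            'mail': link.xpath('text()').get(),
        }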

Output:

{'mail_link': 'mailto:[email protected]', 'mail': '[email protected]'}
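
For comparison, here is a minimal standard-library sketch of why the original spider got None while the Selenium version finds a link. STATIC_HTML is the placeholder markup quoted in the question (script body left out); RENDERED_HTML and the address jane.doe@example.com are hypothetical, made up only to show what the span might look like once the script has run. MailtoCollector and find_mailto_links are illustrative names, not part of Scrapy.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect href values of <a> tags that start with 'mailto:'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                self.links.append(value)


def find_mailto_links(html):
    """Return every mailto: href found in the given HTML string."""
    parser = MailtoCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# What plain Scrapy downloads: only the placeholder text is present.
STATIC_HTML = (
    '<div><p>Contacter par email : <span id="cloak65106">'
    "Cette adresse e-mail est protégée contre les robots spammeurs."
    "</span></p></div>"
)

# Hypothetical markup after the cloak script has run in a browser.
RENDERED_HTML = (
    '<div><p>Contacter par email : <span id="cloak65106">'
    '<a href="mailto:jane.doe@example.com">jane.doe@example.com</a>'
    "</span></p></div>"
)

print(find_mailto_links(STATIC_HTML))    # []
print(find_mailto_links(RENDERED_HTML))  # ['mailto:jane.doe@example.com']
```

The static markup has no anchor tag at all, so no selector can find a mailto link in it; the link only exists in the browser-rendered source that driver.page_source returns.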

Add the following to your settings.py file (scrapy-selenium must be installed, for example with pip install scrapy-selenium):

# Middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

# Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# '--headless' if using chrome instead of firefox
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
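
One thing that may trip you up with these settings: which('chromedriver') returns None when the driver is not on your PATH, and the failure then only shows up later when the middleware starts. A small sketch of a stricter lookup, assuming you would rather fail early; require_executable is a hypothetical helper name, not part of Scrapy or scrapy-selenium.

```python
from shutil import which


def require_executable(name):
    """Return the full path of an executable, or raise if it is not on PATH."""
    path = which(name)
    if path is None:
        raise FileNotFoundError(f"{name!r} was not found on PATH")
    return path


# Possible use in settings.py (raises at startup if chromedriver is missing):
# SELENIUM_DRIVER_EXECUTABLE_PATH = require_executable('chromedriver')
```

If the lookup raises, install chromedriver or add its folder to PATH before running the spider.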