I am trying to scrape an email address, but my spider returns None.
This is the page link: https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry
I opened the Network tab
and checked the HTML of the response,
but the email does not exist in the HTML code:
<div ><p>Contacter par email : <span id="cloak65106">Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.</span><script type='text/javascript'>
Code:
import scrapy
from scrapy.http import Request


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry']
    page_number = 1

    def parse(self, response):
        mail = response.xpath("//span//a[starts-with(@href, 'mailto')]/@href").get()
        yield {
            'email': mail
        }
Answer:
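The page is static except for the email portion, which is filled in by JavaScript after the page loads. That's why you are getting None: the HTML that Scrapy downloads contains only the placeholder span, not the mailto link. To grab the email, you can use Scrapy with SeleniumRequest: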
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry',
            callback=self.parse,
        )

    def parse(self, response):
        # The Selenium driver holds the page after JavaScript has run
        driver = response.meta['driver']
        r = Selector(text=driver.page_source)
        yield {
            'mail_link': r.xpath('//*[@]/following-sibling::div[1]/p/span/a/@href').get(),
            'mail': r.xpath('//*[@]/following-sibling::div[1]/p/span/a/text()').get()
        }
Output:
{'mail_link': 'mailto:[email protected]', 'mail': '[email protected]'}
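The attribute test inside the XPath above is incomplete as it appears here, so here is a minimal sketch of the same idea that anchors on the cloak span from the question instead. It uses only the standard library so it can be run without Scrapy. The class and function names are hypothetical, the static sample is abridged from the question (closing tags added), and the rendered sample is an assumption about what the browser produces once the cloaking script has run.

```python
from html.parser import HTMLParser


class CloakedMailtoParser(HTMLParser):
    """Collect mailto: hrefs from <a> tags nested inside <span id="cloak...">."""

    def __init__(self):
        super().__init__()
        self.span_stack = []  # one bool per open <span>: is it a cloak span?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            self.span_stack.append((attrs.get("id") or "").startswith("cloak"))
        elif tag == "a" and any(self.span_stack):
            href = attrs.get("href") or ""
            if href.startswith("mailto:"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            self.span_stack.pop()


def find_cloaked_mailto(html):
    """Return every mailto: link found inside a cloak span."""
    parser = CloakedMailtoParser()
    parser.feed(html)
    return parser.links


# Static HTML as served (abridged from the question; closing tags added).
STATIC_HTML = (
    '<div><p>Contacter par email : <span id="cloak65106">'
    "Cette adresse e-mail est protégée contre les robots spammeurs."
    "</span></p></div>"
)

# Assumed markup after the browser has run the cloaking script.
RENDERED_HTML = (
    '<div><p>Contacter par email : <span id="cloak65106">'
    '<a href="mailto:someone@example.com">someone@example.com</a>'
    "</span></p></div>"
)

print(find_cloaked_mailto(STATIC_HTML))    # []
print(find_cloaked_mailto(RENDERED_HTML))  # ['mailto:someone@example.com']
```

If the rendered page really has this shape, the equivalent selector in the spider would be something like r.xpath('//span[starts-with(@id, "cloak")]//a/@href').get(), but check it against driver.page_source first.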
You also have to add the following to your settings.py file:
# Middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
# Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# '--headless' if using chrome instead of firefox
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
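One thing to watch for: which('chromedriver') returns None when the binary is not on your PATH, and the middleware would then be handed None as the executable path. A small sketch of a guard you could put in settings.py (the helper name resolve_driver is hypothetical, not part of scrapy_selenium):

```python
from shutil import which


def resolve_driver(name="chromedriver"):
    """Return the path of the driver binary found on PATH, or raise if it is missing."""
    path = which(name)
    if path is None:
        raise RuntimeError(
            f"{name!r} was not found on PATH; install it or set the path explicitly"
        )
    return path


# Possible use in settings.py:
# SELENIUM_DRIVER_EXECUTABLE_PATH = resolve_driver()
```

This way a missing driver fails at startup with a clear message instead of later, inside the first request.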