Scrapy / Xpath not working to get href-element?

Time:11-18

I am trying to scrape something from this site, working in the Scrapy shell: https://www.tripadvisor.co.uk/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html

The site contains the following part of the markup, and I want to get the href attributes of all three of these a-elements:

<div class="fvqxY f dlzPP">
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"
      href="http://www.blackpoolzoo.org.uk"><span class="WlYyy cacGK Wb">Visit website</span><svg viewBox="0 0 24 24"
        width="16px" height="16px" class="fecdL d Vb wQMPa">
        <path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path>
      </svg></a></div>
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
      href="tel:+44 1253 830830"><span class="WlYyy cacGK Wb">Call</span></a></div>
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
      href="mailto:[email protected]"><span class="WlYyy cacGK Wb">Email</span></a></div>
</div>

I tried it with this XPath, which works fine for me in the Chrome inspector, but I only get an empty result:

>>> response.xpath("//div[@class='Lvkmj']//ancestor::a/@href") 
[] 

I also checked the first div with class="Lvkmj" and got this result:

>>> response.xpath("//div[@class='Lvkmj']").get()
'<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"><span class="WlYyy cacGK Wb">Visit website</span><svg viewbox="0 0 24 24" width="16px" height="16px" class="fecdL d Vb wQMPa"><path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path></svg></a></div>'
>>>

At first glance this looked like the whole div-element, but then I saw that it is almost exactly the same as in the inspector, except that, for whatever reason, the href attribute is missing.

Why is the href attribute missing when using the Scrapy shell in this case?

Below you can find the full code:

import scrapy

class ZoosSpider(scrapy.Spider):
  name = 'zoos'
  allowed_domains = ['www.tripadvisor.co.uk']
  start_urls = [
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html",
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
              ]

  def parse(self, response):
    tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
    for elem in tmpSEC:
      link = response.urljoin(elem.xpath(".//a/@href").get())   
      yield response.follow(link, callback=self.parseDetails)             

  def parseDetails(self, response):
    tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()  
    tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href ").getall()
    tmpErg = response.xpath("//div[@class='dlzPP']//ancestor::div[@class='WlYyy diXIH dDKKM']/text()").getall()
    
    yield {
      "cat": tmpErg[1],
      "link": tmpLink,
      "name": tmpName,
    }

CodePudding user response:

Your XPath:

//div[@class='Lvkmj']//ancestor::a/@href

does show results in the browser, because the second // tells the XPath engine: find any descendant node of the current node, and ancestor::a then tells the engine: find any ancestor element named a. And because the a-elements do have descendants, your XPath gives results. BUT there is a much better way. Just use:

//div[@class='Lvkmj']/a/@href

/a means: give me only the direct children named a of the div[@class='Lvkmj'].

But this does not solve your problem.

Your question: Why is the href attribute missing when using the Scrapy shell in this case?

Because, I think, Scrapy only sees the raw source of the document, not the DOM after it has been updated by JavaScript.

And even if it did use the updated DOM, your line

tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href ").getall()

returns a list of strings. So you have to loop through the results, or, if you are only interested in the first result, use:

tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href ").get()

CodePudding user response:

Following the answer from @Siebe Jongebloed (no results, since some JavaScript DOM changes seem to happen), I tried scrapy_selenium to get the data.

So I changed the code to this:

import scrapy
from shutil import which

from scrapy_selenium import SeleniumRequest

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = r"C:\Users\Polzi\Documents\DEV\Python-Private\chromedriver.exe"
SELENIUM_DRIVER_ARGUMENTS=['--headless', "--no-sandbox", "--disable-gpu"]
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

class ZoosSpider(scrapy.Spider):
  name = 'zoos'
  allowed_domains = ['www.tripadvisor.co.uk']
  start_urls = [
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
                ]  
  existList = []  

  def parse(self, response):
    tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
    for elem in tmpSEC:
      link = response.urljoin(elem.xpath(".//a/@href").get())   
      yield SeleniumRequest(url=link, callback=self.parseDetails)  

  def parseDetails(self, response):
    tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()  
    tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()    
    
    yield {
      "name": tmpName ,
      "HREFs": tmpLink
    }

But the HREFs-result list is still empty...
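One thing worth checking: plain module-level variables in a spider file are not picked up by Scrapy as settings, so the SeleniumMiddleware may never have been activated. They would have to go into settings.py or into the spider's custom_settings. A sketch of that change, plus an explicit wait so the page is only parsed once the contact links exist (the CSS selector for the wait condition is an assumption on my part):

```python
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class ZoosSpider(scrapy.Spider):
  name = 'zoos'
  allowed_domains = ['www.tripadvisor.co.uk']
  start_urls = [
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
                ]

  # Module-level assignments are ignored by Scrapy; settings must live
  # in settings.py or in custom_settings on the spider class.
  custom_settings = {
      "SELENIUM_DRIVER_NAME": "chrome",
      "SELENIUM_DRIVER_EXECUTABLE_PATH": r"C:\Users\Polzi\Documents\DEV\Python-Private\chromedriver.exe",
      "SELENIUM_DRIVER_ARGUMENTS": ["--headless", "--no-sandbox", "--disable-gpu"],
      "DOWNLOADER_MIDDLEWARES": {"scrapy_selenium.SeleniumMiddleware": 800},
  }

  def parse(self, response):
    tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
    for elem in tmpSEC:
      link = response.urljoin(elem.xpath(".//a/@href").get())
      # wait_time / wait_until are scrapy_selenium's hooks into Selenium's
      # explicit waits; the selector assumes the rendered links look like
      # the inspector markup from the question
      yield SeleniumRequest(
          url=link,
          callback=self.parseDetails,
          wait_time=10,
          wait_until=EC.presence_of_element_located((By.CSS_SELECTOR, "div.Lvkmj a[href]")),
      )

  def parseDetails(self, response):
    tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()
    tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()

    yield {
      "name": tmpName,
      "HREFs": tmpLink
    }
```

This is only a sketch of the configuration, not a tested fix; whether the links ever appear in the rendered DOM for these pages is a separate question.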
