Scrapy xpath not working - only in combination with css-selector?

Time:11-17

I am trying to scrape the following site with Scrapy and am testing selectors in the Scrapy shell.

This is the basic spider:

import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html/']

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            pass

This XPath returns all the relevant sections (len(tmpSEC) gives 30, which looks right to me):

tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")

Now I want to extract the first href attribute inside the first section. I tried this XPath, but it only returns "/":

>>> tmpSEC[0].xpath("//a/@href").get()  
'/'

I also tried:

>>> tmpSEC[0].xpath("(//a)[1]/@href").get()  
'/'

Only the CSS selector returns the expected result:

>>> tmpSEC[0].css("a::attr(href)").get() 
'/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html'

Why does this work with a CSS selector but not with an XPath selector?

Answer:

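Here is a working solution using XPath. An expression that starts with // is evaluated from the document root even when it is called on a sub-selector such as tmpSEC[0], so it returns the first link on the whole page ("/"). Prefix the expression with a dot (.//) to make it relative to the current section: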

import scrapy


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html/']

    def parse(self, response):
        tmpSEC = response.xpath(
            "//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        # for elem in tmpSEC:
        yield {
            # The leading dot makes the XPath relative to tmpSEC[0]
            'link': tmpSEC[0].xpath(".//a/@href").get()
        }

Output:

{'link': '/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html'}