I am trying to scrape the following site with Scrapy and am experimenting in the Scrapy shell.
This is the basic spider:
import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['www.tripadvisor.co.uk']
    # the generated 'http://https://' prefix is not a valid scheme
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            pass
This XPath returns all the relevant sections (len(tmpSEC) gives 30, which looks right to me):
tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
Now I want to extract the first href inside the first section, but this XPath only returns "/":
>>> tmpSEC[0].xpath("//a/@href").get()
'/'
The same happens with:
>>> tmpSEC[0].xpath("(//a)[1]/@href").get()
'/'
Only the CSS selector returns the expected link:
>>> tmpSEC[0].css("a::attr(href)").get()
'/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html'
Why does this work with a CSS selector but not with an XPath selector?
CodePudding user response:
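Here is a working solution using XPath. An expression that starts with // is absolute: it searches the whole document from the root, even when you call it on a sub-selector such as tmpSEC[0], so it returns the first link on the page ("/"). Prefix the expression with a dot (.//a) to make it relative to the current node. CSS selectors are always evaluated relative to the element they are called on, which is why the CSS version already worked. The corrected spider looks as follows: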
import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html/']

    def parse(self, response):
        tmpSEC = response.xpath(
            "//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        #for elem in tmpSEC:
        yield {
            'link': tmpSEC[0].xpath(".//a/@href").get()
        }
Output:
{'link': '/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html'}