I am trying to scrape something from this site, working in the scrapy shell: https://www.tripadvisor.co.uk/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html
The page contains the following markup, and I want to get the href attribute of each of these three a-elements:
<div class="fvqxY f dlzPP">
<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"
href="http://www.blackpoolzoo.org.uk"><span class="WlYyy cacGK Wb">Visit website</span><svg viewBox="0 0 24 24"
width="16px" height="16px" class="fecdL d Vb wQMPa">
<path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path>
</svg></a></div>
<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
href="tel:+44 1253 830830"><span class="WlYyy cacGK Wb">Call</span></a></div>
<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
href="mailto:[email protected]"><span class="WlYyy cacGK Wb">Email</span></a></div>
</div>
I tried it with this XPath, which works fine for me in the Chrome inspector, but I only get an empty result:
>>> response.xpath("//div[@class='Lvkmj']//ancestor::a/@href")
[]
I also checked the first div with class "Lvkmj" and got this result:
>>> response.xpath("//div[@class='Lvkmj']").get()
'<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"><span class="WlYyy cacGK Wb">Visit website</span><svg viewbox="0 0 24 24" width="16px" height="16px" class="fecdL d Vb wQMPa"><path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path></svg></a></div>'
>>>
There I realized that at first glance it is the whole div element, but then I saw that it looks exactly the same as in the inspector, except that for whatever reason the href attribute is missing.
Why is the href attribute missing when using the scrapy shell in this case?
Below you can find the full code:
import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['www.tripadvisor.co.uk']
    start_urls = [
        "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html",
        "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
    ]

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            link = response.urljoin(elem.xpath(".//a/@href").get())
            yield response.follow(link, callback=self.parseDetails)

    def parseDetails(self, response):
        tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()
        tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href").getall()
        tmpErg = response.xpath("//div[@class='dlzPP']//ancestor::div[@class='WlYyy diXIH dDKKM']/text()").getall()
        yield {
            "cat": tmpErg[1],
            "link": tmpLink,
            "name": tmpName,
        }
CodePudding user response:
Your XPath:
//div[@class='Lvkmj']//ancestor::a/@href
shows results in the inspector because the second //
tells the XPath engine: find any descendant node of the current node, and ancestor::a
then tells it: find any ancestor element named a. Because the a elements do have descendants, your XPath gives results. BUT there is a much better way. Just use:
//div[@class='Lvkmj']/a/@href
Here /a
means: just give me the direct child
named a
of the div[@class='Lvkmj'].
But this does not solve your problem.
Your question was: Why is the href attribute missing when using the scrapy shell in that case?
Because scrapy only uses the original source of the document, not the DOM after it has been updated by JavaScript.
And even if it did use the updated DOM, your line
tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href").getall()
returns a list of strings. So you have to loop through the results, or, if you are only interested in the first result, use:
tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").get()
CodePudding user response:
Because of the answer of @Siebe Jongebloed (no results, since some JavaScript DOM changes seem to happen), I tried scrapy_selenium to get the data.
So I changed the code to this:
import scrapy
from shutil import which
from scrapy_selenium import SeleniumRequest

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = r"C:\Users\Polzi\Documents\DEV\Python-Private\chromedriver.exe"
SELENIUM_DRIVER_ARGUMENTS = ['--headless', "--no-sandbox", "--disable-gpu"]
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['www.tripadvisor.co.uk']
    start_urls = [
        "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
    ]
    existList = []

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            link = response.urljoin(elem.xpath(".//a/@href").get())
            yield SeleniumRequest(url=link, callback=self.parseDetails)

    def parseDetails(self, response):
        tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()
        tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()
        yield {
            "name": tmpName,
            "HREFs": tmpLink
        }
But the HREFs result list is still empty...
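If the href attributes are injected by JavaScript after the initial render, Selenium may be handing back the page source before that script has run. One untested idea, assuming the installed scrapy-selenium version supports the wait_time/wait_until parameters of SeleniumRequest, is to replace the yield inside parse() with an explicit wait for the links (the CSS selector div.Lvkmj a is taken from the markup in the question):

```python
# Fragment, not a full spider: wait up to 10 s for the links to appear
# before the response is captured.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url=link,
    callback=self.parseDetails,
    wait_time=10,
    wait_until=EC.presence_of_element_located((By.CSS_SELECTOR, "div.Lvkmj a")),
)
```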