Currently my 'yield' in my scrapy spider looks as follows :
yield {
'hreflink':mylink,
'Parentlink':response.url
}
This returns me a dict
{
'hreflink':"https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/"
}
Now, I also want the 'text' that is associated with this particular hreflink, in that particular Parentlink. So my final output should look like
{
'hreflink':"https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/",
'Yourtext' : "Download Pricing Info"
}
What would be the simplest way to achieve that. I want to use Xpath expressions to get the "text" in a parentlink where href element = @href .
So far Here is what I tied - Yourtext = response.xpath('//a[@href=' json.dumps(each) ']//text()').get() but its not printing anything. I tried printing my response and it returns the right page - 'https://www.southeasthealth.org/financial-information-price-transparency/'
CodePudding user response:
If I understand you correctly you want to get the text belonging to the link Download Pricing Info
.
I suggest you try using:
response.xpath("//span[@class='fusion-button-text']//text()").get()
CodePudding user response:
I found the answer to my question.
'//a[@href=' json.dumps(each) ']//text()'
This is the correct expression however the href link 'each' is case sensitive and it needs to match exactly for this Xpath to work.