Get the text associated with an href element in a given page in Scrapy

Currently the 'yield' in my Scrapy spider looks as follows:

yield {
    'hreflink': mylink,
    'Parentlink': response.url
}

This returns a dict:

{
    'hreflink': "https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
    'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/"
}

Now I also want the text associated with this particular hreflink on that particular Parentlink page. My final output should look like this:

{
    'hreflink': "https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
    'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/",
    'Yourtext': "Download Pricing Info"
}

What would be the simplest way to achieve that? I want to use an XPath expression to get the text of the link on the Parentlink page whose @href equals my hreflink.

So far, here is what I tried:

Yourtext = response.xpath('//a[@href=' + json.dumps(each) + ']//text()').get()

but it's not printing anything. I tried printing my response and it returns the right page: 'https://www.southeasthealth.org/financial-information-price-transparency/'

CodePudding user response：

If I understand you correctly, you want to get the text belonging to the link, i.e. "Download Pricing Info".

I suggest you try using:

response.xpath("//span[@class='fusion-button-text']//text()").get()

CodePudding user response：

I found the answer to my question.

'//a[@href=' + json.dumps(each) + ']//text()'

This is the correct expression. However, the match on the href value 'each' is case-sensitive, so it needs to match the href in the page exactly for this XPath to work.
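
For anyone landing here later, below is a minimal sketch of the point above. It is not taken from the original spider: the helper names (href_text_xpath, link_text), the sample HTML, and the example.org URL are made up for illustration. So that it runs without Scrapy installed, it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text() step), so the link is matched first and its text collected afterwards.

```python
import json
import xml.etree.ElementTree as ET


def href_text_xpath(href):
    """Build the XPath string from the answer above for one exact href value."""
    # json.dumps wraps the URL in double quotes, which yields a usable XPath
    # string literal as long as the URL contains no double quotes, backslashes,
    # or non-ASCII characters (json.dumps escapes those; XPath 1.0 has no
    # escape sequences, so the literal would no longer equal the URL).
    return '//a[@href=' + json.dumps(href) + ']//text()'


# Hypothetical page fragment, loosely modelled on the button in the question.
SAMPLE = (
    '<div>'
    '<a href="https://example.org/files/Standard-Charges-2022.xlsx">'
    '<span class="fusion-button-text">Download Pricing Info</span>'
    '</a>'
    '</div>'
)


def link_text(root, href):
    """Return the text inside the <a> whose href equals `href`, or None."""
    # Same exact-match predicate as in the thread, in ElementTree's XPath subset.
    anchor = root.find('.//a[@href=' + json.dumps(href) + ']')
    if anchor is None:
        return None
    # Join every text node below the link, including the nested <span>.
    return ''.join(anchor.itertext()).strip()


root = ET.fromstring(SAMPLE)
exact = 'https://example.org/files/Standard-Charges-2022.xlsx'

print(href_text_xpath(exact))
print(link_text(root, exact))          # exact match: Download Pricing Info
print(link_text(root, exact.lower()))  # case differs: None
```

In the spider itself, the equivalent call would presumably be response.xpath(href_text_xpath(mylink)).get(), assuming mylink holds the href exactly as it is written in the page source. If the link was made absolute or otherwise normalized before being stored, it may no longer equal the raw @href value, and the expression will return nothing.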