How can I extract URL of website with scrapy?

Time:06-19

I'm trying to scrape the Amazon website with Scrapy. I can easily scrape items such as the product title or price, but I have no clue how to extract the URL of a product (marked in the picture at the bottom). Currently my parse function looks like this:

    def parse(self, response):

        items = BigItem()

        all_boxes = response.css('.s-widget-spacing-small > .sg-col-inner')
        for boxes in all_boxes:
            name = boxes.css('.s-link-style .a-text-normal').css('::text').extract()
            author = boxes.css('.a-color-secondary .a-size-base:nth-child(2)').css('::text').extract()
            price = boxes.css('.s-price-instructions-style .a-price-whole').css('::text').extract()
            imagelink = boxes.css('.s-image::attr(src)').extract()
            rating = boxes.css('.a-spacing-top-small .aok-align-bottom').css('::text').extract()
            valuation = boxes.css('.a-spacing-top-small .s-link-style .s-underline-text').css('::text').extract()
            link = boxes.css('a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal::attr(href)').extract()

            items['name'] = name
            items['author'] = author
            items['price'] = price
            items['imagelink'] = imagelink
            items['rating'] = rating
            items['valuation'] = valuation
            items['link'] = link

            yield items

I also tried extracting with ::text, and with an outer .css('::text') or .css('::href'), but it's not working.

Screenshot: https://i.stack.imgur.com/f1doP.png

CodePudding user response:

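Put a period in front of each class name, e.g. .a-link-normal. The class names in your selector look like they were copied from the class attribute of a single <a> element, so they should also be chained without spaces (a space means "descendant of"):

    link = boxes.css("a.a-link-normal.s-underline-text.s-underline-link-text.s-link-style.a-text-normal::attr(href)").extract()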
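To illustrate the difference between a class attribute and a CSS selector, here is a minimal sketch that turns a copied class attribute into a compound selector. It assumes all the classes belong to one <a> tag; class_attr_to_selector is a hypothetical helper written for this example, not part of Scrapy. The ::attr(...) suffix is the pseudo-element Scrapy selectors use to read an attribute.

```python
def class_attr_to_selector(tag, class_attr, attr):
    """Build a compound CSS selector from a copied class attribute.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    classes = class_attr.split()
    # Chaining ".class" parts with no spaces targets ONE element that has
    # every class; spaces between them would mean "descendant of" instead.
    return tag + "".join("." + c for c in classes) + "::attr(" + attr + ")"


# Class attribute as it appears in the question's selector string.
copied = "a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"
selector = class_attr_to_selector("a", copied, "href")
print(selector)
```

The resulting string can be passed to boxes.css(...). A shorter selector with only one or two of the classes may be less brittle if the page markup changes.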

CodePudding user response:

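Use the .extract_first() or .get() method to get a single string instead of a list, and put a period in front of each class name (chained without spaces, since they appear to belong to the same element). Then prepend the domain:

    link = boxes.css('a.a-link-normal.s-underline-text.s-underline-link-text.s-link-style.a-text-normal::attr(href)').get()

    items['link'] = 'https://www.amazon.de/' + link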
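If the extracted href is relative, joining it to the base URL avoids doubled slashes and also copes with hrefs that are already absolute. The sketch below uses urljoin from the standard library; the example path is made up, and absolute_link is a hypothetical helper. Inside a spider, response.urljoin(link) performs the same kind of join against the URL of the current page.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.de/"  # domain taken from the answer above


def absolute_link(href, base=BASE_URL):
    """Return an absolute URL, or None when no href was found.

    Hypothetical helper; .get() returns None if the selector matches nothing.
    """
    if href is None:
        return None
    # urljoin resolves relative paths and leaves absolute URLs unchanged.
    return urljoin(base, href)


print(absolute_link("/some-product/dp/EXAMPLE"))  # made-up example path
```

Checking for None first matters because concatenating a string with None raises a TypeError when a result box has no matching link.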