Python webscraping issue with try yield, removing previous data

Time:08-05

I am facing an issue in Python when using Scrapy with try and yield.

If I run the script below with only 'name', it returns the full list of names on the page, as expected. However, when I add the price, the out-of-stock items have no price, so the script returns 'no price' as expected, but it also removes the name from the output. I do not understand why it does this. I have added a screenshot of the two executions below (one with only the name, the other with both name and price).

import scrapy
class TescoSpider(scrapy.Spider):
    name = 'tesco'
    start_urls = ['https://www.tesco.com/groceries/en-GB/shop/health-and-beauty/deodorants/all']
    def parse(self, response):
        for products in response.css('li.product-list--list-item'):
            try:
                yield {
                    'name': products.css('span.styled__Text-sc-1xbujuz-1.ldbwMG.beans-link__text::text').get(),
                    'price': products.css('p.styled__StyledHeading-sc-119w3hf-2.jWPEtj.styled__Text-sc-8qlq5b-1.lnaeiZ.beans-price__text::text').get().replace('£',''),
                }
            except:
                yield {
                    'name': 'no name',
                    'price': 'no price',
                }

[Screenshot: output file showing the issue]

CodePudding user response:

This happens because, when an item has no price, the element matched by the CSS selector you use to capture the price is not there either:

products.css('p.styled__StyledHeading-sc-119w3hf-2.jWPEtj.styled__Text-sc-8qlq5b-1.lnaeiZ.beans-price__text::text')

This causes the selector to match nothing. The .css() call itself does not raise an error; it just returns an empty result.

So when you call products.css('p.styled__StyledHeading-sc-119w3hf-2.jWPEtj.styled__Text-sc-8qlq5b-1.lnaeiZ.beans-price__text::text').get() the return value is None.

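Then you immediately call .replace('£',''), which raises an AttributeError because None is not a string and therefore has no replace method. The exception is raised while the dictionary is still being built, so the whole item is discarded, including the name that was already extracted. The bare except then yields the fallback item with 'no name' and 'no price', which is why the name disappears as well.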
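The behaviour can be reproduced without Scrapy. The sketch below is only an illustration: build_item_with_try is a hypothetical helper, the product names are made up, and price_text stands in for whatever .get() returned (None when nothing matched).

```python
def build_item_with_try(name, price_text):
    """Mimic the try/except in the spider; price_text is None when no element matched."""
    try:
        return {
            'name': name,
            # Raises AttributeError when price_text is None
            'price': price_text.replace('£', ''),
        }
    except AttributeError:
        # The dict above was never completed, so the name is lost as well
        return {'name': 'no name', 'price': 'no price'}


# Hypothetical sample values, not taken from the real page
in_stock = build_item_with_try('Deodorant A', '£1.50')
out_of_stock = build_item_with_try('Deodorant B', None)
print(in_stock)
print(out_of_stock)
```

The second call shows the same symptom as the spider: a valid name goes in, but the fallback item comes out.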

To fix this without a try/except block, you just need to evaluate the price in separate steps.

For example:

import scrapy


class TescoSpider(scrapy.Spider):
    name = 'tesco'
    start_urls = ['https://www.tesco.com/groceries/en-GB/shop/health-and-beauty/deodorants/all']
    def parse(self, response):
        for products in response.css('li.product-list--list-item'):
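            # .get() returns None when no price element matches (e.g. out-of-stock items)
            price = products.css('p.styled__StyledHeading-sc-119w3hf-2.jWPEtj.styled__Text-sc-8qlq5b-1.lnaeiZ.beans-price__text::text').get()
            yield {
                'name': products.css('span.styled__Text-sc-1xbujuz-1.ldbwMG.beans-link__text::text').get(),
                # Only strip the currency symbol when a price was actually found
                'price': price.replace('£','') if price else None
            }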
            }
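If you would rather keep the 'no price' placeholder from the question instead of None, the same check can be moved into a small helper. This is a minimal sketch: clean_price and its missing parameter are hypothetical names, not part of Scrapy.

```python
def clean_price(price_text, missing='no price'):
    """Return the price without the pound sign, or a placeholder when it is absent."""
    if price_text is None:
        # Nothing matched the price selector (e.g. an out-of-stock item)
        return missing
    return price_text.replace('£', '').strip()


print(clean_price('£1.50'))
print(clean_price(None))
```

Inside parse you would then write 'price': clean_price(price), and the name is left untouched either way.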
            