Scrapy returns entire webpage starting at css selector

Time:12-18

I am scraping blog posts and have run into a strange issue. When I extract an entire element instead of only its text, Scrapy returns the selected element plus every element and closing tag that comes after it in the webpage. For example, I have this code:

import scrapy


class postscraperSpider(scrapy.Spider):
    name = 'postscraper'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/blog-post/']

    def parse(self, response):
        yield {
            'title': response.css('.title_container > h1.entry-title::text').get(),
            'content': response.css('div.text_1 .text_inner h2').get()
        }

When run, title is populated with the proper text. However, content contains the correct element followed by every element and closing tag that comes after it in the page.

If I extract only the text instead, the field is populated correctly:

    def parse(self, response):
        yield {
            'title': response.css('.title_container > h1.entry-title::text').get(),
            'content': response.css('div.text_1 .text_inner h2::text').get()
        }

I cannot just extract the text, because I won't be extracting only h2s from text_inner. I need all of its children, including their tags. What I really need is code that looks like this, but I felt the example above illustrated my issue better:

    def parse(self, response):
        yield {
            'title': response.css('.title_container > h1.entry-title::text').get(),
            'content': response.css('div.text_1 .text_inner > *').get()
        }

Thank you for any help that you can offer.

Related: No text printed when using response.xpath() or response.css in scrapy

Also related: Python: Scrapy returning all html following element instead of just html of element

It looks like it's an environment bug. I'm going to try reinstalling Anaconda.

CodePudding user response:

You could try using the .extract_first() method instead of .get(). It is hard to tell whether your CSS selector is correct, because start_urls only contains a placeholder site. Try opening the page in Chrome, searching for the CSS selector you used, and checking whether it matches all of the following elements and closing tags.

CodePudding user response:

Reinstalling Anaconda fixed this issue for me. I'm not sure what happened. I had both Python 3.8 and 3.9 installed, so it may have been a conflict between those.
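
Before reinstalling, it may help to see which interpreter and package versions are actually in use when the spider runs. The following is only a diagnostic sketch using the standard library; the function name environment_report is hypothetical, and it does not by itself confirm that a version conflict is the cause:

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("scrapy", "parsel", "lxml")):
    """Collect the interpreter path, Python version, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed for this interpreter.
            versions[name] = None
    return {
        "executable": sys.executable,
        "python": platform.python_version(),
        "packages": versions,
    }


report = environment_report()
print(report)
```

Running this with the same interpreter that launches Scrapy shows whether the expected Python installation is the one being picked up.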
