How to organize multiple Scrapy Spiders and pass data between them?


I have a problem that is stopping me from making progress on my project. I'll try to explain it as clearly as I can, but I am relatively new to scraping.

  • I want to scrape articles from Website A.

  • Website A doesn't contain the articles' content in its own HTML; it only links to articles on other websites (let's say Website B and Website C).

  • I have created a Spider that extracts links from Website A and yields them in JSON format.

  • I want to take the extracted links from Website A and scrape the articles from Websites B and C.

Now - I want to create separate Spiders for Website B and Website C (so I can use them later to scrape those websites directly, not just through Website A) and somehow pass the data scraped from Website A to them as arguments - but the "somehow" part is what I need your help with.
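From what I've read, a spider can accept arguments from the command line via -a, roughly like the sketch below (the spider name, URLs, and selector are all made up by me) - but I don't see how to feed one spider's scraped output into another spider this way.

import scrapy


class WebsiteBSpider(scrapy.Spider):
    # Hypothetical spider; would be run with something like:
    #   scrapy crawl website_b -a urls="https://site-b.example/a1,https://site-b.example/a2"
    name = 'website_b'

    def __init__(self, urls='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a arguments arrive as plain strings, so split the list ourselves
        self.start_urls = [u for u in urls.split(',') if u]

    def parse(self, response):
        yield {'title': response.xpath('//title/text()').get()}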

Thank you :)

EDIT

Answering jqc - since I posted my question I have made some progress - this is my code so far.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'Website A Spider'

    start_urls = ['start_url']

    def parse(self, response):
        self.logger.info('###### Link Parser ######')
        important_news = response.xpath('//div[contains(@class, "importantNews")]//div[contains(@class, "items")]/a')

        for news in important_news:
            # Extract the link once and reuse it below
            article_url = news.xpath('./@href').get()

            yield {
                'link': article_url,
                'title': news.xpath('.//span[contains(@class, "title")]/text()').get()
            }

            self.logger.info('FOLLOWING URL OF THE ARTICLE')

            # Only follow links that point to Website B
            if 'Website B' in article_url:
                yield response.follow(article_url, callback=self.parse_Website_B)

    def parse_Website_B(self, response):
        yield {
            'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
        }

Don't worry about unfinished parsing, that's the least concerning part :)

Right now I am creating separate methods to parse particular websites, but I don't know if that is the optimal way.
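One alternative I've been considering is mapping domains to callbacks instead of chaining if/else checks - something like this sketch (the domains and the Website C selector are placeholders I made up):

import scrapy
from urllib.parse import urlparse


class QuotesSpider(scrapy.Spider):
    name = 'Website A Spider'
    start_urls = ['start_url']

    def parse(self, response):
        # Placeholder domains mapped to their per-site parse methods
        callbacks = {
            'website-b.example': self.parse_Website_B,
            'website-c.example': self.parse_Website_C,
        }
        important_news = response.xpath('//div[contains(@class, "importantNews")]//div[contains(@class, "items")]/a')
        for news in important_news:
            article_url = news.xpath('./@href').get()
            # Assumes absolute hrefs; relative links would need response.urljoin first
            callback = callbacks.get(urlparse(article_url).netloc)
            if callback:
                yield response.follow(article_url, callback=callback)

    def parse_Website_B(self, response):
        yield {'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()}

    def parse_Website_C(self, response):
        # Placeholder selector for Website C
        yield {'Website C article title': response.xpath('//h1/text()').get()}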

CodePudding user response:

I would like to see the URL you are trying to crawl; then I could run some tests and better understand your question. In the meantime I can give you some hints, though I am not sure I fully understand you. If you want to scrape the URLs collected from A, you can handle them directly in:

def parse_Website_B(self, response):
    yield {
        'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
    }

You just have to yield requests for the links; I would try start_requests. Have a look at the documentation here.
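For example, here is a minimal sketch of a standalone Website B spider that reads the links exported by your Website A spider. I am assuming you ran that spider with -o links.json, so the file name is an assumption, and the 'link' key comes from the item you yield above:

import json

import scrapy


class WebsiteBSpider(scrapy.Spider):
    name = 'website_b'

    def start_requests(self):
        # links.json is assumed to be the JSON feed exported by the Website A spider
        with open('links.json') as f:
            items = json.load(f)
        for item in items:
            # Reuse the same marker check you already apply in parse()
            if 'Website B' in item['link']:
                yield scrapy.Request(item['link'], callback=self.parse)

    def parse(self, response):
        yield {
            'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
        }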

If you provide the URL, we can try other approaches.

cheers
