I have a problem that is stopping me from making progress on my project. I'll try to explain it as clearly as I can, but I am relatively new to scraping.
I want to scrape articles from Website A.
Website A doesn't have the articles' content in its HTML code but links to articles on other websites (let's say Website B and Website C).
I have created a Spider that extracts links from Website A and yields them in JSON format.
I want to take the extracted links from Website A and scrape the articles from Websites B and C.
Now I want to create separate Spiders for Website B and Website C (so I can later use them to scrape those websites directly, not just through Website A) and somehow pass the data scraped from Website A to them as arguments. The "somehow" part is what I need your help with.
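To make the "somehow" part more concrete, this is roughly the workflow I imagine, sketched with made-up names (website_b, links_file and website_a_links.json are placeholders): the Website A Spider exports its items to a JSON file, and a standalone Website B Spider receives that file as a command-line argument.

import json

import scrapy


class WebsiteBSpider(scrapy.Spider):
    # placeholder name for the standalone Website B spider
    name = 'website_b'

    def __init__(self, links_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # passed on the command line, e.g.:
        #   scrapy crawl website_b -a links_file=website_a_links.json
        self.links_file = links_file

    def start_requests(self):
        # read the items exported by the Website A spider and request each link
        with open(self.links_file) as f:
            for item in json.load(f):
                yield scrapy.Request(item['link'], callback=self.parse)

    def parse(self, response):
        # the actual Website B article parsing would go here
        yield {'Website B article title': response.xpath('//title/text()').get()}

Is something along these lines sensible, or is there a more idiomatic way to hand the scraped links over?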
Thank you :)
EDIT
Answering jqc: since I posted my question I have made some progress. This is my code so far.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'Website A Spider'
    start_urls = ['start_url']

    def parse(self, response):
        self.logger.info('###### Link Parser ######')
        important_news = response.xpath('//div[contains(@class, "importantNews")]//div[contains(@class, "items")]/a')
        for news in important_news:
            article_url = news.xpath('./@href').get()
            yield {
                'link': article_url,
                'title': news.xpath('.//span[contains(@class, "title")]/text()').get()
            }
            self.logger.info('FOLLOWING URL OF THE ARTICLE')
            # for now, only follow links that point to Website B
            if 'Website B' in article_url:
                yield response.follow(article_url, callback=self.parse_Website_B)

    def parse_Website_B(self, response):
        yield {
            'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
        }
Don't worry about the unfinished parsing, that's the least of my concerns :)
Right now I am creating separate methods to parse particular websites, but I don't know whether that is the optimal way.
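One alternative I was considering is mapping domain fragments to callbacks instead of growing an if/else chain; here is a rough sketch with invented domain strings and a placeholder start URL (I can't share the real sites):

import scrapy


class LinkDispatchSpider(scrapy.Spider):
    # hypothetical variant of my Website A spider, just to show the dispatch idea
    name = 'website_a_dispatch'
    start_urls = ['https://website-a.example/']

    # invented domain fragments mapped to per-site parse methods
    site_callbacks = {
        'website-b.example': 'parse_Website_B',
        'website-c.example': 'parse_Website_C',
    }

    def parse(self, response):
        for news in response.xpath('//div[contains(@class, "importantNews")]//div[contains(@class, "items")]/a'):
            article_url = news.xpath('./@href').get()
            if not article_url:
                continue
            for domain, callback_name in self.site_callbacks.items():
                if domain in article_url:
                    yield response.follow(article_url, callback=getattr(self, callback_name))

    def parse_Website_B(self, response):
        yield {'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()}

    def parse_Website_C(self, response):
        # placeholder selector, I haven't written the Website C parser yet
        yield {'Website C article title': response.xpath('//h1/text()').get()}

But that still keeps everything in one spider, which is what I am trying to move away from.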
CodePudding user response:
I would like to see the URL you are trying to crawl; then I could run some tests and work out exactly what you need. I can give you some hints, although I am not sure I fully understand your question. If you want to scrape the URLs extracted from Website A, you can handle them directly in a callback like this:
def parse_Website_B(self, response):
    yield {
        'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
    }
You just have to yield requests for those links; I would try start_requests. Have a look at the documentation here.
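A minimal sketch of what I mean, assuming your links end up in a plain Python list (the URLs below are placeholders):

import scrapy


class WebsiteBArticlesSpider(scrapy.Spider):
    name = 'website_b_articles'

    # placeholder URLs; in your case these would be the links scraped from Website A
    article_links = [
        'https://website-b.example/article-1',
        'https://website-b.example/article-2',
    ]

    def start_requests(self):
        # one request per link, instead of relying on start_urls
        for url in self.article_links:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {
            'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
        }

You could also pass the links in as a spider argument with -a and read them in start_requests, which keeps the Website B spider reusable on its own.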
If you provide the URL, we can try other approaches.
cheers