I want to scrape multiple URLs stored in a CSV file using Scrapy. My code runs without errors, but it only scrapes the last URL instead of all of them. Please tell me what I'm doing wrong. I want to scrape all the URLs and save the scraped text together. I have already tried many of the suggestions found on StackOverflow. My code:
import scrapy
from scrapy import Request
from ..items import personalprojectItem


class ArticleSpider(scrapy.Spider):
    name = 'articles'
    with open('C:\\Users\\Admin\\Documents\\Bhavya\\input_urls.csv') as file:
        for line in file:
            start_urls = line

    def start_requests(self):
        request = Request(url=self.start_urls)
        yield request

    def parse(self, response):
        item = personalprojectItem()
        article = response.css('div p::text').extract()
        item['article'] = article
        yield item
Answer:
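The spider in the question most likely fetches only the last URL because the loop assigns each line to the same name, start_urls, so every iteration replaces the previous value. By the time start_requests runs, only the final line is left. The snippet below is a small sketch of that effect in plain Python, without Scrapy; the example.com URLs and the names overwritten and collected are made up for illustration.

```python
# Stand-in for the lines of the CSV file (hypothetical URLs).
lines = [
    "https://example.com/a\n",
    "https://example.com/b\n",
    "https://example.com/c\n",
]

# What the question's class body does: rebind the same name on every line.
for line in lines:
    start_urls = line
overwritten = start_urls  # only the last line is left

# Collecting into a list keeps every URL (and strips the trailing newline).
collected = [line.strip() for line in lines if line.strip()]

print(repr(overwritten))
print(collected)
```

A list of URLs, rather than a single string, is what Scrapy expects in start_urls.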
Below is a minimal example of how you can load a list of URLs from a file in a Scrapy project.
We have a text file, urls_list.txt, inside the Scrapy project folder, with one link per line:
https://www.theguardian.com/technology/2022/nov/18/elon-musk-twitter-engineers-workers-mass-resignation
https://www.theguardian.com/world/2022/nov/18/iranian-protesters-set-fire-to-ayatollah-khomeinis-ancestral-home
https://www.theguardian.com/world/2022/nov/18/canada-safari-park-shooting-animals-two-charged
The spider code looks like this (again, minimal example):
import scrapy


class GuardianSpider(scrapy.Spider):
    name = 'guardian'
    allowed_domains = ['theguardian.com']

    # Read one URL per line; strip the trailing newline and skip blank lines.
    with open('urls_list.txt') as f:
        start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # Scrapy calls parse() once for each URL in start_urls.
        title = response.xpath('//h1/text()').get()
        header = response.xpath('//div[@data-gu-name="standfirst"]//p/text()').get()
        yield {
            'title': title,
            'header': header
        }
If we run the spider with scrapy crawl guardian -o guardian_news.json, we get a JSON file looking like this:
[
{"title": "Elon Musk summons Twitter engineers amid mass resignations and puts up poll on Trump ban", "header": "Reports show nearly 1,200 workers left company after demand for \u2018long hours at high intensity\u2019, while Musk starts poll on whether to reinstate Donald Trump"},
{"title": "Iranian protesters set fire to Ayatollah Khomeini\u2019s ancestral home", "header": "Social media images show what is now a museum commemorating the Islamic Republic founder ablaze as protests continue"},
{"title": "Two Canadian men charged with shooting animals at safari park", "header": "Mathieu Godard and Jeremiah Mathias-Polson accused of breaking into Parc Omega in Quebec and killing three wild boar and an elk"}
]
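The same idea should carry over to the CSV file from the question. The helper below is a minimal sketch, assuming the file has no header row and holds one URL in the first column of each row; read_urls is a hypothetical name, not part of Scrapy.

```python
import csv


def read_urls(csv_path):
    """Return the first-column value of every non-empty row as a list."""
    urls = []
    # newline='' is the setting the csv module recommends for opening files.
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            # Blank lines come back as empty rows; skip them.
            if row and row[0].strip():
                urls.append(row[0].strip())
    return urls
```

In the spider, the class attribute could then be set with start_urls = read_urls('input_urls.csv'), and the custom start_requests method from the question can be dropped, since Scrapy's default behaviour is to issue one request per entry in start_urls.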
Scrapy documentation can be found here: https://docs.scrapy.org/en/latest/