When I run this spider I get an error. I want to scrape multiple links read from a CSV file, but it fails while building the start URLs. I am using Scrapy 2.5 and Python 3.9.7. Please help.
from scrapy import Request
from scrapy.http import request
import pandas as pd

class PagedataSpider(scrapy.Spider):
    name = 'pagedata'
    allowed_domains = ['www.imdb.com']

    def start_requests(self):
        df = pd.read_csv('list1.csv')
        #Here fileContainingUrls.csv is a csv file which has a column named as 'URLS'
        # contains all the urls which you want to loop over.
        urlList = df['link'].values.to_list()
        for i in urlList:
            yield scrapy.Request(url = i, callback=self.parse)
error:
2021-11-09 22:06:45 [scrapy.core.engine] INFO: Spider opened
2021-11-09 22:06:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-11-09 22:06:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-11-09 22:06:45 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "C:\Users\Vivek\Desktop\Scrapy\myenv\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
request = next(slot.start_requests)
File "C:\Users\Vivek\Desktop\Scrapy\moviepages\moviepages\spiders\pagedata.py", line 18, in start_requests
urlList = df['link'].values.to_list()
AttributeError: 'numpy.ndarray' object has no attribute 'to_list'
2021-11-09 22:06:45 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-09 22:06:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.007159,
'finish_reason': 'finished',
CodePudding user response:
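The error you received is rather straightforward: df['link'].values is a NumPy array, and a NumPy array doesn't have a to_list
method (the NumPy method is spelled tolist; to_list belongs to the pandas Series).
You don't need a list here at all, though. Simply iterate over the column itself: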
import scrapy
import pandas as pd

class PagedataSpider(scrapy.Spider):
    name = 'pagedata'
    allowed_domains = ['www.imdb.com']

    def start_requests(self):
        df = pd.read_csv('list1.csv')
        # A pandas Series is iterable, so no conversion to a list is needed.
        urls = df['link']
        for url in urls:
            # self.parse is assumed to be defined elsewhere in the spider.
            yield scrapy.Request(url=url, callback=self.parse)
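
If pandas is only there to read a single column, the standard library's csv module may be enough. The following is a minimal sketch, not part of the original spider: the helper name read_urls is hypothetical, and it assumes the CSV has a header row with a column called link, as in the question.

```python
import csv


def read_urls(csv_path, column="link"):
    """Return the non-empty values of one CSV column as a list of strings.

    Hypothetical helper: assumes the file has a header row containing `column`.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Skip rows where the column is missing, empty, or only whitespace.
        return [
            url
            for row in reader
            if (url := (row.get(column) or "").strip())
        ]


# Possible use inside the spider (assumes `import scrapy` and a parse method):
#     def start_requests(self):
#         for url in read_urls("list1.csv"):
#             yield scrapy.Request(url=url, callback=self.parse)
```

Skipping blank cells up front matters because Scrapy raises an error when a Request is built from an empty or scheme-less URL, which would otherwise stop start_requests partway through the file.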