loop through multiple URLs to scrape from a CSV file in Scrapy is not working


When I try to execute this loop I get an error. I want to scrape multiple links from a CSV file, but it gets stuck in start_urls. I am using Scrapy 2.5 and Python 3.9.7.

import scrapy
import pandas as pd


class PagedataSpider(scrapy.Spider):
    name = 'pagedata'
    allowed_domains = ['www.imdb.com']

    def start_requests(self):
        df = pd.read_csv('list1.csv')
        # list1.csv has a column named 'link' containing all the URLs to loop over.
        urlList = df['link'].values.to_list()
        for i in urlList:
            yield scrapy.Request(url = i, callback=self.parse)

error:

2021-11-09 22:06:45 [scrapy.core.engine] INFO: Spider opened
2021-11-09 22:06:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-11-09 22:06:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-11-09 22:06:45 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Users\Vivek\Desktop\Scrapy\myenv\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "C:\Users\Vivek\Desktop\Scrapy\moviepages\moviepages\spiders\pagedata.py", line 18, in start_requests
    urlList = df['link'].values.to_list()
AttributeError: 'numpy.ndarray' object has no attribute 'to_list'
2021-11-09 22:06:45 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-09 22:06:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.007159,
 'finish_reason': 'finished',

CodePudding user response:

The error you received is rather straightforward: a numpy array has no `to_list` method (numpy spells it `tolist`, while a pandas Series provides `to_list`).

Instead you can simply iterate over the column directly:

import scrapy
import pandas as pd


class PagedataSpider(scrapy.Spider):
    name = 'pagedata'
    allowed_domains = ['www.imdb.com']

    def start_requests(self):
        df = pd.read_csv('list1.csv')

        urls = df['link']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
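To see why the original line failed while the fix works, here is a minimal sketch outside of Scrapy, using a hypothetical DataFrame standing in for `list1.csv`. It shows the method-name difference between a pandas Series and the underlying numpy array:

```python
import pandas as pd

# Hypothetical stand-in for list1.csv's 'link' column.
df = pd.DataFrame({"link": [
    "https://www.imdb.com/title/tt0111161/",
    "https://www.imdb.com/title/tt0068646/",
]})

# df["link"] is a pandas Series, which does have a to_list() method.
urls_from_series = df["link"].to_list()

# df["link"].values is a numpy ndarray, whose method is tolist()
# (no underscore); calling .to_list() on it raises AttributeError.
urls_from_array = df["link"].values.tolist()

assert urls_from_series == urls_from_array
```

So either drop `.values` (as in the answer above) or spell the numpy method as `tolist()`; iterating over the Series directly avoids the conversion entirely.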