I'm using Debian Bullseye (11.2) and I want to save the scraped data to a .csv file. How can I do this?
from scrapy.spiders import CSVFeedSpider


class CsSpiderSpider(CSVFeedSpider):
    name = 'cs_spider'
    allowed_domains = ['ocw.mit.edu/courses/electrical-engineering-and-computer-science/']
    start_urls = ['http://ocw.mit.edu/courses/electrical-engineering-and-computer-science//feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    # def adapt_response(self, response):
    #     return response

    def parse_row(self, response, row):
        i = {}
        # i['url'] = row['url']
        # i['name'] = row['name']
        # i['description'] = row['description']
        return i
CodePudding user response:
The csv module is part of Python's standard library, so it is included with every Python installation.
You can use csv.writer() to create and write a .csv file without installing anything extra.
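A minimal sketch of that approach (the field names and rows below are just placeholders, not something from your spider):

import csv

# Hypothetical rows collected by a scraper
rows = [
    {'url': 'http://example.com/a', 'name': 'Course A', 'description': 'Intro'},
    {'url': 'http://example.com/b', 'name': 'Course B', 'description': 'Advanced'},
]

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'name', 'description'])  # header row
    for row in rows:
        writer.writerow([row['url'], row['name'], row['description']])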
CodePudding user response:
Here's an example of using the FEEDS export from Scrapy.
import scrapy
from scrapy.crawler import CrawlerProcess


class CsspiderSpider(scrapy.Spider):
    name = 'cs_spider'
    start_urls = ['http://ocw.mit.edu/courses/electrical-engineering-and-computer-science']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url, callback=self.parse_row
            )

    def parse_row(self, response):
        yield {
            'test': response.text
        }


process = CrawlerProcess(
    settings={
        'FEEDS': {
            'data.csv': {
                'format': 'csv'
            }
        }
    }
)
process.crawl(CsspiderSpider)
process.start()
This will save the spider's output to data.csv in .csv format. Furthermore, to specify which columns to export and their order, use FEED_EXPORT_FIELDS. You can read more about this in the docs.
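For instance, assuming your items have url, name, and description fields (matching the commented-out fields in your question), the settings could look like this:

process = CrawlerProcess(
    settings={
        'FEEDS': {
            'data.csv': {'format': 'csv'},
        },
        # Export only these fields, in this order
        'FEED_EXPORT_FIELDS': ['url', 'name', 'description'],
    }
)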
Alternatively, in the command line you can run:

scrapy crawl cs_spider -o output.csv

However, when running it this way, make sure to comment out all the code from process = CrawlerProcess( and below, since the scrapy crawl command starts its own crawler process.