My script is below. It first reads the plate_num_xlsx value from the Excel file data.xlsx successfully, then asks Scrapy to scrape data from the URL. In each parse() invocation I take the parsed values into item and then try to export them to Excel with pandas.
if plate_num_xlsx==plate.replace(" ","").strip():
    item= {"plate": plate.strip(), "price": price.strip()}
else:
    item = {"plate": plate.strip(), "price": "-"}

df_output=pd.DataFrame([item],columns=["PLATE","PRICE"])
df_output.to_excel("output_res.xlsx",sheet_name="result",index=False,header=True)
The Excel file output_res.xlsx is created successfully, but the parsed data in item is not being written to it. What can the issue be?
import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd

class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    allowed_domains = ['dvlaregistrations.direct.gov.uk']

    def start_requests(self):
        df=pd.read_excel('data.xlsx')
        columnA_values=df['PLATE']
        for row in columnA_values:
            global plate_num_xlsx
            plate_num_xlsx=row
            base_url =f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            url=base_url
            yield scrapy.Request(url)

    def parse(self, response):
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            if plate_num_xlsx==plate.replace(" ","").strip():
                item= {"plate": plate.strip(), "price": price.strip()}
            else:
                item = {"plate": plate.strip(), "price": "-"}
            df_output=pd.DataFrame([item],columns=["PLATE","PRICE"])
            df_output.to_excel("output_res.xlsx",sheet_name="result",index=False,header=True)

process = CrawlerProcess()
process.crawl(plateScraper)
process.start()
CodePudding user response:
The keys of the dictionary item are used as the columns of your new pd.DataFrame. As per the pandas documentation, the columns argument names the columns if the data doesn't already have column names; if the data does contain column names (as it does in your case, since the dict keys become column names), it is used to do a column select instead. As you don't have dictionary keys named "PLATE" or "PRICE", the resulting dataframe contains no data (only NaN values).

So skip providing column names to pd.DataFrame, and if you want to rename the columns, do so after creating the dataframe.
df_output=pd.DataFrame([item])
df_output.columns = ["PLATE","PRICE"]
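A quick way to see the difference (the plate and price values below are made up, purely for illustration):

import pandas as pd

item = {"plate": "ABC 123", "price": "£250"}

# Mismatched column names select non-existent columns, so the frame holds only NaN.
print(pd.DataFrame([item], columns=["PLATE", "PRICE"]))

# Build the frame from the dict keys first, then rename the columns.
df_output = pd.DataFrame([item])
df_output.columns = ["PLATE", "PRICE"]
print(df_output)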
CodePudding user response:
The allowed_domains variable is set to 'dvlaregistrations.direct.gov.uk', but the URLs being requested are on 'dvlaregistrations.dvla.gov.uk'. This might cause issues, since Scrapy's offsite filtering only follows requests within the listed domains.

The parse method also overwrites the output Excel file on every iteration, so only the last item will be present in the output file; see the sketch after these points for one way to collect everything and write once.
You can also pass the request callback explicitly, i.e. yield scrapy.Request(url, callback=self.parse), although parse is already the default callback.
You are using the global keyword to define a variable, which is not required here; it can be defined as an instance variable or passed along with the request.
It would also be best to use a unique name for the output file to avoid overwriting any existing files.
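Putting these points together, here is a minimal sketch of one way to restructure the spider. It keeps the data.xlsx / output_res.xlsx file names and the search URL from the question; passing the plate via cb_kwargs and writing the file once in closed() are choices made for this sketch, not the only way to do it:

import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd

class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    # match the domain that is actually requested
    allowed_domains = ['dvlaregistrations.dvla.gov.uk']

    def start_requests(self):
        self.items = []  # collect every parsed row here instead of rewriting the file
        df = pd.read_excel('data.xlsx')
        for plate_num_xlsx in df['PLATE']:
            url = (
                "https://dvlaregistrations.dvla.gov.uk/search/results.html"
                f"?search={plate_num_xlsx}&action=index&pricefrom=0&priceto="
                "&prefixmatches=&currentmatches=&limitprefix=&limitcurrent="
                "&limitauction=&searched=true&openoption=&language=en"
                "&prefix2=Search&super=&super_pricefrom=&super_priceto="
            )
            # pass the plate to the callback instead of using a global variable
            yield scrapy.Request(url, cb_kwargs={'plate_num_xlsx': plate_num_xlsx})

    def parse(self, response, plate_num_xlsx):
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            # use "PLATE"/"PRICE" as keys so no column renaming is needed later
            if plate_num_xlsx == plate.replace(" ", "").strip():
                self.items.append({"PLATE": plate.strip(), "PRICE": price.strip()})
            else:
                self.items.append({"PLATE": plate.strip(), "PRICE": "-"})

    def closed(self, reason):
        # write everything once, after the crawl has finished
        df_output = pd.DataFrame(self.items)
        df_output.to_excel("output_res.xlsx", sheet_name="result", index=False, header=True)

process = CrawlerProcess()
process.crawl(plateScraper)
process.start()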