My script is below. It first reads the plate_num_xlsx value from the Excel file data.xlsx successfully, then asks Scrapy to scrape data from the URL. In each parse() invocation I take the parsed values into item and then try to export them to Excel with pandas.
if plate_num_xlsx==plate.replace(" ","").strip():
    item= {"plate": plate.strip(), "price": price.strip()}
else:
    item = {"plate": plate.strip(), "price": "-"}

df_output=pd.DataFrame([item],columns=["PLATE","PRICE"])
df_output.to_excel("output_res.xlsx",sheet_name="result",index=False,header=True)
The Excel file output_res.xlsx is created successfully, but the parsed data in item is not being written to it. What can the issue be?
import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd

class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    allowed_domains = ['dvlaregistrations.direct.gov.uk']

    def start_requests(self):
        df=pd.read_excel('data.xlsx')
        columnA_values=df['PLATE']
        for row in columnA_values:
            global plate_num_xlsx
            plate_num_xlsx=row
            base_url =f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            url=base_url
            yield scrapy.Request(url)

    def parse(self, response):
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            if plate_num_xlsx==plate.replace(" ","").strip():
                item= {"plate": plate.strip(), "price": price.strip()}
            else:
                item = {"plate": plate.strip(), "price": "-"}
            df_output=pd.DataFrame([item],columns=["PLATE","PRICE"])
            df_output.to_excel("output_res.xlsx",sheet_name="result",index=False,header=True)

process = CrawlerProcess()
process.crawl(plateScraper)
process.start()
CodePudding user response:
The keys of the dictionary item are used as the columns of your new pd.DataFrame. As per the pandas documentation, the columns argument names the columns if the data doesn't already have column names; if the data does contain column names (as it does in your case, since the dict keys become column names), it is used to do a column select instead. As you don't have dictionary keys named "PLATE" or "PRICE", the resulting dataframe contains no data (only NaN values).

So skip providing column names to pd.DataFrame, and if you want to rename the columns, do so after creating the dataframe.
df_output=pd.DataFrame([item])
df_output.columns = ["PLATE","PRICE"]
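A quick way to see the difference (the plate and price values below are made up, purely for illustration):

import pandas as pd

item = {"plate": "ABC 123", "price": "£250"}

# Mismatched column names select non-existent columns, so the frame holds only NaN.
print(pd.DataFrame([item], columns=["PLATE", "PRICE"]))

# Build the frame from the dict keys first, then rename the columns.
df_output = pd.DataFrame([item])
df_output.columns = ["PLATE", "PRICE"]
print(df_output)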
CodePudding user response:
The allowed_domains variable is set to 'dvlaregistrations.direct.gov.uk', but the URLs being requested are on 'dvlaregistrations.dvla.gov.uk'. This might cause issues, since Scrapy's offsite filtering only follows requests within the listed domains.

The parse method also overwrites the output Excel file on every iteration, so only the last item will be present in the output file; see the sketch after these points for one way to collect everything and write once.
You can also pass the request callback explicitly, i.e. yield scrapy.Request(url, callback=self.parse), although parse is already the default callback.
You are using the global keyword to define a variable, which is not required here; it can be defined as an instance variable or passed along with the request.
It would also be best to use a unique name for the output file to avoid overwriting any existing files.
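Putting these points together, here is a minimal sketch of one way to restructure the spider. It keeps the data.xlsx / output_res.xlsx file names and the search URL from the question; passing the plate via cb_kwargs and writing the file once in closed() are choices made for this sketch, not the only way to do it:

import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd

class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    # match the domain that is actually requested
    allowed_domains = ['dvlaregistrations.dvla.gov.uk']

    def start_requests(self):
        self.items = []  # collect every parsed row here instead of rewriting the file
        df = pd.read_excel('data.xlsx')
        for plate_num_xlsx in df['PLATE']:
            url = (
                "https://dvlaregistrations.dvla.gov.uk/search/results.html"
                f"?search={plate_num_xlsx}&action=index&pricefrom=0&priceto="
                "&prefixmatches=&currentmatches=&limitprefix=&limitcurrent="
                "&limitauction=&searched=true&openoption=&language=en"
                "&prefix2=Search&super=&super_pricefrom=&super_priceto="
            )
            # pass the plate to the callback instead of using a global variable
            yield scrapy.Request(url, cb_kwargs={'plate_num_xlsx': plate_num_xlsx})

    def parse(self, response, plate_num_xlsx):
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            # use "PLATE"/"PRICE" as keys so no column renaming is needed later
            if plate_num_xlsx == plate.replace(" ", "").strip():
                self.items.append({"PLATE": plate.strip(), "PRICE": price.strip()})
            else:
                self.items.append({"PLATE": plate.strip(), "PRICE": "-"})

    def closed(self, reason):
        # write everything once, after the crawl has finished
        df_output = pd.DataFrame(self.items)
        df_output.to_excel("output_res.xlsx", sheet_name="result", index=False, header=True)

process = CrawlerProcess()
process.crawl(plateScraper)
process.start()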