I am trying to read values from an Excel column and scrape one page per value by inserting each value into the URL. The script raises a TypeError: raise TypeError(f"Request url must be str, got {type(url).__name__}")
Below is my script.
import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd

plate_num_xlsx = 'LA55ERR'
base_url=[f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="]

class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    allowed_domains = ['dvlaregistrations.direct.gov.uk']
    start_urls = [f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="]

    def start_requests(self):
        df=pd.read_excel('data.xlsx')
        columnA_values=df['PLATE']
        for row in columnA_values:
            print(row)
            plate_num_xlsx=row
            print(plate_num_xlsx)
            url=base_url
            yield scrapy.Request(url)

    def parse(self, response):
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            if plate_num_xlsx==plate.replace(" ","").strip():
                print(plate.replace(" ", ""))
                yield {"plate": plate.strip(), "price": price.strip()}

process = CrawlerProcess()
process.crawl(plateScraper)
process.start()
CodePudding user response:
The error occurs because url is a list, not a string. base_url is defined at module level as a one-element list, and start_requests assigns that list to url and passes it to scrapy.Request, which only accepts a str. There is a second problem: the f-string inside base_url is evaluated once, when the module is loaded, so reassigning plate_num_xlsx inside the loop never changes the URL and every request would search for 'LA55ERR'. The fix is to build the URL as a string inside the loop, once per plate, and pass that string to scrapy.Request.
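To see why reassigning the plate inside the loop has no effect on a URL built earlier, here is a minimal sketch in plain Python, with no Scrapy involved. The example.com URL and the second plate value are made up for illustration.

```python
plate = "LA55ERR"
# The f-string is evaluated once, right here; the list stores the finished text.
base_url = [f"https://example.com/search?plate={plate}"]

# Rebinding the name later does not touch the string already stored in the list.
plate = "AB12CDE"

print(type(base_url).__name__)  # the type name that ends up in the TypeError message
print(base_url[0])              # still built from the first plate

# Building the string inside the loop gives one URL per plate, each of type str.
urls = [f"https://example.com/search?plate={p}" for p in ["LA55ERR", "AB12CDE"]]
print(urls)
```

Each element of urls is a plain str, which is the kind of value scrapy.Request expects.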
Here is an updated version of the script that should work:
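import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd


class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    # Must match the host in the URL below, otherwise Scrapy's offsite filtering may drop the requests.
    allowed_domains = ['dvlaregistrations.dvla.gov.uk']

    def start_requests(self):
        df = pd.read_excel('data.xlsx')
        columnA_values = df['PLATE']
        for row in columnA_values:
            # Normalise the cell value to a plain string without spaces.
            plate_num_xlsx = str(row).replace(" ", "").strip()
            # Build the URL as a str (not a list), once per plate.
            url = f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            # cb_kwargs hands the plate to parse(), which cannot see this local variable.
            yield scrapy.Request(url, callback=self.parse, cb_kwargs={"plate_num_xlsx": plate_num_xlsx})

    def parse(self, response, plate_num_xlsx):
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            # Skip rows where either selector matched nothing.
            if plate is None or price is None:
                continue
            if plate_num_xlsx == plate.replace(" ", "").strip():
                print(plate.replace(" ", ""))
                yield {"plate": plate.strip(), "price": price.strip()}


process = CrawlerProcess()
process.crawl(plateScraper)
process.start()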
Now the script reads each plate number from the Excel file in start_requests, builds the full URL as a string for that plate, and yields one request per URL.
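If the plate values in the spreadsheet may contain spaces or lower-case letters, it can help to normalise them and let the standard library encode the query string. The sketch below is one way to do that. The helper names normalize_plate and build_search_url are hypothetical, and it is an untested assumption that the site accepts this reduced set of query parameters instead of the full list used above.

```python
from urllib.parse import urlencode

SEARCH_PAGE = "https://dvlaregistrations.dvla.gov.uk/search/results.html"


def normalize_plate(text):
    """Remove all whitespace and upper-case the value (hypothetical helper)."""
    return "".join(str(text).split()).upper()


def build_search_url(plate):
    """Return the search URL for one plate as a str (hypothetical helper)."""
    # Assumption: only a subset of the original query parameters is sent.
    params = {
        "search": normalize_plate(plate),
        "action": "index",
        "pricefrom": 0,
        "searched": "true",
        "language": "en",
    }
    # urlencode escapes the values and joins them with '&'.
    return f"{SEARCH_PAGE}?{urlencode(params)}"


url = build_search_url(" la55 err ")
print(url)
```

Inside start_requests, the result of build_search_url(row) could be passed straight to scrapy.Request, since it is always a str.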