I would expect to see "HIT" in my Visual Studio console but the process_listing
function is never executed.
When I run
scrapy crawl foo -O foo.json
I get this error:
start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable
I already checked here.
import json
import re
import os
import requests
import scrapy
import time
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import html2text
class FooSpider(scrapy.Spider):
    name = 'foo'
    start_urls = ['https://www.example.com/item.json?lang=en']

    def start_requests(self):
        r = requests.get(self.start_urls[0])
        cont = r.json()
        self.parse(cont)

    def parse(self, response):
        for o in response['objects']:
            if o.get('option') == "buy" and o.get('is_available'):
                listing_url = "https://www.example.com/" + \
                    o.get('brand').lower().replace(' ', '-') + "-" + \
                    o.get('model').lower() + "-"
                if o.get('make') is not None:
                    listing_url += o.get('make') + "-"
                listing_url += o.get('year').lower()
                print(listing_url)  # a valid url is printed here
                yield scrapy.Request(
                    url=response.urljoin(listing_url),
                    callback=self.process_listing
                )

    def process_listing(self, response):
        # this function is never executed
        print('HIT')
        yield item
I tried:
url=response.urljoin(listing_url)
url=listing_url
CodePudding user response:
Looking at the documentation for scrapy.Spider.start_requests, we see:
This method must return an iterable with the first Requests to crawl for this spider. It is called by Scrapy when the spider is opened for scraping. Scrapy calls it only once, so it is safe to implement start_requests() as a generator.
(emphasis mine)
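To see what that requirement means in plain Python (no Scrapy needed): a function that does work but never returns or yields evaluates to None, while a single yield turns a function into a generator, which is iterable:

```python
def broken_start_requests():
    # does some work but neither returns nor yields -> evaluates to None
    pass

def fixed_start_requests():
    # the yield makes this a generator function, so calling it
    # produces an iterable that Scrapy can consume
    yield "https://www.example.com/item.json?lang=en"

print(list(fixed_start_requests()))  # a one-element list of URLs
try:
    iter(broken_start_requests())
except TypeError as exc:
    print(exc)  # 'NoneType' object is not iterable
```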
Your start_requests method doesn't return anything (i.e., it returns None):
    def start_requests(self):
        r = requests.get(self.start_urls[0])
        cont = r.json()
        self.parse(cont)
So when Scrapy calls iter(self.spider.start_requests()), it ends up asking for iter(None), and None isn't iterable.