I created a spider to scrape Ask search results for a set of user-defined keywords. But whenever I run the command scrapy crawl pageSearch -o test.json, it creates an empty test.json file, and I don't know why. To create this scraper I took inspiration from a developer page that showed how to scrape Google SERPs, and from the tutorial in the official Scrapy documentation. Here is a gist of what I get from the command line. I searched for a solution in Stack Overflow questions, but without success. Because of the following line in the command prompt output: 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware'
I believed it was an HTTP error, but after searching the internet it turned out not to be: according to my terminal the spider ran successfully, and the URL I specified in my code, where the scraping was supposed to take place, is valid. Personally, I don't see my mistake, so I'm lost. Here is my code:
import scrapy
import json
import datetime

class PagesearchSpider(scrapy.Spider):
    name = 'pageSearch'

    def start_requests(self):
        queries = ['love']
        for query in queries:
            url = 'https://www.ask.com/web?q=' + query
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        print(response.text)
        di = json.loads(response.text)
        pos = response.meta['pos']
        dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        for result in di['organic_results']:
            title = result['a.PartialSearchResults-item-title-link.result-link']
            snippet = result['p.PartialSearchResults-item-abstract']
            link = result['div.PartialSearchResults-item-url']
            item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
            pos += 1
            yield item
        next_page = di['pagination']['nextPageUrl']
        if next_page:
            yield scrapy.Request(next_page, callback=self.parse, meta={'pos': pos})
#scrapy crawl pageSearch -o test.json
I use Windows 10. I'm asking for your help, thank you!
CodePudding user response:
I found two problems:
First:
In your output you can see

[scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.ask.com/web?q=love>

which means Scrapy read https://www.ask.com/robots.txt, found the rule

User-agent: *
Disallow: /web

respected it, and skipped the URL https://www.ask.com/web?q=love.

You have to set ROBOTSTXT_OBEY = False in your settings.py to turn this off.

Scrapy doc: ROBOTSTXT_OBEY
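For a project-based spider that is the whole fix. This is a minimal sketch, assuming the standard layout created by scrapy startproject; alternatively, custom_settings (a standard Scrapy spider attribute) keeps the change local to this one spider:

# settings.py of your Scrapy project
# stop Scrapy from honouring ask.com's "Disallow: /web" rule
ROBOTSTXT_OBEY = False

# --- or, equivalently, only for this spider ---
import scrapy

class PagesearchSpider(scrapy.Spider):
    name = 'pageSearch'
    custom_settings = {'ROBOTSTXT_OBEY': False}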
Second:
You use di = json.loads(response.text), which means you expect JSON data, but this page sends HTML, so you have to use the selector functions response.css(...) and response.xpath(...) together with .get(), .attrib.get(...), etc.

Scrapy doc: Selectors
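As a quick illustration of the difference between extracting text and extracting an attribute (a sketch only; the selectors are the same ones used in the working code below):

# inside parse(self, response) - not a complete spider
for result in response.css('div.PartialSearchResults-item'):
    # '::text' plus .get() returns the text of the first matching node (or None)
    title = result.css('a.PartialSearchResults-item-title-link.result-link::text').get()
    # .attrib.get('href') reads an attribute of the first matching node
    link = result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href')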
Working code:

You can put all of the code in one file, script.py, and run it as python script.py without creating a project. It will also automatically save the results in test.json without using -o test.json.
import scrapy
import datetime

class PagesearchSpider(scrapy.Spider):
    name = 'pageSearch'

    def start_requests(self):
        queries = ['love']
        for query in queries:
            url = 'https://www.ask.com/web?q=' + query
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        print('url:', response.url)
        start_pos = response.meta['pos']
        print('start pos:', start_pos)
        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        items = response.css('div.PartialSearchResults-item')
        for pos, result in enumerate(items, start_pos + 1):
            yield {
                'title': result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(),
                'snippet': result.css('p.PartialSearchResults-item-abstract::text').get().strip(),
                'link': result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'),
                'position': pos,
                'date': dt,
            }
        # --- after loop ---
        next_page = response.css('.PartialWebPagination-next a')
        if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to prepend `https://www.ask.com/` and create an absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos + 1})

# --- run without a project, and save to a file ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save to a CSV, JSON or XML file
    'FEEDS': {'test.json': {'format': 'json'}},
    #'ROBOTSTXT_OBEY': True,  # this stops the scraping
})

c.crawl(PagesearchSpider)
c.start()
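If you prefer to keep a regular Scrapy project instead, the same spider class (without the CrawlerProcess block at the bottom) should work with your original command, scrapy crawl pageSearch -o test.json, once ROBOTSTXT_OBEY = False is set in settings.py.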