How to scrape Ask engine search results using scrapy?


I created a spider to scrape Ask search results for a set of user-defined keywords. But whenever I run the command scrapy crawl pageSearch -o test.json, it creates an empty test.json file, and I don't know why. To build this, I took inspiration from a developer page that showed how to scrape Google SERPs and from the tutorial in the official Scrapy documentation. Here is a gist of what I get on the command line. I searched Stack Overflow for a solution, but without success. From the line 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware' in the output I believed it was an HTTP error, but after searching the internet it was not: according to my terminal the spider ran successfully, and the URL I specified in my code, where the scraping was supposed to happen, is valid. Personally, I don't see my mistake, so I'm lost. Here is my code:

import scrapy
import json
import datetime


class PagesearchSpider(scrapy.Spider):
    name = 'pageSearch'

    def start_requests(self):
        queries = [ 'love']
        for query in queries:
            url = 'https://www.ask.com/web?q=' + query
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
           print(response.text)
           di = json.loads(response.text)
           pos = response.meta['pos']
           dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
           for result in di['organic_results']:
               title = result['a.PartialSearchResults-item-title-link.result-link']
               snippet = result['p.PartialSearchResults-item-abstract']
               link = result['div.PartialSearchResults-item-url']
               item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
               pos += 1
               yield item

               next_page = di['pagination']['nextPageUrl']
               if next_page:
                   yield scrapy.Request(next_page, callback=self.parse, meta={'pos': pos})

                   #scrapy crawl pageSearch -o test.json

I'm using Windows 10. I'd appreciate your help, thank you!

CodePudding user response:

I found two problems:


First:

In your output you can see

[scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.ask.com/web?q=love>

which means Scrapy read https://www.ask.com/robots.txt and found the rule

User-agent: *
Disallow: /web

Scrapy respects it, so it skips the URL https://www.ask.com/web?q=love.

You have to set ROBOTSTXT_OBEY = False in your settings.py to turn this off.

Scrapy doc: ROBOTSTXT_OBEY
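
If you keep the usual project layout instead of the standalone script shown below, the change is one line in your project's settings.py (a minimal sketch):

# settings.py
# stop Scrapy from honouring https://www.ask.com/robots.txt for this project
ROBOTSTXT_OBEY = False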


Second:

You use di = json.loads(response.text), which means you expect JSON data, but this page sends HTML, so you have to use the selector methods response.css(...), response.xpath(...) together with .get(), .attrib.get(...), etc.

Scrapy doc: Selectors
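
A quick way to see the difference inside the parse() callback (a sketch; the selector matches the markup used in the working code below and may break if Ask changes its HTML):

def parse(self, response):
    # the body is HTML, so json.loads(response.text) would raise json.JSONDecodeError
    print(response.headers.get('Content-Type'))  # e.g. b'text/html; charset=utf-8'
    # extract the first result title with a CSS selector instead
    print(response.css('a.PartialSearchResults-item-title-link::text').get())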


Working code:

You can put all the code in one file, script.py, and run it as python script.py without creating a project. It will also automatically save the results in test.json without using -o test.json.

import scrapy
import datetime

class PagesearchSpider(scrapy.Spider):

    name = 'pageSearch'

    def start_requests(self):
        queries = [ 'love']
        for query in queries:
            url = 'https://www.ask.com/web?q=' + query
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        print('url:', response.url)
        
        start_pos = response.meta['pos']
        print('start pos:', start_pos)

        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')    
        
        items = response.css('div.PartialSearchResults-item')
        
        for pos, result in enumerate(items, start_pos + 1):
            yield {
                'title':    result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(), 
                'snippet':  result.css('p.PartialSearchResults-item-abstract::text').get().strip(), 
                'link':     result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'), 
                'position': pos, 
                'date':     dt,
            }

        # --- after loop ---
        
        next_page = response.css('.PartialWebPagination-next a')
        
        if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to add `https://www.ask.com/` to URL and create absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos + 1})


# --- run without project, and save in file ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'test.json': {'format': 'json'}},
    #'ROBOTSTXT_OBEY': True,  # this stops the scraping
})
c.crawl(PagesearchSpider)
c.start() 
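
Run it with python script.py. One thing to keep in mind: with this FEEDS setting Scrapy appends to test.json if the file already exists, which can leave invalid JSON after several runs; if your Scrapy version supports it (2.4+), you can ask for a fresh file each time (a sketch of the changed line):

    'FEEDS': {'test.json': {'format': 'json', 'overwrite': True}},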