Correct headers and payload for scraping a website that uses ajax

Time:05-28

I am trying to simulate an AJAX request with Scrapy's FormRequest to get the next page of this website: https://www.the-academy.nl/trainingen. My headers look like this:

headers = {
        'path': 'https://www.the-academy.nl/Page?$$ajaxid=view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:tblView',
        'authority': 'www.the-academy.nl',
        'accept-encoding': 'gzip, deflate, br',
        'content-length': '1225',
        'content-type': 'multipart/form-data'
    }

and my formdata looks like this:

formdata = {
        '$$viewid': '!1rjej6ewgse3x0h6r86gfzlst!',
        '$$xspsubmitid': 'view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:viewPager__Group__lnk__1',
        '$$xspexecid': 'view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:viewPager',
        '$$xspsubmitvalue':'',
        '$$xspsubmitscroll': '0|1272',
    }

I do get a response, but it's the 404 page. Thank you in advance.

CodePudding user response:

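  1. I used java as the search term and kept only the form fields that appear as key-value pairs in the request payload.

  2. Don't set the 'content-length' header yourself; the HTTP client computes it from the request body.

  3. Add method='POST'.

  4. Call FormRequest.from_response, or build the FormRequest by hand as in the script below.

  5. Below is an example that returns a 200 response status.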
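Point 2 can be sketched without Scrapy. This is only a minimal illustration: `clean_headers` and `SKIP` are hypothetical names, not part of any library. It assumes the headers were copied from the browser's network tab, where entries such as `path` and `authority` are HTTP/2 pseudo-headers rather than real header fields, and where `content-length` describes the browser's body, not the one your client will send.

```python
# Hypothetical helper, not part of Scrapy: drop entries that should not be
# copied verbatim from the browser's network tab.
SKIP = {"content-length", "path", "authority", "method", "scheme"}


def clean_headers(copied):
    """Return a copy of `copied` without pseudo-headers and content-length."""
    cleaned = {}
    for name, value in copied.items():
        # Browsers display pseudo-headers as ":path", ":authority", and so on.
        key = name.lower().lstrip(":")
        if key in SKIP:
            continue
        cleaned[name] = value
    return cleaned


# Headers taken from the question above.
browser_headers = {
    "path": "https://www.the-academy.nl/Page?$$ajaxid=view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:tblView",
    "authority": "www.the-academy.nl",
    "accept-encoding": "gzip, deflate, br",
    "content-length": "1225",
    "content-type": "multipart/form-data",
}

print(clean_headers(browser_headers))
```

The resulting dictionary is what you would pass as `headers=` to the request.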

Script:

import scrapy
from scrapy.crawler import CrawlerProcess


class AspSpider(scrapy.Spider):
    name = 'asp'
    
    def start_requests(self):
        yield scrapy.FormRequest(
          
            url='https://www.the-academy.nl/zoekresultatenpagina?text=java',
            formdata= {
                'view:_id1:_id2:_id3:_id4:_id5:2:_id86:_id88:query': "",
                'view:_id1:_id2:_id3:_id4:_id5:3:_id94:_id96:query': "",
                '$$viewid': '!eaie1cfxpuckx0dbjrxsxrw60!',  # two dollar signs, as in the question's payload
                '$$xspsubmitid': 'view:_id1:_id2:_id3:_id199:_id200:0:_id201:_id203:viewPager__Next',
                '$$xspexecid': 'view:_id1:_id2:_id3:_id199:_id200:0:_id201:_id203:viewPager',
                '$$xspsubmitscroll': '0|1500',
                'view:_id1': 'view:_id1',
                '$$xspsubmitvalue': ""
                },
            callback=self.parse_item,
            headers={
                'accept': '*/*',
                'accept-encoding': 'gzip, deflate, br',
                'accept-language': 'en-US,en;q=0.9',
                'content-type': 'multipart/form-data; boundary=----WebKitFormBoundary2aCYMIdAcbwx4FjO',
                'referer': 'https://www.the-academy.nl/zoekresultatenpagina?text=java'
            },
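            method='POST'
        )

    def parse_item(self, response):
        # Placeholder: extract the data you need from the response here.
        pass


if __name__ == "__main__":
    # CrawlerProcess takes settings, not a spider; the spider class goes to crawl().
    process = CrawlerProcess()
    process.crawl(AspSpider)
    process.start()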

Output:

 DEBUG: Crawled (200) <POST https://www.the-academy.nl/zoekresultatenpagina?text=java> (referer: https://www.the-academy.nl/zoekresultatenpagina?text=java)
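The view id in the script is copied by hand, and the two view-id values in this thread differ, which suggests the value changes between page loads. A hedged sketch of the alternative, using only the standard library: read the hidden form fields from the page HTML instead of hard-coding them. `HiddenInputParser`, `hidden_fields` and `SAMPLE_HTML` are illustrative names, and the sample markup is an assumption about what such a form might look like, not the site's real HTML.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when written without "=value".
        input_type = (attr.get("type") or "").lower()
        if input_type == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in `html`."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Made-up markup for illustration only.
SAMPLE_HTML = """
<form id="view:_id1" method="post">
  <input type="text" name="query" value="java">
  <input type="hidden" name="$$viewid" value="!example-view-id!">
  <input type="hidden" name="$$xspsubmitid" value="">
</form>
"""

fields = hidden_fields(SAMPLE_HTML)
print(fields)
```

In Scrapy itself, `FormRequest.from_response(response, formdata={...})` does this collection for you: it pre-populates the request with the fields found in the response's form and lets `formdata` override individual values, such as the pager's submit id.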