I am a beginner at using Scrapy and I was trying to scrape this website https://directory.ntschools.net/#/schools which is using javascript to load the contents. So I checked the networks tab and there's an API address available https://directory.ntschools.net/api/System/GetAllSchools If you open this address, the data is in XML format. But when you check the response tab while inspecting the network tab, the data is there in json format.
I first tried using Scrapy, sent the request to the API address WITHOUT any headers and the response that it returned was in XML which was throwing JSONDecode error upon using json.loads(). So I used the header 'Accept' : 'application/json' and the response I got was in JSON. That worked well
import scrapy
import json
import requests
class NtseSpider_new(scrapy.Spider):
name = 'ntse_new'
header = {
'Accept': 'application/json',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.56',
}
def start_requests(self):
yield scrapy.Request('https://directory.ntschools.net/api/System/GetAllSchools',callback=self.parse,headers=self.header)
def parse(self,response):
data = json.loads(response.body) #returned json response
But then I used the requests module WITHOUT any headers and the response I got was in JSON too!
import requests
import json
res = requests.get('https://directory.ntschools.net/api/System/GetAllSchools')
js = json.loads(res.content) #returned json response
Can anyone please tell me if there's any difference between both the types of requests? Is there a default response format for requests module when making a request to an API? Surely, I am missing something? Thanks
CodePudding user response:
It's because Scrapy sets the Accept
header to 'text/html,application/xhtml xml,application/xml ...'. You can see that from this.
I experimented and found that server sends a JSON response if the request has no Accept
header.