Scrape data from API that returns 500 in Beautifulsoup


I've been stuck on this for hours now. I'm trying to scrape data from this link.

The primary aim is to get the school names and their email IDs. To do that, I first need the list of schools, which doesn't load on that page because it comes from an API.

The data loads from an API whose link is this.
The request method is POST, and sending a POST request returns a 500.

Proof of work:

import requests

def data_fetch(url="https://scholenopdekaart.nl/api/v1/search/"):
    headers = {
        'accept': 'application/json, text/plain, */*',
        'content-type': 'application/json',
        'referer': 'https://scholenopdekaart.nl/zoeken/middelbare-scholen/?zoektermen=rotterdam&weergave=Lijst',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    }
    response = requests.post(url, headers=headers)
    print(response)  # <Response [500]>

I was able to scrape data from the other links that used GET requests, while this POST request returns nothing. Am I missing something here? I even tried putting all of these in the headers and still got a 500:

headers = {
    'accept': 'application/json, text/plain, */*',
    'accept-encoding': 'gzip, deflate, br',
    'content-type': 'application/json',
    'dnt': '1',
    'origin': 'https://scholenopdekaart.nl',
    'referer': 'https://scholenopdekaart.nl/zoeken/middelbare-scholen/?zoektermen=rotterdam&weergave=Lijst',
    'sec-ch-ua': '"Chromium";v="106", "Brave";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': 'Windows',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'sec-gpc': '1'
}

CodePudding user response:

I'd recommend using Google Chrome's Network tab to copy the request as a cURL command. Then, using a cURL-to-Python converter, you can import that cURL command and generate Python Requests code with all the data.

Since the response is pure JSON, there's no need for bs4.

Consider the following example:

import requests
import json

url = "https://scholenopdekaart.nl/api/v1/search/"

payload = json.dumps({
  "zoekterm": "rotterdam",
  "sectorKeuze": 1,
  "weergave": "Lijst"
})
headers = {
  'authority': 'scholenopdekaart.nl',
  'accept': 'application/json, text/plain, */*',
  'cache-control': 'no-cache',
  'content-type': 'application/json',
  'origin': 'https://scholenopdekaart.nl',
  'pragma': 'no-cache'
}

response = requests.request("POST", url, headers=headers, data=payload)

data = json.loads(response.text)

for school in data['scholen']:
    print(f"{school['bisId']} \t\t {school['naam']}")

This will output:

25592        VSO Op Zuid
25938        Op Noord
569          Accent Praktijkonderwijs Centrum
572          Accent PRO Delfshaven
574          Marnix Gymnasium
578          Portus Zuidermavo-havo
579          Portus Juliana
580          CSG Calvijn vestiging Meerpaal
582          CBSplus, school voor havo en mavo
588          Melanchthon Schiebroek
589          Melanchthon Wilgenplaslaan
4452         Melanchthon Mavo Schiebroek
594          Melanchthon Kralingen
604          Comenius Dalton Rotterdam
26152        Zuider Gymnasium
... and some more ...
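
If you want to reuse the call for other search terms, here is a minimal sketch built on the payload above. The fetch_schools helper name is mine, and raise_for_status() is only there so a 4xx/5xx fails loudly instead of silently parsing an error page:

import requests

API_URL = "https://scholenopdekaart.nl/api/v1/search/"  # same endpoint as above

def fetch_schools(zoekterm):
    """Run the search POST and return the list of schools (hypothetical helper)."""
    payload = {
        "zoekterm": zoekterm,
        "sectorKeuze": 1,       # value taken from the example above
        "weergave": "Lijst",
    }
    # json= serialises the body and sets the content-type header in one go
    response = requests.post(API_URL, json=payload)
    response.raise_for_status()  # raise on a 4xx/5xx response
    return response.json()["scholen"]

for school in fetch_schools("rotterdam"):
    print(school["bisId"], school["naam"])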

CodePudding user response:

Expanding on @0stone0's answer, this works with even less code:

import requests
import json
import pandas as pd

json_data = {
    'zoekterm': 'rotterdam',
    'sectorKeuze': 1,
    'weergave': 'Lijst',
}

response = requests.post('https://scholenopdekaart.nl/api/v1/search/', 
                          json=json_data)

data = json.loads(response.content)
df = pd.DataFrame(data["scholen"])
df[["bisId", "naam"]].head()

Output:

    bisId   naam
0   25592   VSO Op Zuid
1   25938   Op Noord
2   569     Accent Praktijkonderwijs Centrum
3   572     Accent PRO Delfshaven
4   574     Marnix Gymnasium
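
If you want the result on disk, the same DataFrame can be written straight to CSV (the filename below is arbitrary):

# Persist the selected columns for later use
df[["bisId", "naam"]].to_csv("scholen_rotterdam.csv", index=False)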