I've been stuck on this for hours now. I'm trying to scrape data from this link.
My primary aim is to get the school names and their email addresses. To do that, I first need the list of schools, which isn't loading here because it comes from the API.
The data loads from an API whose link is this.
The request method is POST, and sending a POST request returns 500.
Proof of work:
import requests

def data_fetch(url):
    headers = {
        'accept': 'application/json, text/plain, */*',
        'content-type': 'application/json',
        'referer': 'https://scholenopdekaart.nl/zoeken/middelbare-scholen/?zoektermen=rotterdam&weergave=Lijst',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    }
    response = requests.post(url, headers=headers)
    print(response)  # <Response [500]>

data_fetch("https://scholenopdekaart.nl/api/v1/search/")
I was able to scrape data from the other links that use GET requests, but this POST request returns nothing. Am I missing something here? I even tried putting all of these in the headers and still got 500.
headers = {
    'accept': 'application/json, text/plain, */*',
    'accept-encoding': 'gzip, deflate, br',
    'content-type': 'application/json',
    'dnt': '1',
    'origin': 'https://scholenopdekaart.nl',
    'referer': 'https://scholenopdekaart.nl/zoeken/middelbare-scholen/?zoektermen=rotterdam&weergave=Lijst',
    'sec-ch-ua': '"Chromium";v="106", "Brave";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'sec-gpc': '1',
}
CodePudding user response:
I'd recommend using Google Chrome's Network tab to copy the request as a cURL command. Then, using Postman, you can import that cURL command and generate Python Requests code with all the data.
Since the response is pure JSON, there's no need for bs4.
Consider the following example:
import requests
import json

url = "https://scholenopdekaart.nl/api/v1/search/"

payload = json.dumps({
    "zoekterm": "rotterdam",
    "sectorKeuze": 1,
    "weergave": "Lijst"
})

headers = {
    'authority': 'scholenopdekaart.nl',
    'accept': 'application/json, text/plain, */*',
    'cache-control': 'no-cache',
    'content-type': 'application/json',
    'origin': 'https://scholenopdekaart.nl',
    'pragma': 'no-cache'
}

response = requests.request("POST", url, headers=headers, data=payload)
data = json.loads(response.text)

for school in data['scholen']:
    print(f"{school['bisId']} \t\t {school['naam']}")
This will output:
25592 VSO Op Zuid
25938 Op Noord
569 Accent Praktijkonderwijs Centrum
572 Accent PRO Delfshaven
574 Marnix Gymnasium
578 Portus Zuidermavo-havo
579 Portus Juliana
580 CSG Calvijn vestiging Meerpaal
582 CBSplus, school voor havo en mavo
588 Melanchthon Schiebroek
589 Melanchthon Wilgenplaslaan
4452 Melanchthon Mavo Schiebroek
594 Melanchthon Kralingen
604 Comenius Dalton Rotterdam
26152 Zuider Gymnasium
... and some more ...
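As a side note, the 500 in the question most likely comes from posting without a JSON body. If you want the script to fail loudly instead of handing an error page to json.loads, you can check the status before parsing. A minimal sketch of the same request with that check (same endpoint and payload as above):

import requests

url = "https://scholenopdekaart.nl/api/v1/search/"

# Same search payload as above; with the json= argument, requests
# serializes it and sets the Content-Type header for us.
payload = {
    "zoekterm": "rotterdam",
    "sectorKeuze": 1,
    "weergave": "Lijst",
}

response = requests.post(url, json=payload)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx (e.g. the 500 above)

data = response.json()       # parse the JSON body directly
print(len(data["scholen"]), "schools found")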
CodePudding user response:
Expanding on @0stone0's answer, this works with even less code:
import requests
import json
import pandas as pd

json_data = {
    'zoekterm': 'rotterdam',
    'sectorKeuze': 1,
    'weergave': 'Lijst',
}

response = requests.post('https://scholenopdekaart.nl/api/v1/search/',
                         json=json_data)
data = json.loads(response.content)

df = pd.DataFrame(data["scholen"])
df[["bisId", "naam"]].head()
Output:
   bisId                              naam
0  25592                       VSO Op Zuid
1  25938                          Op Noord
2    569  Accent Praktijkonderwijs Centrum
3    572             Accent PRO Delfshaven
4    574                  Marnix Gymnasium
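If the goal is to collect the results for later use (for example, before going after the email addresses per school), the DataFrame can be written straight to disk. A small follow-up sketch, assuming the df from above; the file name is just an example:

# Keep only the columns of interest and write them to a CSV file.
df[["bisId", "naam"]].to_csv("scholen_rotterdam.csv", index=False)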