New day, new link, new problem.
Link: https://schoolinfo.leicester.gov.uk/DirectorySearch.html
The above link is just an HTML page with a form embedded in it. Clicking the Search button loads the list of schools, with each school name being a hyperlink.
Through inspection, I found the request URL that I assumed would have all the data in it. But that's not the case: opening that URL loads a JSON page that doesn't contain the data I need.
I tried combining it with the main URL attached above (see the code below), which gave me the same result (which, I believe, is expected). I also scraped another URL, which is its API, but couldn't access the data.
I want to access the table, but in vain. What type of problem is this? Does it have a name?
import requests


def data_fetch(request_url, url_main):
    headers = {
        "Accept": '*/*',
        "Host": 'schoolinfo.leicester.gov.uk',
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        "Referer": url_main,
        "X-Requested-With": "XMLHttpRequest"
    }
    soup = requests.get(request_url, headers=headers).json()
    print(soup)


def main():
    request_url = "https://schoolinfo.leicester.gov.uk/api/schoolsearch/?schoolPhase=&schoolName=&pn=0&pa=900&schoolType="
    url_main = "https://schoolinfo.leicester.gov.uk/DirectorySearch.html"
    data_fetch(request_url, url_main)


if __name__ == "__main__":
    main()
While I expected the above code to give me the table (which, in reality, it shouldn't), it gives the same output as the code below.
import requests
from bs4 import BeautifulSoup


def url_parser(url):
    # fetch the raw response body and parse it with BeautifulSoup
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, 'lxml')
    return soup


url = "https://schoolinfo.leicester.gov.uk/api/schoolsearch/?schoolPhase=&schoolName=&pn=0&pa=900&schoolType="
print(url_parser(url))
This gives the following output (I cut some of it to stay within the body size limit):
[{"SchoolName":"Abbey Mead Primary Academy","BaseID":1,"SchoolType":"Academy","DfeNumber":null,"Ward":null,"RoleDesc":null,"Head":null,"Address":null,"Postcode":null,"BaseURL":null,"SchoolInspectionsURL":null,"SchoolReportsURL":null,"SchoolVacanciesURL":null,"Telephone":null,"Fax":null,"Email":null,"Sponsor":null,"UPRN":null,"SchoolPhase":"Primary","MapUrl":null,"AgeRange":null,"OverSubscribed":null,"ArrangementURL":null,"PAN":null},..."RoleDesc":null,"Head":null,"Address":null,"Postcode":null,"BaseURL":null,"SchoolInspectionsURL":null,"SchoolReportsURL":null,"SchoolVacanciesURL":null,"Telephone":null,"Fax":null,"Email":null,"Sponsor":null,"UPRN":null,"SchoolPhase":"Primary","MapUrl":null,"AgeRange":null,"OverSubscribed":null,"ArrangementURL":null,"PAN":null}]
which is useless.
Is there any way for me to access the link, or any resource I can look into?
Edit: After getting this output, I thought of using the BaseID with the school links (https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=1), where r=1 is the BaseID. I then spot-checked whether everything was correct: the total shown is 87, which I don't believe is true. Also, I can't just run a for loop from 1 to 87, as there are many discrepancies, such as some BaseIDs being 13k, 100, etc.
CodePudding user response:
The source of e.g. https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=1 (first entry) is https://schoolinfo.leicester.gov.uk/api/schools/1, where 1 is a reference to the BaseID. On how to find this source, see the Update below. So, you can try something as follows:
import requests
import json

search = 'https://schoolinfo.leicester.gov.uk/api/schoolsearch/?schoolPhase=&schoolName=&pn=0&pa=900&schoolType='

# get list with dicts for each school
data = json.loads(requests.get(search).json())

# use key `BaseID` to get the ID for each school
ids = [i['BaseID'] for i in data]

# create empty `list` to store dicts for each school
schools = list()

# iterate over the `IDs` (use `i(ndex)` to keep track of entry for error message)
for i, _id in enumerate(ids):
    # create url string with `ID` added at end
    api = f'https://schoolinfo.leicester.gov.uk/api/schools/{_id}'
    # try to fetch data, on error: print error message
    try:
        school = json.loads(requests.get(api).json())
        # append school `dict` to list of schools
        schools.append(school)
    except Exception:
        print(f'No data for school \'{data[i]["SchoolName"]}\' (ID: {_id})')
This will result in a list with 110 dictionaries, one for each school. The data for "Hope Hamilton C of E Primary School" (which is listed twice in the search results; it is the only duplicate) is missing. Indeed, if you click the link, the entry is just empty. So, the above will print:
No data for school 'Hope Hamilton C of E Primary School' (ID: 609)
No data for school 'Hope Hamilton C of E Primary School' (ID: 609)
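A side note on the nested `json.loads(requests.get(...).json())` call used above: it suggests that the endpoint returns JSON wrapped inside a JSON string, so a single decode only yields a `str`. The following is a minimal offline sketch of that idea, not the site's documented behaviour; `decode_payload` is a hypothetical helper name and the sample payload is a trimmed stand-in built from the first search entry.

```python
import json


def decode_payload(text):
    """Decode JSON text that may have been JSON-encoded once or twice."""
    data = json.loads(text)
    # If the first pass still yields a string, assume it is JSON wrapped
    # in a JSON string and decode it a second time.
    if isinstance(data, str):
        data = json.loads(data)
    return data


# Simulated responses (assumption: trimmed stand-in for the real payload)
inner = json.dumps([{"SchoolName": "Abbey Mead Primary Academy", "BaseID": 1}])
double_encoded = json.dumps(inner)

print(decode_payload(double_encoded)[0]["BaseID"])
print(decode_payload(inner)[0]["BaseID"])
```

With a real response, something like `decode_payload(requests.get(api).text)` should then behave the same whether or not the server double-encodes.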
You could create a df (pandas DataFrame) with this info. E.g.
import pandas as pd

df = pd.DataFrame(schools)
print(df.iloc[0])
SchoolName Abbey Mead Primary Academy
BaseID 1
SchoolType Academy
DfeNumber 2337
Ward Belgrave
RoleDesc Head Principal
Head Mr Gary Aldred
Address Abbey Mead Primary Academy, 109 Ross Walk, Lei...
Postcode LE4 5HH
BaseURL http://www.abbey-tmet.uk
SchoolInspectionsURL https://reports.ofsted.gov.uk/provider/21/147148
SchoolReportsURL https://www.compare-school-performance.service...
SchoolVacanciesURL https://www.eteach.com/microsite/ourjobs.aspx?...
Telephone 0116 266 1809
Fax 0116 2611543
Email [email protected]
Sponsor The Mead Educational Trust
UPRN 147148
SchoolPhase Primary
MapUrl <div id="sample" style="width: 500px; height: ..."
AgeRange 3-11
OverSubscribed Yes
ArrangementURL https://www.leicester.gov.uk/schools-and-learn...
PAN Reception: 90
Name: 0, dtype: object
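If you would rather end up with a file than a DataFrame, the same list of dicts can also be written out using only the standard library. This is a rough sketch under assumptions: `schools_to_csv` is a hypothetical helper name, and the single sample row is a trimmed copy of the first entry shown above, standing in for the full `schools` list.

```python
import csv
import io


def schools_to_csv(schools, out):
    """Write a list of school dicts to the file-like object `out` as CSV."""
    # Collect column names in first-seen order, in case some dicts
    # have keys that others lack.
    fieldnames = []
    for school in schools:
        for key in school:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(schools)


# Trimmed stand-in for the `schools` list built earlier
sample = [
    {"SchoolName": "Abbey Mead Primary Academy", "BaseID": 1, "Postcode": "LE4 5HH"},
]
buffer = io.StringIO()
schools_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the `StringIO` buffer.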
Update
Let me just add how you can figure out that https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=1 is based on https://schoolinfo.leicester.gov.uk/api/schools/1:
- First go to the url https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=1 in Chrome. Open Chrome DevTools (Ctrl+Shift+J), and select "Sources" from the navigation bar.
- To the left, you should now see a folder. We are interested in the first subfolder "(cloud) https://schoolinfo.leicester.gov.uk/". Inside it, navigate to "Scripts" and open the file "script.js".
- Use Ctrl+F to open the search bar, and here I just searched for "api". You'll get 1 match:
{
    url: "/api/schools/" + uniqueRef,
    success: function (result) {
        sObj = JSON.parse(result);
    }
That's the info we want!
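The manual search above can also be approximated in code once you have the script's text. The sketch below only scans a string for quoted literals starting with `/api/`; it does not download anything, `find_api_paths` is a hypothetical helper name, and the sample is the excerpt shown above rather than the full file.

```python
import re


def find_api_paths(js_source):
    """Return distinct quoted string literals in js_source that start with /api/."""
    pattern = re.compile(r"""["'](/api/[^"']*)["']""")
    found = []
    for match in pattern.finditer(js_source):
        path = match.group(1)
        if path not in found:
            found.append(path)
    return found


# Excerpt from the script as quoted above (not the whole file)
sample_js = '''
{
    url: "/api/schools/" + uniqueRef,
    success: function (result) {
        sObj = JSON.parse(result);
    }
'''

print(find_api_paths(sample_js))
```

For the excerpt this finds the single `/api/schools/` prefix, which matches the one hit from the DevTools search. Paths that are assembled from several variables would not be caught by a simple pattern like this.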
CodePudding user response:
You can try:
import json
import requests

api_url = "https://schoolinfo.leicester.gov.uk/api/schoolsearch/"
params = {
    "schoolPhase": "",
    "schoolName": "",
    "pn": "0",
    "pa": "900",
    "schoolType": "",
}

data = json.loads(requests.get(api_url, params=params).json())

for d in data:
    print(
        "{:<50} {:<40} {:<30} {}".format(
            d["SchoolName"],
            d["SchoolType"],
            d["SchoolPhase"],
            f'https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r={d["BaseID"]}',
        )
    )
Prints:
...
Thurnby Mead Primary Academy                       Academy                                  Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=80
Tudor Grange Samworth Academy, A C of E School     Academy Sponsor Led                      Primary and Secondary          https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=14031
Uplands Infant School                              Academy                                  Infant                         https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=81
Uplands Junior L.E.A.D Academy                     Academy Sponsor Led                      Junior                         https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=13052
West Gate School                                   Local Authority maintained               Special                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=1236
Whitehall Primary School                           Local Authority maintained               Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=83
Willowbrook Mead Primary Academy                   Academy                                  Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=84
Wolsey House Primary School                        Local Authority maintained               Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=85
Woodstock Primary Academy                          Academy                                  Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=86
Wyvern Primary School                              Local Authority maintained               Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=87
To get info about a school, open the link and grab the required data.
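Since the other answer found that each SchoolDetails.html page is backed by /api/schools/<BaseID>, one option is to translate a details link into that API URL instead of parsing the rendered page. A small sketch, assuming the `r` query parameter always carries the BaseID; `api_url_from_details_link` and `API_TEMPLATE` are hypothetical names.

```python
from urllib.parse import urlparse, parse_qs

# URL pattern taken from the other answer in this thread
API_TEMPLATE = "https://schoolinfo.leicester.gov.uk/api/schools/{}"


def api_url_from_details_link(link):
    """Build the per-school API URL from a SchoolDetails.html?r=<BaseID> link."""
    query = parse_qs(urlparse(link).query)
    # parse_qs returns a list of values per key; take the first `r`
    base_id = query["r"][0]
    return API_TEMPLATE.format(base_id)


link = "https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=14031"
print(api_url_from_details_link(link))
```

The resulting URL can then be fetched and decoded the same way as the search endpoint.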