Fetch URL hidden under the search button that is inside a form [Beautifulsoup]

Time:10-09

New day, new link, new problem.
Link: https://schoolinfo.leicester.gov.uk/DirectorySearch.html
The link above is just an HTML page with a form embedded in it. Clicking the Search button loads the list of schools, with each school name shown as a hyperlink.
Through inspection, I found the request URL that I thought would contain all the data, but it does not: opening it loads a JSON page without the data I need.
I tried combining it with the main URL above, which gave me the same result (as expected, I believe). I also tried scraping another URL that appears to be its API, but I could not access it.

I want to access the table, but in vain. What type of problem is this? Does it have a name?

import requests


def data_fetch(request_url, url_main):
    headers = {
        "Accept": '*/*',
        "Host": 'schoolinfo.leicester.gov.uk',
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        "Referer": url_main,
        "X-Requested-With": "XMLHttpRequest"
    }
    soup = requests.get(request_url, headers=headers).json()
    print(soup)


def main():
    request_url = "https://schoolinfo.leicester.gov.uk/api/schoolsearch/?schoolPhase=&schoolName=&pn=0&pa=900&schoolType="
    url_main = "https://schoolinfo.leicester.gov.uk/DirectorySearch.html"
    data_fetch(request_url, url_main)


if __name__ == "__main__":
    main()

I expected the code above to give me the table (which, realistically, it shouldn't), but it gives the same output as the code below.

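import requests
from bs4 import BeautifulSoup


def url_parser(url):
    # Fetch the page and parse it with the lxml parser (requires the lxml package)
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, 'lxml')
    return soup

url = "https://schoolinfo.leicester.gov.uk/api/schoolsearch/?schoolPhase=&schoolName=&pn=0&pa=900&schoolType="
print(url_parser(url))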

This gives the following output (truncated to stay within the body size limit):

[{"SchoolName":"Abbey Mead Primary Academy","BaseID":1,"SchoolType":"Academy","DfeNumber":null,"Ward":null,"RoleDesc":null,"Head":null,"Address":null,"Postcode":null,"BaseURL":null,"SchoolInspectionsURL":null,"SchoolReportsURL":null,"SchoolVacanciesURL":null,"Telephone":null,"Fax":null,"Email":null,"Sponsor":null,"UPRN":null,"SchoolPhase":"Primary","MapUrl":null,"AgeRange":null,"OverSubscribed":null,"ArrangementURL":null,"PAN":null},..."RoleDesc":null,"Head":null,"Address":null,"Postcode":null,"BaseURL":null,"SchoolInspectionsURL":null,"SchoolReportsURL":null,"SchoolVacanciesURL":null,"Telephone":null,"Fax":null,"Email":null,"Sponsor":null,"UPRN":null,"SchoolPhase":"Primary","MapUrl":null,"AgeRange":null,"OverSubscribed":null,"ArrangementURL":null,"PAN":null}]

which is useless.
Is there any way for me to access the link, or any resource I can look into?

Edit: Given this output, I thought of using the BaseID with the school links (https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=1), where r=1 is the BaseID. I then spot-checked the results: the total shown is 87, which I don't believe is correct. Also, I can't simply run a for loop from 1 to 87, because there are many discrepancies, such as some BaseIDs being 13k, 100, etc.

CodePudding user response:

The source of e.g. https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=1 (first entry) is https://schoolinfo.leicester.gov.uk/api/schools/1, where 1 is a reference to the BaseID. On how to find this source, see Update below. So, you can try something as follows:

import requests
import json

search = 'https://schoolinfo.leicester.gov.uk/api/schoolsearch/?schoolPhase=&schoolName=&pn=0&pa=900&schoolType='

# get list with dicts for each school
data = json.loads(requests.get(search).json())

# use key `BaseID` to get the ID for each school
ids = [i['BaseID'] for i in data]

# create empty `list` to store dicts for each school
schools = list()

# iterate over the `IDs` (use `i(ndex)` to keep track of entry for error message)
for i, _id in enumerate(ids):
    
    # create url string with `ID` added at end
    api = f'https://schoolinfo.leicester.gov.uk/api/schools/{_id}'
    
    # try to fetch data, on error: print error message
    try:
        school = json.loads(requests.get(api).json())

        # append school `dict` to list of schools
        schools.append(school)
    except Exception:
        print(f'No data for school \'{data[i]["SchoolName"]}\' (ID: {_id})')

This will result in a list of 110 dictionaries, one for each school. The data for "Hope Hamilton C of E Primary School" (listed twice; it is the only duplicate) is missing. Indeed, if you click the link, the entry is just empty. So, the above will print:

No data for school 'Hope Hamilton C of E Primary School' (ID: 609)
No data for school 'Hope Hamilton C of E Primary School' (ID: 609)
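
Because the duplicated entry makes the loop request the same ID twice, one option is to drop repeated IDs before fetching. A minimal sketch, assuming the IDs are plain hashable values; the helper name `unique_in_order` and the sample list are made up for illustration:

```python
def unique_in_order(ids):
    """Return the IDs with repeats dropped, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(ids))


# Hypothetical sample; with the code above, the input would be `ids`.
sample_ids = [1, 609, 13052, 609]
print(unique_in_order(sample_ids))  # [1, 609, 13052]
```

With this in place, the message for ID 609 would be printed once rather than twice.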

You could create a df with this info. E.g.

import pandas as pd

df = pd.DataFrame(schools)

print(df.iloc[0])

SchoolName                                     Abbey Mead Primary Academy
BaseID                                                                  1
SchoolType                                                        Academy
DfeNumber                                                            2337
Ward                                                             Belgrave
RoleDesc                                                   Head Principal
Head                                                       Mr Gary Aldred
Address                 Abbey Mead Primary Academy, 109 Ross Walk, Lei...
Postcode                                                          LE4 5HH
BaseURL                                          http://www.abbey-tmet.uk
SchoolInspectionsURL     https://reports.ofsted.gov.uk/provider/21/147148
SchoolReportsURL        https://www.compare-school-performance.service...
SchoolVacanciesURL      https://www.eteach.com/microsite/ourjobs.aspx?...
Telephone                                                   0116 266 1809
Fax                                                          0116 2611543
Email                                       [email protected]
Sponsor                                        The Mead Educational Trust
UPRN                                                               147148
SchoolPhase                                                       Primary
MapUrl                 <div id="sample" style="width: 500px; height: ..."
AgeRange                                                             3-11
OverSubscribed                                                        Yes
ArrangementURL          https://www.leicester.gov.uk/schools-and-learn...
PAN                                                         Reception: 90
Name: 0, dtype: object

Update

Let me just add how you can figure out that https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=1 is based on https://schoolinfo.leicester.gov.uk/api/schools/1:

  1. First go to the url https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=1 in Chrome. Open Chrome DevTools (Ctrl+Shift+J), and select "Sources" from the navigation bar.
  2. To the left, you should now see a folder. We are interested in the first subfolder "(cloud) https://schoolinfo.leicester.gov.uk/". Inside it, navigate to "Scripts" and open the file "script.js".
  3. Use "Ctrl+F" to open the search bar, and here I just searched for "api". You'll get 1 match:
{
url: "/api/schools/" + uniqueRef,
success: function (result) {
    sObj = JSON.parse(result);
}

That's the info we want!
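
The `JSON.parse(result)` call in that script suggests the API hands back a JSON string that itself contains JSON, which would also explain why the Python code above calls `json.loads` on the result of `.json()`. A minimal sketch of a decoder that tolerates both cases, assuming this double encoding; the helper name `decode_payload` and the sample body are illustrative, not part of the site's API:

```python
import json


def decode_payload(text):
    """Decode a response body that may be JSON-encoded once or twice."""
    value = json.loads(text)
    # If the first pass yields a string, treat it as JSON nested in JSON
    if isinstance(value, str):
        value = json.loads(value)
    return value


# Hypothetical double-encoded body, shaped like one record from the API
body = json.dumps(json.dumps({"SchoolName": "Abbey Mead Primary Academy", "BaseID": 1}))
print(decode_payload(body)["BaseID"])  # 1
```

Such a helper would be applied to `requests.get(api).text` instead of chaining `.json()` and `json.loads`.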

CodePudding user response:

You can try:

import json
import requests

api_url = "https://schoolinfo.leicester.gov.uk/api/schoolsearch/"

params = {
    "schoolPhase": "",
    "schoolName": "",
    "pn": "0",
    "pa": "900",
    "schoolType": "",
}

data = json.loads(requests.get(api_url, params=params).json())

for d in data:
    print(
        "{:<50} {:<40} {:<30} {}".format(
            d["SchoolName"],
            d["SchoolType"],
            d["SchoolPhase"],
            f'https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r={d["BaseID"]}',
        )
    )

Prints:


...

Thurnby Mead Primary Academy                       Academy                                  Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=80
Tudor Grange Samworth Academy, A C of  E School    Academy Sponsor Led                      Primary and Secondary          https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=14031
Uplands Infant School                              Academy                                  Infant                         https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=81
Uplands Junior L.E.A.D Academy                     Academy Sponsor Led                      Junior                         https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=13052
West Gate School                                   Local Authority maintained               Special                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=1236
Whitehall Primary School                           Local Authority maintained               Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=83
Willowbrook Mead Primary Academy                   Academy                                  Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=84
Wolsey House Primary School                        Local Authority maintained               Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=85
Woodstock Primary Academy                          Academy                                  Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=86
Wyvern Primary School                              Local Authority maintained               Primary                        https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=87

To get info about a school, open the link and grab required data.
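
Since the first answer in this thread traces the details page back to `/api/schools/<BaseID>`, "opening the link" with `requests` would likely mean requesting that endpoint rather than the HTML page. A minimal sketch that maps a details link to the endpoint, assuming the `r` query parameter always carries the BaseID; the helper name `api_url_for` is made up for illustration:

```python
from urllib.parse import parse_qs, urlparse


def api_url_for(details_url):
    """Build the /api/schools/<BaseID> URL from a SchoolDetails.html?r=<BaseID> link."""
    parts = urlparse(details_url)
    # Assumption: the `r` query parameter holds the school's BaseID
    base_id = parse_qs(parts.query)["r"][0]
    return f"{parts.scheme}://{parts.netloc}/api/schools/{base_id}"


print(api_url_for("https://schoolinfo.leicester.gov.uk/SchoolDetails.html?r=14031"))
# https://schoolinfo.leicester.gov.uk/api/schools/14031
```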
