Scraping an HTML site using BeautifulSoup and finding the value of "total

I'm writing a Python script that scrapes the following website and looks for the value of "total_pages" in it.

The website is https://www.usnews.com/best-colleges/fl

When I open the website in a browser and inspect the page source, the value of "total_pages" is 8. I want my Python code to retrieve the same value.

I have written the following code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

main_site = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers)
main_site_content = main_site.content
main_site_content_soup = BeautifulSoup(main_site_content, "html.parser")

But then I get stuck on how to look for "total_pages" in the parsed data. I have tried the find_all() method, but with no luck. I think I'm not using the method correctly.

One note: the solution does not have to use BeautifulSoup. I only used it because I am somewhat familiar with it.

Answer:

No need for BeautifulSoup. Here I make a request to their API to get the list of universities.

from rich import print is used to pretty-print the JSON. It should make it easier to read.

If you need more help or advice, leave a comment below.

import pandas as pd  # DataFrame.to_markdown() also needs the optional "tabulate" package
import requests
from rich import print

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"


def get_data(url):
    print("Making request to:", url)
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        print("Request Successful!")
        data = response.json()["data"]
        return data["items"], data["next_link"]
    print("Request failed!")
    return None, None


def main():
    print("Starting Scraping...")
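    items, next_link = get_data(LINK)
    items = items or []  # get_data returns (None, None) when a request fails

    # Follow `next_link` until there is none, accumulating the items from each page.
    while next_link is not None:
        print("Getting data from:", next_link)
        new_items, next_link = get_data(next_link)
        if new_items:
            items += new_items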

    # Keep only the fields needed for the pandas DataFrame.
    items = [
        {
            "name": item["institution"]["displayName"],
            "state": item["institution"]["state"],
            "city": item["institution"]["city"],
        }
        for item in items
    ]
    df = pd.DataFrame(items)
    print(df.to_markdown())


if __name__ == "__main__":
    main()

The output looks like this:

	name	state	city
0	University of Florida	FL	Gainesville
1	Florida State University	FL	Tallahassee
2	University of Miami	FL	Coral Gables
3	University of South Florida	FL	Tampa
4	University of Central Florida	FL	Orlando
5	Florida International University	FL	Miami
6	Florida A&M University	FL	Tallahassee
7	Florida Institute of Technology	FL	Melbourne
8	Nova Southeastern University	FL	Ft. Lauderdale
...	...	...	...
74	St. John Vianney College Seminary	FL	Miami
75	St. Petersburg College	FL	St. Petersburg
76	Tallahassee Community College	FL	Tallahassee
77	Valencia College	FL	Orlando
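
If you still want the "total_pages" value itself rather than the full list, here is a minimal sketch. It assumes the value sits in the page source as a JSON-style key/value pair (something like "total_pages": 8), most likely inside a script block. That would also explain why find_all() found nothing: it matches tags and attributes, not text buried inside a script. The SAMPLE_HTML string and the find_total_pages helper are made up for illustration; the real page's markup may differ.

```python
import re

# Hypothetical stand-in for the page source; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<script>window.__STATE__ = {"search": {"total_pages": 8, "page": 1}};</script>
</body></html>
"""

# Matches "total_pages": <integer>, allowing whitespace around the colon.
TOTAL_PAGES_RE = re.compile(r'"total_pages"\s*:\s*(\d+)')


def find_total_pages(html):
    """Return the first total_pages value found in the HTML text, or None."""
    match = TOTAL_PAGES_RE.search(html)
    return int(match.group(1)) if match else None


print(find_total_pages(SAMPLE_HTML))  # prints 8
```

With the live page you could pass requests.get(url, headers=headers).text to the same function. Whether the real source contains the value in exactly this form is an assumption, so check it against what you see in the browser's page source first.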