Scraping an HTML site using BeautifulSoup and finding the value of "total_pages" in it


I'm writing a Python script that scrapes the following website and looks for the value of "total_pages" in it.

The website is https://www.usnews.com/best-colleges/fl

When I open the website in a browser and inspect the source code, the value of "total_pages" is 8. I want my Python code to be able to get the same value.

I have written the following code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

main_site = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers)
main_site_content = main_site.content
main_site_content_soup = BeautifulSoup(main_site_content, "html.parser")

But then I get stuck on how to look for "total_pages" in the parsed data. I have tried the find_all() method, but with no luck; I think I'm not using it correctly.

One note: the solution does not have to use BeautifulSoup. I only used BeautifulSoup because I'm somewhat familiar with it.
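
To make the goal concrete, this is roughly the lookup I have in mind (just a sketch, not code I have verified: I'm guessing that "total_pages" sits as plain text inside one of the page's <script> tags, which would explain why find_all() on a tag name returns nothing):

import re

for script in main_site_content_soup.find_all("script"):
    # "total_pages" is text inside a script, not an element, so search the text.
    match = re.search(r'"total_pages"\s*:\s*(\d+)', script.get_text())
    if match:
        print(int(match.group(1)))  # hoping to see 8 here
        break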

CodePudding user response:

No need for BeautifulSoup. Here I make requests to their search API to get the list of universities, follow the pagination, and load the results into a pandas DataFrame.

from rich import print is used to pretty-print the output; it should make it easier to read.

If you need more help or advice, leave a comment below.

import pandas as pd
import requests
from rich import print

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"


def get_data(url):
    print("Making request to:", url)
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        print("Request Successful!")
        data = response.json()["data"]
        return data["items"], data["next_link"]
    print("Request failed!")
    return None, None


def main():
    print("Starting Scraping...")
    items, next_link = get_data(LINK)

    # as long as there is a `next_link`, keep following it.
    while next_link is not None:
        print("Getting data from:", next_link)
        new_items, next_link = get_data(next_link)
        if new_items:
            items += new_items

    # cleaning the data, for the pandas dataframe.
    items = [
        {
            "name": item["institution"]["displayName"],
            "state": item["institution"]["state"],
            "city": item["institution"]["city"],
        }
        for item in items
    ]
    df = pd.DataFrame(items)
    print(df.to_markdown())


if __name__ == "__main__":
    main()

The output looks like this:

|     | name                              | state | city           |
|----:|:----------------------------------|:------|:---------------|
|   0 | University of Florida             | FL    | Gainesville    |
|   1 | Florida State University          | FL    | Tallahassee    |
|   2 | University of Miami               | FL    | Coral Gables   |
|   3 | University of South Florida       | FL    | Tampa          |
|   4 | University of Central Florida     | FL    | Orlando        |
|   5 | Florida International University  | FL    | Miami          |
|   6 | Florida A&M University            | FL    | Tallahassee    |
|   7 | Florida Institute of Technology   | FL    | Melbourne      |
|   8 | Nova Southeastern University      | FL    | Ft. Lauderdale |
| ... | ...                               | ...   | ...            |
|  74 | St. John Vianney College Seminary | FL    | Miami          |
|  75 | St. Petersburg College            | FL    | St. Petersburg |
|  76 | Tallahassee Community College     | FL    | Tallahassee    |
|  77 | Valencia College                  | FL    | Orlando        |
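
If you still specifically want the total_pages number from the question, you don't need the HTML at all: you can count how many pages the API's next_link chain walks through. A minimal sketch, reusing the same data/next_link layout as above (the expected count of 8 is the asker's observation, not something I have re-checked):

import requests

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"

def count_pages(start_url):
    # Follow the API's pagination via `next_link` and count the pages visited.
    pages, url = 0, start_url
    while url is not None:
        response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
        response.raise_for_status()
        url = response.json()["data"]["next_link"]
        pages += 1
    return pages

print("total_pages:", count_pages(LINK))  # should match the 8 seen in the browser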