I'm writing Python code that scrapes the following website and looks for the value of "total_pages" in it.
The website is https://www.usnews.com/best-colleges/fl
When I open the website in a browser and inspect the page source, the value of "total_pages" is 8. I want my Python code to get the same value.
I have written the following code:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
main_site = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers)
main_site_content = main_site.content
main_site_content_soup = BeautifulSoup(main_site_content, "html.parser")
But then I get stuck on how to look for "total_pages" in the parsed data. I have tried the find_all() method, but with no luck. I think I'm not using the method correctly.
One note: the solution does not have to use BeautifulSoup. I only used BeautifulSoup because I am somewhat familiar with it.
CodePudding user response:
No need for BeautifulSoup. Here I make a request to their API to get the list of universities.
`from rich import print` is used to pretty-print the JSON, which should make it easier to read.
If you need more help or advice, leave a comment below.
import pandas as pd
import requests
from rich import print

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"


def get_data(url):
    """Fetch one page of results and return (items, next_link)."""
    print("Making request to:", url)
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        print("Request Successful!")
        data = response.json()["data"]
        return data["items"], data["next_link"]
    print("Request failed!")
    return None, None


def main():
    print("Starting Scraping...")
    items, next_link = get_data(LINK)
    items = items or []  # a failed first request yields no items

    # If there's a `next_link`, scrape it and append its items to the list.
    while next_link is not None:
        print("Getting data from:", next_link)
        new_items, next_link = get_data(next_link)
        if new_items:
            items.extend(new_items)

    # Clean the data for the pandas DataFrame.
    items = [
        {
            "name": item["institution"]["displayName"],
            "state": item["institution"]["state"],
            "city": item["institution"]["city"],
        }
        for item in items
    ]
    df = pd.DataFrame(items)
    # to_markdown() needs the optional `tabulate` package installed.
    print(df.to_markdown())


if __name__ == "__main__":
    main()
The output looks like this:
 | name | state | city |
---|---|---|---|
0 | University of Florida | FL | Gainesville |
1 | Florida State University | FL | Tallahassee |
2 | University of Miami | FL | Coral Gables |
3 | University of South Florida | FL | Tampa |
4 | University of Central Florida | FL | Orlando |
5 | Florida International University | FL | Miami |
6 | Florida A&M University | FL | Tallahassee |
7 | Florida Institute of Technology | FL | Melbourne |
8 | Nova Southeastern University | FL | Ft. Lauderdale |
... | ... | ... | ... |
74 | St. John Vianney College Seminary | FL | Miami |
75 | St. Petersburg College | FL | St. Petersburg |
76 | Tallahassee Community College | FL | Tallahassee |
77 | Valencia College | FL | Orlando |
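If you only need the "total_pages" value from the original page, here is a minimal sketch. It assumes the value sits in the page source as a JSON key inside a script block (for example `"total_pages": 8`), which would explain why find_all() does not find it: it is not an HTML tag or attribute. The helper name `extract_total_pages` and the sample string are hypothetical and only for illustration.

```python
import re
from typing import Optional

# Assumption: the page embeds the value as a JSON key, e.g. "total_pages": 8.
# The optional quotes also cover the case where the number is stored as a string.
TOTAL_PAGES_RE = re.compile(r'"total_pages"\s*:\s*"?(\d+)')


def extract_total_pages(html: str) -> Optional[int]:
    """Return the first "total_pages" value found in the raw page text, or None."""
    match = TOTAL_PAGES_RE.search(html)
    return int(match.group(1)) if match else None


# Hypothetical sample of what the embedded JSON might look like.
sample = '<script>window.__DATA__ = {"total_pages": 8, "items": []};</script>'
print(extract_total_pages(sample))  # prints 8
```

With the code from the question, you would pass `main_site.text` to `extract_total_pages`. If it returns None, the value is probably loaded by JavaScript after the page renders, and the API request above is the more reliable route.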