Web scraping an embedded second page in Python

Time:11-27

I am working on a web scraping project. My goal is to scrape the Shanghai university ranking to get each university's name, country, and rank. Right now I am focusing only on the name.

import requests

from bs4 import BeautifulSoup

arwu = open('arwu.txt', 'a')
arwu.truncate()
universities = []
#Gets the url from which it should web scrape
url = 'https://www.shanghairanking.com/rankings/arwu/2021.html'
response = requests.get(url)

#initializes the bs4 html parser
soup = BeautifulSoup(response.text, "html.parser")

#retrieves all the university names that are displayed and formats them
def find_universities():
    for university in range(len(soup.findAll(class_ ='global-univ'))):
        one_a_tag = str(soup.findAll(class_ = 'global-univ')[university].text) 
        one_a_tag = one_a_tag[len(one_a_tag)//2 + 16:]
        universities.append(str(one_a_tag))
    return universities

universities=find_universities()
for x in range(len(universities)):
  arwu.write(universities[x] + "\n")
arwu.close()

As of right now, this only retrieves the first 30 universities displayed on the first page. How can I access the other pages?
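
A side note on the file handling in the script above: the file is opened in append mode ('a'), where the position starts at the end of the file, so truncate() removes nothing and every run adds another copy of the names. If the intent is to overwrite the file on each run, a minimal sketch could look like the following. The names save_names and demo_path are hypothetical and only used for illustration, and the demo writes to a temporary file rather than arwu.txt.

```python
import os
import tempfile


def save_names(names, path):
    """Overwrite `path` with one name per line ('w' mode empties the file first)."""
    with open(path, "w", encoding="utf-8") as fh:
        for name in names:
            fh.write(name + "\n")


# Demo path in the temp directory (an assumption, so the sketch leaves arwu.txt alone).
demo_path = os.path.join(tempfile.gettempdir(), "arwu_demo.txt")

# Writing twice shows that the second run replaces the first instead of appending.
save_names(["Harvard University", "Stanford University"], demo_path)
save_names(["Harvard University", "Stanford University"], demo_path)

with open(demo_path, encoding="utf-8") as fh:
    saved = fh.read().splitlines()

print(saved)
```

The with block also closes the file automatically, so no explicit close() call is needed.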

CodePudding user response:

The data on the following pages is loaded dynamically by JavaScript, which is why BeautifulSoup alone can't parse it. To grab the data from the next pages, you need a browser automation tool such as Selenium. Here I use Selenium with BeautifulSoup to extract the data from the following pages, and it works fine.

import time
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome('chromedriver.exe')
driver.maximize_window()
time.sleep(8)

url = 'https://www.shanghairanking.com/rankings/arwu/2021'
driver.get(url)
time.sleep(4)

universities = []
while True:  # loop over the pages
    
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for university in soup.select('div.link-container > a'):
        un = university.select_one('span.univ-name')
        versity = un.get_text(strip=True) if un else None
        print(versity)
    print("-" * 85)
    
    next_page=driver.find_element_by_xpath('(//a[@])[3]')#next page   
    if next_page:
        next_page.click()
        time.sleep(2)
    else:
        break

Output:

Harvard University
Stanford University
University of Cambridge
Massachusetts Institute of Technology (MIT)
University of California, Berkeley
Princeton University
University of Oxford
Columbia University
California Institute of Technology
University of Chicago
Yale University
Cornell University
Paris-Saclay University
University of California, Los Angeles
University of Pennsylvania
Johns Hopkins University
University College London
University of California, San Diego
University of Washington
University of California, San Francisco
ETH Zurich
University of Toronto
Washington University in St. Louis
The University of Tokyo
Imperial College London
New York University
Tsinghua University
University of North Carolina at Chapel Hill
University of Copenhagen
None
------------------------------------------------------------------------------------- 
University of Wisconsin - Madison
Duke University
The University of Melbourne
Northwestern University
Sorbonne University
The University of Manchester
Kyoto University
PSL University
The University of Edinburgh
University of Minnesota, Twin Cities
The University of Texas at Austin
Karolinska Institute
Rockefeller University
University of British Columbia
Peking University
University of Colorado at Boulder
King's College London
The University of Texas Southwestern Medical Center at Dallas
University of Munich
Utrecht University
The University of Queensland
Technical University of Munich
Zhejiang University
University of Zurich
University of Illinois at Urbana-Champaign
University of Maryland, College Park
Heidelberg University
University of California, Santa Barbara
Shanghai Jiao Tong University
University of Geneva
None
------------------------------------------------------------------------------------- 
University of Oslo
University of Southern California
University of Science and Technology of China
University of Groningen
The University of New South Wales
Vanderbilt University
McGill University
The University of Texas M. D. Anderson Cancer Center
University of Sydney
University of California, Irvine
Aarhus University
Ghent University
University of Paris
Stockholm University
National University of Singapore
The Australian National University
Fudan University
University of Bristol
Uppsala University
Monash University
Nanyang Technological University
University of Helsinki
Leiden University
Nagoya University
University of Bonn
Purdue University - West Lafayette
KU Leuven
University of Basel
Sun Yat-sen University
The Hebrew University of Jerusalem
None
------------------------------------------------------------------------------------- 
Swiss Federal Institute of Technology Lausanne
McMaster University
Weizmann Institute of Science
Technion-Israel Institute of Technology      
Boston University
The University of Western Australia
Carnegie Mellon University
Moscow State University
University of Florida
University of California, Davis
Aix Marseille University
Arizona State University
Brown University
Case Western Reserve University
Emory University
Erasmus University Rotterdam
Georgia Institute of Technology
Huazhong University of Science and Technology
Icahn School of Medicine at Mount Sinai      
Indiana University Bloomington
King Abdulaziz University
King Saud University
Mayo Clinic Alix School of Medicine
Michigan State University
Nanjing University
Norwegian University of Science and Technology - NTNU
Pennsylvania State University - University Park
Radboud University Nijmegen
Rice University
Rutgers, The State University of New Jersey - New Brunswick
None
------------------------------------------------------------------------------------- 
Seoul National University
The Chinese University of Hong Kong
The Ohio State University - Columbus
The University of Adelaide
The University of Hong Kong
The University of Sheffield
Tokyo Institute of Technology
Université Grenoble Alpes
Université libre de Bruxelles (ULB)
University of Alberta
University of Amsterdam
University of Arizona
University of Bern
University of Birmingham
University of Freiburg
University of Goettingen
University of Gothenburg
University of Lausanne
University of Leeds
University of Liverpool
University of Montreal
University of Nottingham
University of Pittsburgh
University of Sao Paulo
University of Strasbourg
University of Utah
University of Warwick
Vrije Universiteit Amsterdam
Wageningen University & Research
Xi'an Jiaotong University
None
------------------------------------------------------------------------------------- 
University of Houston
University of Illinois at Chicago
University of Innsbruck
University of Iowa
University of Kansas
University of Kiel
University of Leipzig
University of Lisbon
University of Lorraine
University of Mainz
University of Massachusetts Amherst
University of Massachusetts Medical School - Worcester
University of Miami
University of Missouri - Columbia
University of Nebraska - Lincoln
University of Ottawa
University of Science and Technology Beijing
University of South Florida
University of Technology Sydney
University of Tennessee - Knoxville
University of Tsukuba
University of Turin
University of Wollongong
University of Wuerzburg
Virginia Commonwealth University
Virginia Polytechnic Institute and State University
Vrije Universiteit Brussel (VUB)
Western University
Xiamen University
Yonsei University
None
------------------------------------------------------------------------------------- 
Indian Institute of Science
Istanbul University
Jagiellonian University
Jinan University
Kansas State University
King Fahd University of Petroleum & Minerals
Kobe University
Kyung Hee University
Mahidol University
Medical University of Innsbruck
Nanjing Normal University
Nanjing University of Information Science & Technology
National University of Ireland, Galway
National Yang Ming Chiao Tung University
Northern Arizona University
Okayama University
Pohang University of Science and Technology
Pompeu Fabra University
Pusan National University
Qingdao University
Queen's University Belfast
Rensselaer Polytechnic Institute
Saint Louis University
Scuola Normale Superiore - Pisa
Shandong University of Science and Technology
ShanghaiTech University
South China Agricultural University
Southern Medical University
None
------------------------------------------------------------------------------------- 
University of Kent
University of Konstanz
University of Ljubljana
University of Navarra
University of Nevada - Reno
University of New Hampshire
University of Oklahoma - Norman
University of Palermo
University of Parma
University of Plymouth
University of Portsmouth
University of Regensburg
University of Rennes 1
University of Roma - Tor Vergata
University of Rostock
University of Salerno
University of Sherbrooke
University of Siena
Tampere University
University of Tromso
University of Ulsan
University of Verona
University of Vigo
University of Zaragoza
University Rovira i Virgili
Waseda University
Wenzhou Medical University
Yunnan University
Zhejiang University of Technology
None
------------------------------------------------------------------------------------- 
Dalian Maritime University
Dokuz Eylul University
Federal University of Sao Carlos
Fluminense Federal University
Fujian Agriculture and Forestry University
Fujian Medical University
Fujian Normal University
Graz University of Technology
Guangxi University
Hacettepe University
Henan University
Indian Institute of Technology Delhi
Indian Institute of Technology Kharagpur
Indian Institute of Technology Madras
INHA University
Jawaharlal Nehru University
Kanazawa University
Kaohsiung Medical University
Kindai University
Kunming University of Science and Technology
Lincoln University
Mansoura University
Massey University
Medical University of Warsaw
National Research Nuclear University MEPhI (Moscow Engineering Physics Institute)     
New Jersey Institute of Technology
New Mexico State University
Ningbo University
North China Electric Power University
None
------------------------------------------------------------------------------------- 
Uniformed Services University of the Health Sciences
Universidad Andrés Bello
Universidad de Las Palmas de Gran Canaria
Universidad Pablo de Olavide
Université Gustave Eiffel
University of Agriculture Faisalabad
University of Alcalá
University of Cagliari
University of Concepcion
University of Cordoba
University of Engineering and Technology (UET)
University of Girona
University of Greenwich
University of Hull
University of L'Aquila
University of North Carolina at Greensboro
University of Savoy
University of St. Gallen
University of Stirling
University of Tabriz
University of the Punjab
University of Thessaly
University of Urbino
University of Veterinary Medicine Vienna
Vellore Institute of Technology
Warsaw University of Life Sciences
Westlake University
Wroclaw Medical University
Yanshan University
Zagazig University
None
------------------------------------------------------------------------------------- 
University of Ulster
University of Valladolid
University of Wuppertal
University Paris Est Creteil
Vilnius Gediminas Technical University
Warsaw University of Technology
Williams College
Wroclaw University of Science and Technology
Wuhan University of Science and Technology
Yantai University
None
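
Two details of the loop in this answer are worth noting. Selenium's find_element calls raise NoSuchElementException when nothing matches instead of returning a falsy value, so the `else: break` branch is not what ends the loop. The `None` lines in the output come from links that have no `span.univ-name` child. The sketch below shows one way to structure the loop with an explicit stop condition and with empty entries skipped. It is not the answer's code: the function scrape_all_pages, its parameters, and the fake_* helpers are hypothetical, and the browser is replaced by a stand-in so the loop logic can run without a driver.

```python
from typing import Callable, Iterable, List, Optional


def scrape_all_pages(
    get_html: Callable[[], str],
    parse_page: Callable[[str], Iterable[Optional[str]]],
    go_next: Callable[[], bool],
    max_pages: int = 100,
) -> List[str]:
    """Collect names page by page until go_next() reports there is no further page."""
    names: List[str] = []
    for _ in range(max_pages):  # hard cap so a stuck "next" button cannot loop forever
        for name in parse_page(get_html()):
            if name:  # skip None and empty strings
                names.append(name)
        if not go_next():
            break
    return names


# Stand-in "browser": three fake pages instead of a live driver (assumption for the demo).
pages = [["Harvard University", None], ["Duke University"], ["University of Oslo", None]]
state = {"index": 0}


def fake_get_html() -> str:
    # A real version would return the rendered page source.
    return str(state["index"])


def fake_parse_page(html: str):
    # A real version would parse the HTML and yield the university names.
    return pages[int(html)]


def fake_go_next() -> bool:
    # A real version would click the "next" control and return False when that fails.
    if state["index"] + 1 >= len(pages):
        return False
    state["index"] += 1
    return True


all_names = scrape_all_pages(fake_get_html, fake_parse_page, fake_go_next)
print(all_names)
```

With Selenium, get_html could be something like `lambda: driver.page_source`, and go_next could wrap the lookup and click of the next-page link in a try/except for NoSuchElementException, returning False when the link is not found.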