I want to prepare a dataframe of universities, their abbreviations, and website links.
My code:
import requests
import pandas as pd

abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
abb_df_list = pd.read_html(abb_html)
Current output:
ValueError: No tables found
Expected answer:
df =
| | university_full_name | uni_abb | uni_url |
|---|---|---|---|
| 0 | Albert Einstein College of Medicine | AECOM | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine |
CodePudding user response:
That's one funky page you have there...
First, there are indeed no tables in there, and pd.read_html() only parses <table> elements, hence the ValueError. Second, some organizations don't have links, others have redirect links, and still others use the same abbreviation for more than one organization.
So you need to bring in the heavy artillery: xpath...
import pandas as pd
import requests
from lxml import html as lh
url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
doc = lh.fromstring(response.text)
rows = []
for uni in doc.xpath('//h2[./span[@class="mw-headline"]]//following-sibling::ul//li'):
    info = uni.text.split(' – ')
    abb = info[0]
    # for those with no links
    if not uni.xpath('.//a'):
        rows.append((abb, " ", info[1]))
    # now to account for those using the same abbreviation for multiple organizations
    for a in uni.xpath('.//a'):
        # grab all attributes of the <a>; redirect links carry an extra
        # class attribute, so drop it to keep dat as [href, title]
        dat = a.xpath('./@*')
        if len(dat) == 3:
            del dat[1]
        link = f"https://en.wikipedia.org{dat[0]}"
        rows.append((abb, link, dat[1]))
#and now, at last, to the dataframe
cols = ['abb','url','full name']
df = pd.DataFrame(rows,columns=cols)
df
Output:
abb url full name
0 AECOM https://en.wikipedia.org/wiki/Albert_Einstein_... Albert Einstein College of Medicine
1 AFA https://en.wikipedia.org/wiki/United_States_Ai... United States Air Force Academy
etc.
Note: you can rearrange the order of columns in the dataframe, if you are so inclined.
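For instance, a minimal sketch of that reordering, reusing the df built above (the target column names here are just taken from the question, so treat them as an assumption about what you want):
# reorder and rename to match the layout asked for in the question
df = df[['full name', 'abb', 'url']].rename(columns={
    'full name': 'university_full_name',
    'abb': 'uni_abb',
    'url': 'uni_url',
})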
CodePudding user response:
Select and iterate over only the expected <li> elements and extract their information, but be aware that there is one university without an <a> (SUI – State University of Iowa), so this case should be handled with an if-statement, for example:
for e in soup.select('h2 ul li'):
    data.append({
        'abb': e.text.split('–')[0].strip(),
        'full_name': e.text.split('–')[-1].strip(),
        'url': ('https://en.wikipedia.org' + e.a.get('href')) if e.a else None
    })
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
data = []
for e in soup.select('h2 ul li'):
    data.append({
        'abb': e.text.split('–')[0].strip(),
        'full_name': e.text.split('–')[-1].strip(),
        'url': ('https://en.wikipedia.org' + e.a.get('href')) if e.a else None
    })
pd.DataFrame(data)
Output:
| | abb | full_name | url |
|---|---|---|---|
| 0 | AECOM | Albert Einstein College of Medicine | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine |
| 1 | AFA | United States Air Force Academy | https://en.wikipedia.org/wiki/United_States_Air_Force_Academy |
| 2 | Annapolis | U.S. Naval Academy | https://en.wikipedia.org/wiki/United_States_Naval_Academy |
| 3 | A&M | Texas A&M University, but also others; see A&M | https://en.wikipedia.org/wiki/Texas_A&M_University |
| 4 | A&M-CC or A&M-Corpus Christi | Corpus Christi | https://en.wikipedia.org/wiki/Texas_A&M_University–Corpus_Christi |
...
CodePudding user response:
There are no tables on this page, only lists. So the goal will be to go through the <ul> and then the <li> tags, skipping the lists you are not interested in (the first one and those after the 26th). You can extract the abbreviation of the university this way:
uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
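For instance, applied to a sample entry (the string below is a sample li.text value, shown only to illustrate the dash normalization):
# sample li.text value, for illustration only
text = 'SUI – State University of Iowa'
uni_abb = text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
print(uni_abb)  # SUI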
To get the url and the full name, you have to access the 'href' and 'title' attributes inside the <a> tag:
for a in li.find_all('a', href=True):
    title = a['title']
    url = f"https://en.wikipedia.org{a['href']}"
Accumulate the extracted information into a list, and finally create the dataframe by assigning appropriate column names.
Here is the complete code, in which I use BeautifulSoup:
import requests
import pandas as pd
from bs4 import BeautifulSoup
abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
soup = BeautifulSoup(abb_html, 'html.parser')
l = []
for ul in soup.find_all("ul")[1:26]:
    for li in ul.find_all("li"):
        uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
        for a in li.find_all('a', href=True):
            l.append((a['title'], uni_abb, f"https://en.wikipedia.org{a['href']}"))
df = pd.DataFrame(l, columns=['university_full_name', 'uni_abb', 'uni_url'])
Result:
university_full_name uni_abb uni_url
0    Albert Einstein College of Medicine    AECOM    https://en.wikipedia.org/wiki/Albert_Einstein...
1        United States Air Force Academy      AFA    https://en.wikipedia.org/wiki/United_States_A...