I am trying to build a database on US universities. I have been using Beautiful Soup and Pandas to do so, but have encounter difficulties as there were several tables to scrap per page. In order to reunite the data extracted from two tables, I tried to use .merge(), but haven't succeeded at all.
My code is as follows:
# Connecticut
url='https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Connecticut'
soup=bs(requests.get(url).text)
table = soup.find_all('table')
#Extracting a df for each table
df1 = pd.read_html(str(table))[0]
df1.rename(columns = {'Enrollment(2020)[4]': 'Enrollment', 'Founded[5]':'Founded'}, inplace = True)
df2 = pd.read_html(str(table))[1]
df2=df2.drop(['Type','Ref.'], axis=1)
df_Connecticut=df1.merge(df2, on=['School','Location','Control','Founded'])
df_Connecticut
I have tried to do it with other states, but still encounter the same problem:
Maine
url='https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Maine'
soup=bs(requests.get(url).text)
table = soup.find_all('table')
#Extracting a df for each table
df1 = pd.read_html(str(table))[0]
df1=df1.drop(['Type[a]'], axis=1)
df1.rename(columns = {'Location(s)': 'Location', 'Enrollment (2019)[b]':'Enrollment'}, inplace = True)
df1 = df1.astype({'School':'string','Location':'string','Control':'string','Enrollment':'string','Founded':'string'})
df2 = pd.read_html(str(table))[1]
df2=df2.drop(['Cite'], axis=1)
df2.rename(columns = {'Location(s)': 'Location'}, inplace = True)
df2 = df2.astype({'School':'string','Location':'string','Founded':'string','Closed':'string'})
df_Maine=df1.merge(df2, on=['School','Location','Founded'])
df_Maine```
I am complete beginner in Python.
CodePudding user response:
Would agree with @Clegane that you may mean concat
, cause merge indicates that there have to be values in both that could be matched.
pd.concat([df1,df2], ignore_index=True)
School | Location | Control | Carnegie Classification | Enrollment | Founded | Closed | |
---|---|---|---|---|---|---|---|
0 | Albertus Magnus College | New Haven | Private(Catholic) | Masters university | 1793 | 1925 | nan |
1 | Asnuntuck Community College | Enfield | Public | Associates college | 2388 | 1969[6] | nan |
... | ... | ... | ... | ... | ... | ... | ... |
50 | Silvermine College of Art | New Canaan | Private | nan | nan | 1951 | 1971 |
51 | Wilcox College of Nursing | Middletown | Private | nan | nan | 1908 | 1997 |
Otherwise you should address the how
to make an outer join:
df_Connecticut=df1.merge(df2, on=['School','Location','Control','Founded'], how='outer')
CodePudding user response:
Here is a complete example pertaining to the two states you mentioned. Of course, you can include other states as well in your list.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
big_df = pd.DataFrame()
states = ['Maine', 'Connecticut']
for x in states:
r = s.get(f'https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_{x}')
soup = bs(r.text, 'html.parser')
df = pd.read_html(str(soup))[0]
df.columns = ['School', 'Location', 'Control', 'Type', 'Enrollment', 'Founded']
df['State'] = x
big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df)
Result in terminal:
School Location Control Type Enrollment Founded State
0 Bates College Lewiston Private Baccalaureate college 1964 1855 Maine
1 Beal University Bangor Private (for-profit) Associates college 229 1891[5] Maine
2 Bowdoin College Brunswick Private Baccalaureate college 1969 1794 Maine
3 Central Maine Community College Auburn Public Associates college 3746 1963[6] Maine
4 Colby College Waterville Private Baccalaureate college 2150 1813[7] Maine
... ... ... ... ... ... ... ...
61 University of New Haven West Haven Private Masters university 7587 1920 Connecticut
62 University of Saint Joseph West Hartford Private (Catholic) Masters university 2975 1932 Connecticut
63 Wesleyan University Middletown Private Baccalaureate college 3599 1831 Connecticut
64 Western Connecticut State University Danbury Public Masters university 6463 1903 Connecticut
65 Yale University New Haven Private Doctoral university 14910 1701 Connecticut
66 rows × 7 columns