I'm trying to parse the HDI table and load the data into a pandas DataFrame with the columns Country and HDI_score.
I'm stuck on loading the Nation column with the following code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')
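# Assumption: the line that defines `table` is missing from the post; the first wikitable is used here
table = bsObj.find('table', class_='wikitable')
rows = []
for row in table.find_all('tr'):
    columns = row.find_all('td')
    if columns:
        countries = columns[1].text.strip()
        hdi_score = columns[2].text.strip()
        # Collect rows in a list; DataFrame.append was removed in pandas 2.0
        rows.append({'Countries': countries, 'HDI_score': hdi_score})
df = pd.DataFrame(rows, columns=['Countries', 'HDI_score'])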
So instead of the country names, I get the values from the 'Rank changes over 5 years' column. I've tried changing the column index, but it didn't help.
CodePudding user response:
You could use pandas to grab the appropriate table: passing match='Rank' to read_html
selects the right table, and you can then extract the columns of interest.
import pandas as pd
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index', match='Rank')[0]
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)
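To illustrate how match narrows the result, here is a minimal offline sketch. The SAMPLE_HTML markup and its values are made up for illustration and are much simpler than the real Wikipedia page; only read_html and its match parameter come from the answer above.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup: two tables, only the second one contains the text "Rank".
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Report</th></tr>
  <tr><td>2020</td><td>HDR 2020</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Nation</th><th>HDI</th></tr>
  <tr><td>1</td><td>Norway</td><td>0.957</td></tr>
  <tr><td>2</td><td>Ireland</td><td>0.955</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Rank")

# Keep just the two columns of interest from the matched table.
hdi = tables[0].loc[:, ["Nation", "HDI"]]
print(hdi)
```

On the real page the header may span more than one row, which is presumably why the answer also applies iloc and then renames the columns.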
As per the comments: I see little point in involving bs4 if you are still using pandas, but it can be done as below:
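import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index')
soup = bs(r.content, 'lxml')
# :-soup-contains replaces the deprecated :contains selector in recent soupsieve releases;
# StringIO avoids the pandas warning about passing literal HTML strings
table = pd.read_html(StringIO(str(soup.select_one('table:has(th:-soup-contains("Rank"))'))))[0]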
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)
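The selector in the snippet above can be tried on its own. This is a small sketch on made-up markup (SAMPLE_HTML is hypothetical), assuming a soupsieve version recent enough to support :-soup-contains:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the second table has a header cell containing "Rank".
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Report</th></tr>
  <tr><td>2020</td><td>HDR 2020</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Nation</th><th>HDI</th></tr>
  <tr><td>1</td><td>Norway</td><td>0.957</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# :has() matches a table with a descendant th; :-soup-contains() checks that th's text.
target = soup.select_one('table:has(th:-soup-contains("Rank"))')

headers = [th.get_text(strip=True) for th in target.select("th")]
print(headers)
```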
CodePudding user response:
Note: I voted for QHarr's answer because, in my opinion, using pandas is the most straightforward solution.
In addition, and to answer your question: selecting the columns with BeautifulSoup only is also possible. Just combine CSS selectors and stripped_strings.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup
html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')
pd.DataFrame(
    [list(r.stripped_strings)[-3:-1] for r in bsObj.select('table tr:has(span[data-sort-value])')],
    columns=['Countries', 'HDI_score']
)
Output
| Countries | HDI_score |
|---|---|
| Norway | 0.957 |
| Ireland | 0.955 |
| Switzerland | 0.955 |
| ... | ... |
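To see why the stripped_strings approach works where indexing the td cells does not, here is a minimal offline sketch. The markup is hypothetical and assumes the country name sits in a th cell rather than a td, which would explain the behaviour described in the question; the real page may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified rows. Assumption: the country cell is a <th>, not a <td>.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Change</th><th>Nation</th><th>HDI</th><th>Growth</th></tr>
  <tr><td>1</td><td><span data-sort-value="0">-</span></td>
      <th><a>Norway</a></th><td>0.957</td><td>0.20%</td></tr>
  <tr><td>2</td><td><span data-sort-value="1">(1)</span></td>
      <th><a>Ireland</a></th><td>0.955</td><td>0.65%</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
data_rows = soup.select("table tr:has(span[data-sort-value])")

# find_all('td') skips the <th> country cell, so no td index ever holds the name.
td_only = [td.get_text(strip=True) for td in data_rows[0].find_all("td")]

# stripped_strings walks every text node in the row, whatever the cell tag is.
rows = [list(tr.stripped_strings)[-3:-1] for tr in data_rows]
df = pd.DataFrame(rows, columns=["Countries", "HDI_score"])
print(td_only)
print(df)
```

The [-3:-1] slice counts from the end of the row, so it depends on the number of trailing columns and would need adjusting if the table layout changes.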