How to scrape specific columns from a table with BeautifulSoup and return them as a pandas DataFrame

I'm trying to parse the HDI table and load the data into a pandas DataFrame with two columns: Countries and HDI_score.

I'm stuck on loading the Nation column (the country names) with the following code:

import requests
import pandas as pd
from bs4 import BeautifulSoup
html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')

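# Assumption: the original snippet never defines `table`; the first "wikitable"
# on the page is used here and may need adjusting.
table = bsObj.find('table', class_='wikitable')

df = pd.DataFrame(columns=['Countries', 'HDI_score'])
for row in table.find_all('tr'):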
    columns = row.find_all('td')
    
    if(columns != []):
        countries = columns[1].text.strip()
        hdi_score = columns[2].text.strip()
        # DataFrame.append was removed in pandas 2.0; assigning to the next index label adds a row
        df.loc[len(df)] = [countries, hdi_score]

Result from my code

So instead of the country names, I get the values from the 'Rank changes over 5 years' column. I've tried changing the column index, but it didn't help.
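One possible explanation for this symptom, which is an assumption and not something confirmed in this thread, is that the country name sits in a `th` cell rather than a `td` cell, so `find_all('td')` skips it and every index shifts. The sketch below reproduces that effect on a small inline table; the HTML and its values are invented for illustration and are not copied from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample rows, invented for illustration (not taken from the live page).
html = """
<table>
<tr><th>Rank</th><th>Change</th><th>Nation</th><th>HDI</th><th>Growth</th></tr>
<tr><td>1</td><td>0</td><th scope="row">Norway</th><td>0.957</td><td>0.20%</td></tr>
<tr><td>2</td><td>1</td><th scope="row">Ireland</th><td>0.955</td><td>0.65%</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

td_only = []    # what indexing into the <td> cells alone picks up
all_cells = []  # what indexing into <th> and <td> cells together picks up
for row in soup.find_all("tr"):
    tds = row.find_all("td")
    if not tds:
        continue  # header row: it contains only <th> cells
    td_only.append((tds[1].get_text(strip=True), tds[2].get_text(strip=True)))
    cells = row.find_all(["th", "td"])
    all_cells.append((cells[2].get_text(strip=True), cells[3].get_text(strip=True)))

print(td_only)    # the rank-change value lands where the country was expected
print(all_cells)  # country name and HDI score
```

If the real page is structured this way, collecting both `th` and `td` cells per row keeps the indices aligned with the visible columns.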

CodePudding user response：

You could use pandas to grab the table directly: pd.read_html with match='Rank' returns only the tables whose text matches 'Rank', and you can then extract the columns of interest.

import pandas as pd

table = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index', match='Rank')[0]
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)
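To see what `match` does without hitting the network, here is a minimal sketch on an inline HTML string. The two tables and their values are made up for the example, and it assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second one contains the text "Rank".
html = """
<table>
  <tr><th>Year</th><th>Note</th></tr>
  <tr><td>1990</td><td>example</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Nation</th><th>HDI</th></tr>
  <tr><td>1</td><td>Norway</td><td>0.957</td></tr>
  <tr><td>2</td><td>Ireland</td><td>0.955</td></tr>
</table>
"""

# read_html returns a list of DataFrames; match keeps only tables whose text matches the pattern.
tables = pd.read_html(StringIO(html), match="Rank")

# Select the two columns of interest by name.
hdi = tables[0][["Nation", "HDI"]]
print(hdi)
```

On the real page the header appears to span more than one row, which is probably why the answer above follows `.loc` with `.iloc[:, :2]` and then renames the columns; this flat sample does not need that step.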

As per the comments: I see little point in involving bs4 if you are still using pandas anyway, but it can be done as below:

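from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index')
soup = bs(r.content, 'lxml')
# :-soup-contains replaces the deprecated :contains selector in recent soupsieve versions;
# StringIO avoids the pandas deprecation warning for literal HTML strings.
table = pd.read_html(StringIO(str(soup.select_one('table:has(th:-soup-contains("Rank"))'))))[0]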
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)

CodePudding user response：

Note: I voted for QHarr's answer because, in my opinion, using pandas is the most straightforward solution.

In addition, and to answer your actual question: selecting the columns with BeautifulSoup alone is also possible. Just combine CSS selectors and stripped_strings.

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup
html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')

pd.DataFrame(
    [list(r.stripped_strings)[-3:-1] for r in bsObj.select('table tr:has(span[data-sort-value])')],
    columns=['Countries', 'HDI_score']
)

Output

Countries	HDI_score
Norway	0.957
Ireland	0.955
Switzerland	0.955
...	...
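To make the selector and the `[-3:-1]` slice easier to follow, here is a minimal offline sketch of the same idea. The inline HTML is a hypothetical stand-in for the Wikipedia table, with invented values, and it assumes each data row ends with the country name, the HDI score, and one trailing cell.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical rows, invented for illustration; only data rows carry a span[data-sort-value].
html = """
<table>
<tr><th>Rank</th><th>Change</th><th>Nation</th><th>HDI</th><th>Growth</th></tr>
<tr><td>1</td><td><span data-sort-value="0">0</span></td><th>Norway</th><td>0.957</td><td>0.20%</td></tr>
<tr><td>2</td><td><span data-sort-value="1">1</span></td><th>Ireland</th><td>0.955</td><td>0.65%</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
# :has(...) keeps only rows that contain the span, which skips the header row.
for tr in soup.select("table tr:has(span[data-sort-value])"):
    strings = list(tr.stripped_strings)  # every non-empty text fragment in the row, in order
    rows.append(strings[-3:-1])          # third-from-last and second-from-last: name and score

df = pd.DataFrame(rows, columns=["Countries", "HDI_score"])
print(df)
```

Counting from the end of the row means the slice does not depend on how many cells precede the country name. Note that the scores stay strings here; convert them with `pd.to_numeric` if numbers are needed.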