I want to scrape https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)
The code I have scrapes only the top-level headers:
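```python
import pandas as pd

# gdp is assumed to be a list of <table> tags (e.g. from BeautifulSoup's find_all)
table1 = gdp[0]
body = table1.find_all("tr")
head = body[0]  # first row only, so the sub-header row is never read
headings = []
for item in head.find_all("th"):
    item = item.text.rstrip("\n")
    headings.append(item)
df = pd.DataFrame(columns=headings)
df.head()
```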
I need help scraping both the headers and the sub-headers [![enter image description here][1]][1]. The expected pandas DataFrame should look like this: [![enter image description here][2]][2]
[1]: https://i.stack.imgur.com/mBWOm.png
[2]: https://i.stack.imgur.com/Fun99.png
CodePudding user response:
Use `read_html` and select the third table. Passing `header=[0, 1]` reads the two header rows as a `MultiIndex`. The next step is to flatten it: remove everything after `[` in the first level, and join both levels when they differ, in a list comprehension:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
df = pd.read_html(url, header=[0, 1])[2]
df.columns = [(f'{a.split("[")[0]} {b}') if a!=b else a for a, b in df.columns]
print(df)
     Country/Territory UN Region  ... World Bank Estimate World Bank Year
0                World         —  ...            96100091            2021
1        United States  Americas  ...            22996100            2021
2                China      Asia  ...            17734063            2021
3                Japan      Asia  ...             4937422            2021
4              Germany    Europe  ...             4223116            2021
..                 ...       ...  ...                 ...             ...
212              Palau   Oceania  ...                 258            2020
213           Kiribati   Oceania  ...                 181            2020
214              Nauru   Oceania  ...                 133            2021
215         Montserrat  Americas  ...                   —               —
216             Tuvalu   Oceania  ...                  63            2021

[217 rows x 8 columns]
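The flattening step can be tried offline on a small hand-built frame. This is only a sketch: the column labels mimic the Wikipedia header, the footnote markers `[1][13]` and the single data row are made up for illustration.

```python
import pandas as pd

# Toy two-level header standing in for the Wikipedia table.
# The "[1][13]" footnote markers and the data row are invented.
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("UN Region", "UN Region"),
    ("IMF[1][13]", "Estimate"),
    ("IMF[1][13]", "Year"),
])
toy = pd.DataFrame([["Country A", "Europe", "1,234", "2022"]], columns=columns)

# Same list comprehension as in the answer: drop everything after "[" in the
# first level, then join both levels unless they are identical.
toy.columns = [f'{a.split("[")[0]} {b}' if a != b else a for a, b in toy.columns]

print(list(toy.columns))
# ['Country/Territory', 'UN Region', 'IMF Estimate', 'IMF Year']
```

Columns whose two header rows carry the same text keep a single name, while the others become "source plus sub-header".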
If you also need to convert the values to numeric, use:
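import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
# Treat the '—' placeholder as NaN while parsing
df = pd.read_html(url, header=[0, 1], na_values=['—'])[2]
# Flatten the two header rows into single column names
df.columns = [(f'{a.split("[")[0]} {b}') if a!=b else a for a, b in df.columns]
# Strip footnote prefixes ending in ']' from the text columns
obj_cols = df.select_dtypes(object).columns
df[obj_cols] = df[obj_cols].apply(lambda x: x.str.split(']').str[-1])
# Remove thousands separators and convert all columns after the first two to numbers
df.iloc[:, 2:] = df.iloc[:, 2:].replace(',','', regex=True).apply(pd.to_numeric)
print(df.head())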
   Country/Territory UN Region  IMF Estimate  IMF Year  \
0              World       NaN    93863851.0    2021.0
1      United States  Americas    25346805.0    2022.0
2              China      Asia    19911593.0    2022.0
3              Japan      Asia     4912147.0    2022.0
4            Germany    Europe     4256540.0    2022.0

   United Nations Estimate  United Nations Year  World Bank Estimate  \
0               87461674.0               2020.0           96100091.0
1               20893746.0               2020.0           22996100.0
2               14722801.0               2020.0           17734063.0
3                5057759.0               2020.0            4937422.0
4                3846414.0               2020.0            4223116.0

   World Bank Year
0           2021.0
1           2021.0
2           2021.0
3           2021.0
4           2021.0
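The cleaning steps can also be checked on a toy frame without hitting Wikipedia. The sketch below assumes footnote prefixes such as `[n 1]` and the `—` placeholder appear in the value columns; the rows are invented. It assigns the converted columns back by name rather than through an `iloc` slice, because in some pandas versions an `iloc` assignment may keep the original object dtype.

```python
import pandas as pd

# Invented rows: one clean value, one with a footnote prefix, one missing.
toy = pd.DataFrame({
    "Country/Territory": ["Country A", "Country B", "Country C"],
    "IMF Estimate": ["1,234,567", "[n 1]89,012", "—"],
    "IMF Year": ["2022", "[n 2]2021", "—"],
})

value_cols = ["IMF Estimate", "IMF Year"]
for col in value_cols:
    cleaned = toy[col].str.split("]").str[-1]            # keep text after the last "]"
    cleaned = cleaned.str.replace(",", "", regex=False)  # drop thousands separators
    # Anything that still is not a number (here "—") becomes NaN
    toy[col] = pd.to_numeric(cleaned, errors="coerce")

print(toy)
print(toy.dtypes)
```

Because of the missing value, both converted columns end up as floats, which matches the `.0` suffixes in the output above.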
CodePudding user response:
Check out the pandas documentation; `pandas.read_html` should help.
There will be a little data munging left, but it can take you a long way:
import pandas as pd
data = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")
print(data)
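`read_html` returns a list of DataFrames, one per table it finds, so part of the remaining munging is picking the right one. A minimal sketch, assuming the table of interest can be recognised by one of its column labels; `pick_table` is a hypothetical helper and the `tables` list is a hand-built stand-in for the real result:

```python
import pandas as pd

def pick_table(tables, keyword):
    """Return the first DataFrame whose column labels mention `keyword`."""
    for table in tables:
        labels = " ".join(str(col) for col in table.columns)
        if keyword in labels:
            return table
    raise ValueError(f"no table with a column matching {keyword!r}")

# Stand-in for the list returned by pd.read_html(url); contents are invented.
tables = [
    pd.DataFrame({"Note": ["some sidebar text"]}),
    pd.DataFrame({"Country/Territory": ["World"], "UN Region": ["—"]}),
]

gdp_table = pick_table(tables, "Country/Territory")
print(gdp_table)
```

`read_html` also accepts a `match` argument that keeps only tables containing text matching a string or regex, which may make a helper like this unnecessary.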