Web scraping - header and sub-header

I want to scrape https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)

The code I have scrapes only the top-level headers:

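```
import pandas as pd

# gdp is assumed to hold the <table> elements found earlier (that code is not shown)
table1 = gdp[0]
body = table1.find_all("tr")
head = body[0]  # first row only, so sub-headers in the second row are missed
headings = []
for item in head.find_all("th"):
    item = item.text.rstrip("\n")
    headings.append(item)

df = pd.DataFrame(columns=headings)
df.head()
```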


I need help scraping both the header and the sub-headers [![enter image description here][1]][1]. The resulting pandas DataFrame should look like this: [![enter image description here][2]][2]


  [1]: https://i.stack.imgur.com/mBWOm.png
  [2]: https://i.stack.imgur.com/Fun99.png

CodePudding user response：

Use read_html and select the third table; header=[0, 1] reads the two header rows as a MultiIndex. The next step is to flatten it: in a list comprehension, remove everything after [ from the first level and join the two levels when they differ:

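import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
# header=[0, 1] reads the first two rows as a two-level (MultiIndex) header
df = pd.read_html(url, header=[0, 1])[2]

# Drop footnote markers such as "[1]" from the top level; join the levels when they differ
df.columns = [(f'{a.split("[")[0]} {b}') if a!=b else a for a, b in df.columns]
print(df)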
    Country/Territory UN Region  ... World Bank Estimate World Bank Year
0               World         —  ...            96100091            2021
1       United States  Americas  ...            22996100            2021
2               China      Asia  ...            17734063            2021
3               Japan      Asia  ...             4937422            2021
4             Germany    Europe  ...             4223116            2021
..                ...       ...  ...                 ...             ...
212             Palau   Oceania  ...                 258            2020
213          Kiribati   Oceania  ...                 181            2020
214             Nauru   Oceania  ...                 133            2021
215        Montserrat  Americas  ...                   —               —
216            Tuvalu   Oceania  ...                  63            2021

[217 rows x 8 columns]
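
The flattening step can be tried without downloading the page. This is a minimal sketch, assuming the header pairs look roughly like the ones read_html returns for this table; the helper name flatten_columns and the sample tuples are illustrative and not taken from the live page:

```python
def flatten_columns(columns):
    """Join two header levels into one name per column.

    Text from "[" onwards (a footnote marker) is dropped from the top
    level; when both levels are identical, the name is used once.
    """
    flat = []
    for top, sub in columns:
        if top == sub:
            flat.append(top)
        else:
            flat.append(f'{top.split("[")[0]} {sub}')
    return flat


# Hypothetical sample of (top level, sub level) header pairs.
sample = [
    ("Country/Territory", "Country/Territory"),
    ("UN Region", "UN Region"),
    ("IMF[1][13]", "Estimate"),
    ("IMF[1][13]", "Year"),
    ("World Bank[14]", "Estimate"),
]

print(flatten_columns(sample))
```

With a real frame, df.columns = flatten_columns(df.columns) should behave like the list comprehension, since iterating over a two-level MultiIndex yields tuples.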

If you also need to convert the values to numeric, use:

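import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
# na_values=['—'] turns the dash placeholders into NaN
df = pd.read_html(url, header=[0, 1], na_values=['—'])[2]

df.columns = [(f'{a.split("[")[0]} {b}') if a!=b else a for a, b in df.columns]
obj_cols = df.select_dtypes(object).columns

# Keep only the text after the last "]" to drop footnote markers inside cells
df[obj_cols] = df[obj_cols].apply(lambda x: x.str.split(']').str[-1])

# Strip thousands separators and convert; assigning by column label replaces the
# columns, so the numeric dtype is kept
num_cols = df.columns[2:]
df[num_cols] = df[num_cols].replace(',', '', regex=True).apply(pd.to_numeric)
print(df.head())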
  Country/Territory UN Region  IMF Estimate  IMF Year  \
0             World       NaN    93863851.0    2021.0   
1     United States  Americas    25346805.0    2022.0   
2             China      Asia    19911593.0    2022.0   
3             Japan      Asia     4912147.0    2022.0   
4           Germany    Europe     4256540.0    2022.0   

   United Nations Estimate  United Nations Year  World Bank Estimate  \
0               87461674.0               2020.0           96100091.0   
1               20893746.0               2020.0           22996100.0   
2               14722801.0               2020.0           17734063.0   
3                5057759.0               2020.0            4937422.0   
4                3846414.0               2020.0            4223116.0   

   World Bank Year  
0           2021.0  
1           2021.0  
2           2021.0  
3           2021.0  
4           2021.0
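
To see what the cleaning steps do to individual values, here is a plain-Python sketch of the same idea applied to one cell at a time. The helper name clean_cell and the sample strings are assumptions for illustration; the pandas version does this for whole columns at once.

```python
def clean_cell(text):
    """Turn a scraped cell such as '[n 1]19,911,593' into a float.

    Returns None for the dash that the table uses for missing values.
    """
    text = text.strip()
    if text == "—":
        return None
    # Keep only what follows the last "]" (drops footnote markers).
    text = text.split("]")[-1]
    # Remove thousands separators before converting.
    return float(text.replace(",", ""))


# Hypothetical sample cells, not copied from the live page.
for raw in ["25,346,805", "[n 1]2022", "—"]:
    print(repr(raw), "->", clean_cell(raw))
```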

CodePudding user response：

Check the pandas documentation; pandas.read_html should help.

It returns a list of DataFrames, one per table on the page. There will be a little data munging left, but it can take you a long way:

import pandas as pd

data = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")

print(data)