How to extract only header names from table into a list

Time:11-13

I'm trying to extract just the header values from a Wikipedia table into a list. The following code is what I have so far, but the output isn't what I expect.

import requests
from bs4 import BeautifulSoup


page = requests.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find('table')



column_names = [item.get_text() for item in table.find_all('th')]
column_names[2:18]

# current output: ['Origin of name[2][3]\n', 'Group\n','Period\n', 'Block\n' ...]

# expected output: ['Atomic Number', 'Symbol', 'Name', 'Origin of name',
#                   'Group', 'Period', 'Standard atomic weight', 'Density',
#                   'Melting Point'...]

CodePudding user response:

I believe you need to do some data cleaning because of how the HTML is structured. The table has a multi-level header (a MultiIndex in pandas terms), so you won't get a flat list of column names directly. Note that pandas has the read_html() function, which takes the HTML and parses the tables for you, so you don't need to call BeautifulSoup yourself.
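To make the multi-level header point concrete, here is a minimal sketch that builds a small MultiIndex by hand instead of downloading the page. The three columns and the single data row are a hypothetical stand-in, loosely modelled on the Wikipedia table, not the real parsed result.

```python
import pandas as pd

# Hypothetical two-row header: the top row groups columns,
# the second row holds the sub-label or unit.
columns = pd.MultiIndex.from_tuples([
    ("Element", "Symbol"),
    ("Element", "Name"),
    ("Density", "(g/cm3)"),
])
df = pd.DataFrame([["H", "Hydrogen", 0.00008988]], columns=columns)

# Each column label is a tuple with one entry per header row.
labels = df.columns.tolist()

# Picking a single level gives a flat list, but neither level alone
# is a complete set of readable names.
top_level = df.columns.get_level_values(0).tolist()
second_level = df.columns.get_level_values(1).tolist()

print(labels)
print(top_level)
print(second_level)
```

Neither level on its own matches the list you expect, which is why some cleaning or manual naming is needed.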

Pragmatically, for this particular table I think it's better to write the column names by hand. Otherwise you will need a lot of string manipulation to get a clean list, and typing 16 names is faster.

Since you have already written most of the code, the easiest and quickest solution is:

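from io import StringIO
import pandas as pd
df = pd.read_html(StringIO(page.text))[0]  # `page` is the response from the question's code; newer pandas versions deprecate passing a raw HTML string
column_names = ['Atomic Number', 'Symbol', 'Name', 'Origin of name', 'Group', 'Period','Block','Standard atomic weight', 'Density', 'Melting Point','Boiling Point','Specific heat capacity','Electro-negativity',"Abundance in Earth's crust",'Origin','Phase at r.t.']
df.columns = column_names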

This gives a nice, readable DataFrame:

     Atomic Number Symbol  ...      Origin  Phase at r.t.
0                1      H  ...  primordial            gas
1                2     He  ...  primordial            gas
2                3     Li  ...  primordial          solid
3                4     Be  ...  primordial          solid
4                5      B  ...  primordial          solid
..             ...    ...  ...         ...            ...
113            114     Fl  ...   synthetic  unknown phase
114            115     Mc  ...   synthetic  unknown phase
115            116     Lv  ...   synthetic  unknown phase
116            117     Ts  ...   synthetic  unknown phase
117            118     Og  ...   synthetic  unknown phase
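Since the hand-written list has to match the table column for column, a small guard can catch a miscount before the assignment. This is only a sketch: `set_column_names` is a hypothetical helper, and the three-column frame stands in for the real table.

```python
import pandas as pd

def set_column_names(df, names):
    """Return a copy of df with the given names, failing early on a count mismatch."""
    if len(names) != df.shape[1]:
        raise ValueError(
            f"expected {df.shape[1]} column names, got {len(names)}"
        )
    df = df.copy()
    df.columns = names
    return df

# Hypothetical stand-in for the parsed table.
raw = pd.DataFrame([[1, "H", "Hydrogen"], [2, "He", "Helium"]])
named = set_column_names(raw, ["Atomic Number", "Symbol", "Name"])
print(named.columns.tolist())
```

pandas raises its own error on a length mismatch as well. The explicit check just makes the message easier to read if Wikipedia adds or removes a column.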

Otherwise, if you want a fully automated approach:

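from io import StringIO
import pandas as pd
import requests
page = requests.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
df = pd.read_html(StringIO(page.text))[0]
df.columns = df.columns.droplevel()  # drop the outermost header level (level 0)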

Output:

Element Origin of name[2][3]    Group   Period  Block   Standardatomicweight[a] Density[b][c]   Melting point[d]    Boiling point[e]    Specificheatcapacity[f] Electro­negativity[g]   Abundancein Earth'scrust[h] Origin[i]   Phase at r.t.[j]
Atomic number.mw-parser-output .nobold{font-weight:normal}Z Symbol  Name    Unnamed: 3_level_2  Unnamed: 4_level_2  Unnamed: 5_level_2  Unnamed: 6_level_2  (Da)    ('"`UNIQ--templatestyles-00000016-QINU`"'g/cm3) (K) (K) (J/g · K)   Unnamed: 12_level_2 (mg/kg) Unnamed: 14_level_2 Unnamed: 15_level_2
0   1   H   Hydrogen    Greek elements hydro- and -gen, 'water-forming' 1.0 1   s-block 1.008   0.00008988  14.01   20.28   14.304  2.20    1400    primordial  gas
1   2   He  Helium  Greek hḗlios, 'sun' 18.0    1   s-block 4.0026  0.0001785   –[k]    4.22    5.193   –   0.008   primordial  gas
2   3   Li  Lithium Greek líthos, 'stone'   1.0 2   s-block 6.94    0.534   453.69  1560    3.582   0.98    20  primordial  solid
3   4   Be  Beryllium   Beryl, a mineral (ultimately from the name of ...   2.0 2   s-block 9.0122  1.85    1560    2742    1.825   1.57    2.8 primordial  solid
4   5   B   Boron   Borax, a mineral (from Arabic bawraq)   13.0    2   p-block 10.81   2.34    2349    4200    1.026   2.04    10  primordial  solid
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113 114 Fl  Flerovium   Flerov Laboratory of Nuclear Reactions, part o...   14.0    7   p-block [289]   (9.928) (200)[b]    (380)   –   –   –   synthetic   unknown phase
114 115 Mc  Moscovium   Moscow, Russia, where the element was first sy...   15.0    7   p-block [290]   (13.5)  (700)   (1400)  –   –   –   synthetic   unknown phase
115 116 Lv  Livermorium Lawrence Livermore National Laboratory in Live...   16.0    7   p-block [293]   (12.9)  (700)   (1100)  –   –   –   synthetic   unknown phase
116 117 Ts  Tennessine  Tennessee, United States, where Oak Ridge Nati...   17.0    7   p-block [294]   (7.2)   (700)   (883)   –   –   –   synthetic   unknown phase
117 118 Og  Oganesson   Yuri Oganessian, Russian physicist  18.0    7   p-block [294]   (7) (325)   (450)   –   –   –   synthetic   unknown phase

The string cleaning needed to make that look tidy is going to take a lot longer than writing out a few column names.
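For a sense of what that cleaning involves, here is a rough sketch that handles only the simplest cases: bracketed footnote markers, soft hyphens, and stray whitespace. The `clean_header` helper is hypothetical, and the sample strings are taken from the outputs shown above.

```python
import re

FOOTNOTE = re.compile(r"\[[^\]]*\]")  # matches markers such as [a], [2], [3]
SOFT_HYPHEN = "\u00ad"                # invisible hyphenation hint in some headers

def clean_header(text: str) -> str:
    """Strip footnote markers and soft hyphens, then collapse whitespace."""
    text = FOOTNOTE.sub("", text)
    text = text.replace(SOFT_HYPHEN, "")
    return " ".join(text.split())

raw = [
    "Origin of name[2][3]\n",
    "Group\n",
    "Density[b][c]",
    "Electro\u00adnegativity[g]",
    "Phase at r.t.[j]",
]
cleaned = [clean_header(h) for h in raw]
print(cleaned)
```

This does nothing for fused words such as "Standardatomicweight", the leaked CSS in the "Atomic number" header, or the "Unnamed: ..." placeholders, so the result would still need manual fixes.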
