I'm trying to extract just the header values from a Wikipedia table into a list. The code below is what I have so far, but it doesn't produce the output I expect.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find('table')
column_names = [item.get_text() for item in table.find_all('th')]
column_names[2:18]
# current output: ['Origin of name[2][3]\n', 'Group\n','Period\n', 'Block\n' ...]
# expected output: ['Atomic Number', 'Symbol', 'Name', 'Origin of name',
# 'Group', 'Period', 'Standard atomic weight', 'Density',
# 'Melting Point'...]
CodePudding user response:
I believe you need to do some data cleaning because of how the HTML is structured. The table has a MultiIndex header (several stacked header rows), so you won't get a flat list of column names directly. Note that pandas has the read_html() function, which takes the raw HTML and does the parsing for you, so you don't need to call BeautifulSoup or walk the HTML yourself.
Thinking pragmatically, I believe that in this particular case it's better to write the column names by hand; otherwise you will need a lot of string manipulation to get a clean list. Typing them out is faster.
Since you have already written out most of the names, the easier and more time-efficient solution I recommend is:
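import pandas as pd
# read_html returns a list of DataFrames; the first one is the elements table
df = pd.read_html(page.text)[0]
column_names = ['Atomic Number', 'Symbol', 'Name', 'Origin of name', 'Group', 'Period', 'Block', 'Standard atomic weight', 'Density', 'Melting Point', 'Boiling Point', 'Specific heat capacity', 'Electro-negativity', "Abundance in Earth's crust", 'Origin', 'Phase at r.t.']
# replace the multi-level header with the flat, hand-written names
df.columns = column_names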
This gives a nice, readable DataFrame:
Atomic Number Symbol ... Origin Phase at r.t.
0 1 H ... primordial gas
1 2 He ... primordial gas
2 3 Li ... primordial solid
3 4 Be ... primordial solid
4 5 B ... primordial solid
.. ... ... ... ... ...
113 114 Fl ... synthetic unknown phase
114 115 Mc ... synthetic unknown phase
115 116 Lv ... synthetic unknown phase
116 117 Ts ... synthetic unknown phase
117 118 Og ... synthetic unknown phase
Otherwise, if you want a fully automated approach:
page = requests.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
df = pd.read_html(page.text)[0]
df.columns = df.columns.droplevel()  # drops the top header level; two levels remain
Outputs:
Element Origin of name[2][3] Group Period Block Standardatomicweight[a] Density[b][c] Melting point[d] Boiling point[e] Specificheatcapacity[f] Electronegativity[g] Abundancein Earth'scrust[h] Origin[i] Phase at r.t.[j]
Atomic number.mw-parser-output .nobold{font-weight:normal}Z Symbol Name Unnamed: 3_level_2 Unnamed: 4_level_2 Unnamed: 5_level_2 Unnamed: 6_level_2 (Da) ('"`UNIQ--templatestyles-00000016-QINU`"'g/cm3) (K) (K) (J/g · K) Unnamed: 12_level_2 (mg/kg) Unnamed: 14_level_2 Unnamed: 15_level_2
0 1 H Hydrogen Greek elements hydro- and -gen, 'water-forming' 1.0 1 s-block 1.008 0.00008988 14.01 20.28 14.304 2.20 1400 primordial gas
1 2 He Helium Greek hḗlios, 'sun' 18.0 1 s-block 4.0026 0.0001785 –[k] 4.22 5.193 – 0.008 primordial gas
2 3 Li Lithium Greek líthos, 'stone' 1.0 2 s-block 6.94 0.534 453.69 1560 3.582 0.98 20 primordial solid
3 4 Be Beryllium Beryl, a mineral (ultimately from the name of ... 2.0 2 s-block 9.0122 1.85 1560 2742 1.825 1.57 2.8 primordial solid
4 5 B Boron Borax, a mineral (from Arabic bawraq) 13.0 2 p-block 10.81 2.34 2349 4200 1.026 2.04 10 primordial solid
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113 114 Fl Flerovium Flerov Laboratory of Nuclear Reactions, part o... 14.0 7 p-block [289] (9.928) (200)[b] (380) – – – synthetic unknown phase
114 115 Mc Moscovium Moscow, Russia, where the element was first sy... 15.0 7 p-block [290] (13.5) (700) (1400) – – – synthetic unknown phase
115 116 Lv Livermorium Lawrence Livermore National Laboratory in Live... 16.0 7 p-block [293] (12.9) (700) (1100) – – – synthetic unknown phase
116 117 Ts Tennessine Tennessee, United States, where Oak Ridge Nati... 17.0 7 p-block [294] (7.2) (700) (883) – – – synthetic unknown phase
117 118 Og Oganesson Yuri Oganessian, Russian physicist 18.0 7 p-block [294] (7) (325) (450) – – – synthetic unknown phase
And the string cleaning needed to make it look nice and tidy is going to take a lot longer than writing a few column names.
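For reference, here is a minimal sketch of one piece of that string cleaning: stripping the bracketed footnote markers and stray newlines from header text. The helper name `clean_header` and the `FOOTNOTE` pattern are my own (hypothetical, not from pandas or BeautifulSoup), and the sample strings are copied from the outputs shown above.

```python
import re

# Matches bracketed footnote markers such as "[a]" or "[2]" (assumed pattern).
FOOTNOTE = re.compile(r"\[[^\]]*\]")

def clean_header(text: str) -> str:
    """Remove footnote markers and collapse whitespace in one header cell."""
    without_notes = FOOTNOTE.sub("", text)
    # split()/join trims the trailing "\n" and squeezes repeated spaces
    return " ".join(without_notes.split())

# Sample header strings taken from the outputs above
raw_headers = ["Origin of name[2][3]\n", "Group\n", "Density[b][c]", "Phase at r.t.[j]"]
cleaned = [clean_header(h) for h in raw_headers]
print(cleaned)  # ['Origin of name', 'Group', 'Density', 'Phase at r.t.']
```

This only handles the footnote markers. It does not fix run-together names like "Standardatomicweight" or the style residue in the "Atomic number" header, which is why I still think typing the names out is the quicker route here.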