scrape data to store into pandas dataframe

Time:11-14

I am trying to scrape the table "List of chemical elements" from this website https://en.wikipedia.org/wiki/List_of_chemical_elements

I then want to store the table data in a pandas DataFrame so that I can convert it to a CSV file. So far I have scraped the table headers and stored them in a DataFrame. I have also managed to retrieve each individual row of data from the table. However, I am having trouble storing the row data in the DataFrame. Below is what I have so far:

from bs4 import BeautifulSoup
import requests as r
import pandas as pd

response = r.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
wiki_text = response.text
soup = BeautifulSoup(wiki_text, 'html.parser')

table = soup.select_one('table.wikitable')

table_body = table.find('tbody')
#print(table_body)

rows = table_body.find_all('tr')

cols = [c.text.replace('\n', '') for c in rows[1].find_all('th')]

df2a = pd.DataFrame(columns = cols)
df2a

for row in rows:
    records = row.find_all('td')
    if records != []:
        records = [r.text.strip() for r in records]
        print(records)

CodePudding user response:

First, find the header rows. The column headers are split across two rows of the table, so collect the header cells of the first and second row separately:

all_columns=soup.find_all("tr",attrs={"style":"vertical-align:top"})
first_column_data=[i.get_text(strip=True) for i in all_columns[0].find_all("th")]
second_column_data=[i.get_text(strip=True) for i in all_columns[1].find_all("th")]

We need 16 columns, so take the appropriate headers from each row and combine them into new_lst, the list of column names:

new_lst=[]
new_lst.extend(second_column_data[:3])
new_lst.extend(first_column_data[1:])

Next, collect the row data: iterate over every tr with the class anchor, extract the text of its td cells, and append each resulting list to main_lst. Finally, build the DataFrame from main_lst using new_lst as the column names:

main_lst=[]
for i in soup.find_all("tr",attrs={"class":"anchor"}):
    row_data=[row.get_text(strip=True) for row in i.find_all("td")]
    main_lst.append(row_data)

df=pd.DataFrame(main_lst,columns=new_lst)
df

Output:

Atomic numberZ  Symbol  Name    Origin of name[2][3]    Group   Period  Block   Standardatomicweight[a] Density[b][c]   Melting point[d]    Boiling point[e]    Specificheatcapacity[f] Electro­negativity[g]   Abundancein Earth'scrust[h] Origin[i]   Phase atr.t.[j]
0   1   H   Hydrogen    Greekelementshydro-and-gen, 'water-forming' 1   1   s-block 1.008   0.00008988  14.01   20.28   14.304  2.20    1400    primordial  gas
....
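
Since the goal in the question is a CSV file, here is a minimal sketch of that last step. It assumes a header list and a list of rows shaped like new_lst and main_lst above. The helper name rows_to_frame and the three-column sample data are illustrative stand-ins, not part of the answer. The length check is only a precaution in case some rows come back with a different number of cells than the header.

```python
import io

import pandas as pd


def rows_to_frame(headers, rows):
    """Build a DataFrame, keeping only rows whose length matches the headers.

    Hypothetical helper for illustration; not part of the answer above.
    """
    good_rows = [row for row in rows if len(row) == len(headers)]
    return pd.DataFrame(good_rows, columns=headers)


# Illustrative stand-ins for new_lst and main_lst (cut down to three columns).
headers = ["Atomic number", "Symbol", "Name"]
rows = [
    ["1", "H", "Hydrogen"],
    ["2", "He", "Helium"],
    ["3", "Li"],  # deliberately short row, dropped by the length check
]

df = rows_to_frame(headers, rows)

# Write to an in-memory buffer here; pass a path such as 'file.csv'
# to to_csv instead to save the result to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Silently dropping mismatched rows is a design choice made for brevity; in real use it may be better to log or inspect them, since they usually point to merged or missing cells in the source table.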

CodePudding user response:

Let pandas parse it for you:

import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_chemical_elements')[0]
df.to_csv('file.csv', index=False)
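
One thing to watch for: when a table has more than one header row, read_html can return a DataFrame whose columns are a MultiIndex, which produces a multi-line header in the CSV. The sketch below shows one possible way to flatten such labels before saving. The frame is built by hand as a stand-in (the real labels on the Wikipedia page may differ), and flatten_label is a hypothetical helper name.

```python
import pandas as pd

# Hand-built stand-in for a frame with a two-level header.
# The labels are assumptions for illustration, not the page's real headers.
columns = pd.MultiIndex.from_tuples([
    ("Atomic number", "Z"),
    ("Symbol", "Symbol"),
    ("Name", "Name"),
])
df = pd.DataFrame([[1, "H", "Hydrogen"], [2, "He", "Helium"]], columns=columns)


def flatten_label(label):
    """Join the distinct levels of one column label into a single string."""
    parts = []
    for part in label:
        if part not in parts:  # skip levels that merely repeat an earlier one
            parts.append(part)
    return " ".join(parts)


# Iterating over a MultiIndex yields one tuple per column.
df.columns = [flatten_label(label) for label in df.columns]
print(df.columns.tolist())
```

After this step the frame has ordinary string column names, so to_csv writes a single header line.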