Home > Software design >  Extracting data from table using Beautiful Soup where <tr> tags have both <th> and <t
Extracting data from table using Beautiful Soup where <tr> tags have both <th> and <t

Time:10-20

There are some tables I am trying to get data from. I have been able to do this before when the tags have only tags, however in this specific table they are mixed on several rows:

Picture of HTML Element

I was able to extract and tags separately using below, but then it's difficult to put the data back together.

url = 'https://www.pge.com/pipeline/operations/cgt_pipeline_status.page#flows'
res = requests.get(url)
file = BeautifulSoup(res.text, 'lxml')

##################################################################
find_table = file.find('table', class_='supply_demand_table')
rows = find_table.find_all('tr')

lps_td =[]

for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    lps_td.append(data)
    
df_td = pd.DataFrame(lps_td)

lps_th =[]

for i in rows:
    table_data = i.find_all('th')
    data = [j.text for j in table_data]
    lps_th.append(data)
    
df_th = pd.DataFrame(lps_th)

lps_th =[]

Any help on pulling the entire table would be really appreciated. Thanks!

CodePudding user response:

read_html returns a list of dataframes index 5 is the table you want.

import pandas as pd


url = "https://www.pge.com/pipeline/operations/cgt_pipeline_status.page#flows"
df = pd.read_html(url)[5].rename(columns={"Unnamed: 0": ""}).set_index("")
  • Related