Home > Enterprise >  Transforming Python Dataframe to combine column and row headers into new table
Transforming Python Dataframe to combine column and row headers into new table

Time:10-20

Is there a way to combine a column and a row to make a new column where a value will be stored? I have a dataframe below that I would like to format to achieve the output at the bottom. Any help would be appreciated. The following table is extracted using beautifulsoup, pandas and selenium. I'll use the sixth df as an example.

Current code


s = Service(r'C:\Users\user34\Documents\Python\chromedriver.exe')
driver = webdriver.Chrome(service=s)

url = "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet"

driver.maximize_window()
driver.get(url)
time.sleep(5)

soup = BeautifulSoup(driver.page_source, 'lxml')
    
df = pd.read_html(str(soup))

df[6] #output below

   GENERATION   GENERATION.1    GENERATION.2    GENERATION.3
0   GROUP         MC               TNG               DCR
1   GAS          10884              7638             59
2   HYDRO         894               103              229
3   ENERGY STORAGE  70               0                56
4   SOLAR          1088              558              0
5   WIND           2780              170              0
6   OTHER           444              279              0
7   DUAL FUEL        0                0               0
8   COAL            1286             819              0
9   TOTAL           17446            9567            344

I would like to drop the column headers and the group value. The desired output is:

GAS MC   GAS TNG   GAS DCR   HYDRO MC    HYDRO TNG    HYDRO DCR 

10884     7638       59        894         103           229

Then extended for the rest of the data, I couldn't fit the whole desired output. Help with this would be greatly appreciated.

CodePudding user response:

Probably something like this (without the overheads of selenium) would be what you're after?

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs


pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

url = 'http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet'

r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
table = soup.select('table[border="1"]')[1]
df = pd.read_html(str(table), skiprows=1, index_col=['GROUP'])[0]
s = df.stack()
result_df = pd.DataFrame([s.values],  columns=[f'{i} {j}' for i, j in s.index])
print(result_df)

Result in terminal:

GAS MC GAS TNG GAS DCR HYDRO MC HYDRO TNG HYDRO DCR ENERGY STORAGE MC ENERGY STORAGE TNG ENERGY STORAGE DCR SOLAR MC SOLAR TNG SOLAR DCR WIND MC WIND TNG WIND DCR OTHER MC OTHER TNG OTHER DCR DUAL FUEL MC DUAL FUEL TNG DUAL FUEL DCR COAL MC COAL TNG COAL DCR TOTAL MC TOTAL TNG TOTAL DCR
0 10884 7677 59 894 174 229 70 0 56 1088 509 0 2780 203 0 444 276 0 0 0 0 1286 809 0 17446 9648 344

Pandas documentation: https://pandas.pydata.org/docs/

Also requests: https://requests.readthedocs.io/en/latest/

And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/

  • Related