Transforming Python Dataframe to combine column and row headers into new table-CodePudding

Is there a way to combine a column and a row to make a new column where a value will be stored? I have a dataframe below that I would like to format to achieve the output at the bottom. Any help would be appreciated. The following table is extracted using beautifulsoup, pandas and selenium. I'll use the sixth df as an example.

Current code


s = Service(r'C:\Users\user34\Documents\Python\chromedriver.exe')
driver = webdriver.Chrome(service=s)

url = "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet"

driver.maximize_window()
driver.get(url)
time.sleep(5)

soup = BeautifulSoup(driver.page_source, 'lxml')
    
df = pd.read_html(str(soup))

df[6] #output below

   GENERATION   GENERATION.1    GENERATION.2    GENERATION.3
0   GROUP         MC               TNG               DCR
1   GAS          10884              7638             59
2   HYDRO         894               103              229
3   ENERGY STORAGE  70               0                56
4   SOLAR          1088              558              0
5   WIND           2780              170              0
6   OTHER           444              279              0
7   DUAL FUEL        0                0               0
8   COAL            1286             819              0
9   TOTAL           17446            9567            344

I would like to drop the column headers and the group value. The desired output is:

GAS MC   GAS TNG   GAS DCR   HYDRO MC    HYDRO TNG    HYDRO DCR 

10884     7638       59        894         103           229

Then extended for the rest of the data, I couldn't fit the whole desired output. Help with this would be greatly appreciated.

CodePudding user response：

Probably something like this (without the overheads of selenium) would be what you're after?

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs


pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

url = 'http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet'

r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
table = soup.select('table[border="1"]')[1]
df = pd.read_html(str(table), skiprows=1, index_col=['GROUP'])[0]
s = df.stack()
result_df = pd.DataFrame([s.values],  columns=[f'{i} {j}' for i, j in s.index])
print(result_df)

Result in terminal:

	GAS MC	GAS TNG	GAS DCR	HYDRO MC	HYDRO TNG	HYDRO DCR	ENERGY STORAGE MC	ENERGY STORAGE TNG	ENERGY STORAGE DCR	SOLAR MC	SOLAR TNG	SOLAR DCR	WIND MC	WIND TNG	WIND DCR	OTHER MC	OTHER TNG	OTHER DCR	DUAL FUEL MC	DUAL FUEL TNG	DUAL FUEL DCR	COAL MC	COAL TNG	COAL DCR	TOTAL MC	TOTAL TNG	TOTAL DCR
0	10884	7677	59	894	174	229	70	0	56	1088	509	0	2780	203	0	444	276	0	0	0	0	1286	809	0	17446	9648	344

Pandas documentation: https://pandas.pydata.org/docs/

Also requests: https://requests.readthedocs.io/en/latest/

And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/