Home > database >  Pandas Dataframe: Slice a part of read_html table into a dataframe
Pandas Dataframe: Slice a part of read_html table into a dataframe

Time:01-30

I want to import a portion of html table into a dataframe.

Here is the table: In the below, I want to import only "Total Electric Industry" enter image description here

I am running below code in the Google colab:

# Total residential customers
us_res_html = 'https://www.eia.gov/electricity/annual/html/epa_02_01.html'
us_res = pd.read_html(us_res_html)
us_res = us_res[1]
print(type(us_res)) 
print(us_res.columns)
print(us_res.dtypes)
# Consider only the total electric industry table
# The table starts with "Total Electric Industry"
# Next table starts with "Full-Service Providers"
# These texts were repeated in all the columns; Hence, search in the first column
start_idx = us_res[us_res[us_res.columns[0]]=="Total Electric Industry"].index 
end_idx = us_res[us_res[us_res.columns[0]]=="Full-Service Providers"].index 

Present output:

Int64Index([], dtype='int64')

CodePudding user response:

First remove non breaking spaces \xa0 from first column:

print (us_res.iloc[:, 0].head().tolist())
['Total\xa0Electric\xa0Industry', '2011', '2012', '2013', '2014']

us_res.iloc[:, 0] = us_res.iloc[:, 0].replace(r'\xa0',' ', regex=True)

Then is possible compare first column by DataFrame.iloc and Series.cummax for all values after first match, for invert mask use ~ and filter in boolean indexing, last remove first row by position by iloc with casting to integers:

df = (us_res[us_res.iloc[:, 0].eq("Total Electric Industry").cummax() 
             & ~us_res.iloc[:, 0].eq("Full-Service Providers").cummax()]
          .iloc[1:].astype(int))
print (df)
    Year  Residential  Commercial  Industrial  Transportation      Total
1   2011    126143072    17638062      727920              92  144509146
2   2012    126832343    17729029      732385              83  145293840
3   2013    127777153    17679562      831790              75  146288580
4   2014    128680416    17853995      839212              79  147373702
5   2015    129811718    17985690      835536              78  148633022
6   2016    131068760    18148353      838059              86  150055258
7   2017    132579747    18359427      840329              86  151779589
8   2018    133893321    18605393      840321              83  153339118
9   2019    135249616    18694240      954222              83  154898161
10  2020    136682001    18848813      992311              83  156523208
11  2021    138308772    19102304     1022212              82  158433370

print (df.dtypes)
Year              int32
Residential       int32
Commercial        int32
Industrial        int32
Transportation    int32
Total             int32
dtype: object
  • Related