I want to import a portion of html table into a dataframe.
Here is the table: In the below, I want to import only "Total Electric Industry"
I am running below code in the Google colab:
# Total residential customers
us_res_html = 'https://www.eia.gov/electricity/annual/html/epa_02_01.html'
us_res = pd.read_html(us_res_html)
us_res = us_res[1]
print(type(us_res))
print(us_res.columns)
print(us_res.dtypes)
# Consider only the total electric industry table
# The table starts with "Total Electric Industry"
# Next table starts with "Full-Service Providers"
# These texts were repeated in all the columns; Hence, search in the first column
start_idx = us_res[us_res[us_res.columns[0]]=="Total Electric Industry"].index
end_idx = us_res[us_res[us_res.columns[0]]=="Full-Service Providers"].index
Present output:
Int64Index([], dtype='int64')
CodePudding user response:
First remove non breaking spaces \xa0
from first column:
print (us_res.iloc[:, 0].head().tolist())
['Total\xa0Electric\xa0Industry', '2011', '2012', '2013', '2014']
us_res.iloc[:, 0] = us_res.iloc[:, 0].replace(r'\xa0',' ', regex=True)
Then is possible compare first column by DataFrame.iloc
and Series.cummax
for all values after first match, for invert mask use ~
and filter in boolean indexing
, last remove first row by position by iloc
with casting to integers:
df = (us_res[us_res.iloc[:, 0].eq("Total Electric Industry").cummax()
& ~us_res.iloc[:, 0].eq("Full-Service Providers").cummax()]
.iloc[1:].astype(int))
print (df)
Year Residential Commercial Industrial Transportation Total
1 2011 126143072 17638062 727920 92 144509146
2 2012 126832343 17729029 732385 83 145293840
3 2013 127777153 17679562 831790 75 146288580
4 2014 128680416 17853995 839212 79 147373702
5 2015 129811718 17985690 835536 78 148633022
6 2016 131068760 18148353 838059 86 150055258
7 2017 132579747 18359427 840329 86 151779589
8 2018 133893321 18605393 840321 83 153339118
9 2019 135249616 18694240 954222 83 154898161
10 2020 136682001 18848813 992311 83 156523208
11 2021 138308772 19102304 1022212 82 158433370
print (df.dtypes)
Year int32
Residential int32
Commercial int32
Industrial int32
Transportation int32
Total int32
dtype: object