How do I extract a specific column from a dataset using pandas that's imported from an HTML file?

Time:09-30

import requests
import os
import pandas as pd
from bs4 import BeautifulSoup

#Importing html
df = pd.read_html(os.path.expanduser("~/Documents/HTMLSpider/HTMLSpider_test/spotgamma.html"))
print (df['Latest Data'])

All of the documentation I can find online states that extracting a specific column from a dataset requires you to specify the name of the column header in square brackets, yet this raises a TypeError when I try it:

>
    print (df['Latest Data'])
TypeError: list indices must be integers or slices, not str

If you're curious as to what the dataset looks like without trying to specify the column:

     SpotGamma Proprietary Levels Latest Data  ...    NDX    QQQ
0                        Ref Price:        4465  ...  15283    372
1        SpotGamma Imp. 1 Day Move:      0.91%,  ...    NaN    NaN
2        SpotGamma Imp. 5 Day Move:       2.11%  ...    NaN    NaN
3           SpotGamma Gamma Index™:        0.48  ...   0.04  -0.08
4              Volatility Trigger™:        4415  ...  15075    373
5  SpotGamma Absolute Gamma Strike:        4450  ...  15500    370
6               Gamma Notional(MM):        $157  ...     $4  $-397

CodePudding user response:

Note that

df = pd.read_html(os.path.expanduser("~/Documents/HTMLSpider/HTMLSpider_test/spotgamma.html"))

returns a list of DataFrames, not a single one. Indexing a list with a string such as 'Latest Data' is what raises the TypeError.

See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html ("Read HTML tables into a list of DataFrame objects.")
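The error can be reproduced without any HTML file. The sketch below wraps a small hand-built DataFrame in a list as a stand-in for what pd.read_html returns (the variable names and the sample value are only for illustration) and shows that the string index fails on the list but works on the DataFrame inside it.

```python
import pandas as pd

# Stand-in for the return value of pd.read_html(...): a list of DataFrames.
tables = [pd.DataFrame({"Latest Data": ["4465"]})]

try:
    tables["Latest Data"]  # indexing the list itself with a string
except TypeError as exc:
    message = str(exc)

print(message)  # list indices must be integers or slices, not str

# Indexing a DataFrame taken out of the list works as documented.
column = tables[0]["Latest Data"]
print(column.tolist())  # ['4465']
```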

It is clearer to store the result under a name that reflects this:

ldf = pd.read_html(os.path.expanduser("~/Documents/HTMLSpider/HTMLSpider_test/spotgamma.html"))

and then

df = ldf[0]  # replace 0 with the index of the DataFrame you want

to get the first DataFrame. There may be more than one: check len(ldf) to see how many tables were parsed and which one has the column you need. After that, df['Latest Data'] returns the column as expected.
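If the page contains several tables, picking the right one by column name can be less fragile than hard-coding an index. Here is a minimal sketch of that idea; first_table_with_column is a hypothetical helper, not part of pandas, and the list of tables is built by hand (values copied from the output in the question, the first table made up) so the example runs without the HTML file. In real use, tables would be the result of pd.read_html(path).

```python
import pandas as pd


def first_table_with_column(tables, column):
    """Return the first DataFrame in `tables` that has a column named `column`.

    Hypothetical helper for illustration; not a pandas function.
    """
    for table in tables:
        if column in table.columns:
            return table
    raise KeyError(f"no table has a column named {column!r}")


# Stand-in for the list returned by pd.read_html(path).
tables = [
    pd.DataFrame({"Unrelated": ["x", "y"]}),  # made-up extra table
    pd.DataFrame({
        "SpotGamma Proprietary Levels": ["Ref Price:", "Volatility Trigger™:"],
        "Latest Data": ["4465", "4415"],
    }),
]

print(len(tables))  # how many tables were parsed: 2

df = first_table_with_column(tables, "Latest Data")
latest = df["Latest Data"]
print(latest.tolist())  # ['4465', '4415']
```

The helper raises a KeyError when no table matches, so a misspelled column name fails immediately rather than silently selecting the wrong table.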
