I wrote this simple pandas web-scraping code, which is supposed to extract data from this stocks website. However, when I run it, I get "list index out of range", which suggests that pandas found no tables on the page. If you open the website, though, you can clearly see that there are multiple tables. Could anyone explain how I could fix it?
Website link: https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en
import pandas as pd
url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'
dfs = pd.read_html(url)
print(len(dfs))  # number of tables pandas found on the page
print(dfs[0])  # prints the first table
CodePudding user response:
There are some inconsistencies in the tables on that page, from pandas' perspective. Here is one way to get the first table on that page as a dataframe:
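from io import StringIO
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
# Take the first <table> element; narrow the selector (for example by class) to target a different one
spec_table = soup.select('table')[0]
# Parse only the isolated table; newer pandas versions expect literal HTML wrapped in StringIO
df = pd.read_html(StringIO(str(spec_table)))[0]
print(df[:5].to_markdown())  # to_markdown() needs the 'tabulate' package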
This will return the dataframe:
|    | No. | SEHK Code | Underlying Stock Name | HKATS Code | Contract Size (shares) | Number of Board Lots | Tier No.* | Position Limit ## (Effective from 1 April 2022) | Approved by FSC Taiwan |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 16 | Sun Hung Kai Properties Limited | SHK | 1000 | 2 | 1 | 50000 | ✓ |
| 1 | 2 | 175 | Geely Automobile Holdings Ltd. | GAH | 5000 | 5 | 1 | 100000 | ✓ |
| 2 | 3 | 268 | Kingdee International Software Group Co., Ltd. | KDS | 2000 | 2 | 1 | 50000 | nan |
| 3 | 4 | 285 | BYD Electronic International Company Limited | BYE | 1000 | 2 | 1 | 50000 | nan |
| 4 | 5 | 288 | WH Group Ltd. | WHG | 2500 | 5 | 2 | 100000 | nan |
[...]
If you need other tables from the page, isolate them with BeautifulSoup and then read them with pandas. BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Pandas relevant documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
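As a rough sketch of the "isolate each table, then read it with pandas" idea for every table on a page: the helper name tables_to_frames and the sample HTML below are made up for illustration, and the snippet assumes pandas has a working HTML parser installed (lxml, or bs4 with html5lib). On the real page you would pass r.text instead of sample_html.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def tables_to_frames(html):
    """Parse every <table> in `html` separately, skipping tables pandas rejects.

    Hypothetical helper for illustration; not part of pandas or BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for table in soup.find_all("table"):
        try:
            # Hand pandas one isolated table at a time, wrapped in StringIO
            frames.append(pd.read_html(StringIO(str(table)))[0])
        except ValueError:
            # pandas raises ValueError when it finds no usable table here
            continue
    return frames


# Made-up sample markup standing in for the downloaded page
sample_html = """
<p>Intro text</p>
<table><tr><th>Code</th><th>Lots</th></tr><tr><td>SHK</td><td>2</td></tr></table>
<table><tr><th>Code</th><th>Tier</th></tr><tr><td>GAH</td><td>1</td></tr><tr><td>KDS</td><td>1</td></tr></table>
"""

frames = tables_to_frames(sample_html)
for i, frame in enumerate(frames):
    print(i, frame.shape)
```

Parsing the tables one by one means a single table that pandas cannot handle does not stop you from reading the others.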