Why can't Pandas Web Scraping print out any tables from this website?

I wrote this simple pandas web-scraping code, which is supposed to extract data from this stocks website. However, when I run it, I get "list index out of range", which suggests that no tables were found on the page. If you open the website, though, you can clearly see that there are multiple tables. Could anyone explain how I could fix this?

Website link: https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en

import pandas as pd

url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'
dfs = pd.read_html(url)

print(len(dfs))  # Number of tables pandas found on the page

print(dfs[0])  # Prints the first table

CodePudding user response:

There are some inconsistencies in the tables on that page, from pandas' perspective. Here is one way to get the first table on that page as a dataframe:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd


headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'

r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
spec_table = soup.select('table')[0]  # first <table> element on the page
df = pd.read_html(str(spec_table))[0]
print(df[:5].to_markdown())

This prints the first five rows of the dataframe:

	No.	SEHK Code	Underlying Stock Name	HKATS Code	Contract Size (shares)	Number of Board Lots	Tier No.*	Position Limit ## (Effective from 1 April 2022)	Approved by FSC Taiwan
0	1	16	Sun Hung Kai Properties Limited	SHK	1000	2	1	50000	✓
1	2	175	Geely Automobile Holdings Ltd.	GAH	5000	5	1	100000	✓
2	3	268	Kingdee International Software Group Co., Ltd.	KDS	2000	2	1	50000	nan
3	4	285	BYD Electronic International Company Limited	BYE	1000	2	1	50000	nan
4	5	288	WH Group Ltd.	WHG	2500	5	2	100000	nan

[...]

If you need other tables from the page, just isolate them with BeautifulSoup and then read them with pandas. BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/index.html

Pandas relevant documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
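
To illustrate the "isolate each table with BeautifulSoup, then read it with pandas" idea for more than one table, here is a minimal sketch. The helper name tables_to_dataframes and the sample markup are hypothetical and only stand in for the real page; with the actual site you would pass r.text from the request above instead. It assumes a parser backend that pandas.read_html can use (lxml, or html5lib with bs4) is installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def tables_to_dataframes(html):
    """Return one DataFrame per <table> element that pandas can parse.

    Hypothetical helper for illustration; not part of pandas or BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for table in soup.find_all("table"):
        try:
            # Parse each table on its own so that one table pandas cannot
            # read does not prevent the others from being returned.
            frames.append(pd.read_html(StringIO(str(table)))[0])
        except ValueError:
            # pandas raises ValueError when it finds no usable table.
            continue
    return frames


# Hypothetical sample markup standing in for the downloaded page (r.text).
sample_html = """
<table><tr><th>Code</th><th>Name</th></tr><tr><td>SHK</td><td>Sun Hung Kai</td></tr></table>
<p>Some text between the tables.</p>
<table><tr><th>Tier</th><th>Limit</th></tr><tr><td>1</td><td>50000</td></tr><tr><td>2</td><td>100000</td></tr></table>
"""

frames = tables_to_dataframes(sample_html)
for i, frame in enumerate(frames):
    print(f"Table {i}: {frame.shape[0]} rows x {frame.shape[1]} columns")
```

Note that find_all('table') also returns nested tables, so on a page with tables inside tables you may want a narrower selector for the ones you actually need.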