Why can't Pandas Web Scraping print out any tables from this website?

Time:09-12

I wrote this simple pandas web-scraping code, which is supposed to extract data from this stocks website. However, when I run it, it says "list index out of range", which suggests that no tables were found on the page. If you open the website, though, you can clearly see that there are multiple tables. Could anyone explain how I could fix it?

Website link: https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en

import pandas as pd

url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'
dfs = pd.read_html(url)

print(len(dfs))  # number of tables found on the page (not a row count)

print(dfs[0])  # prints the first table

CodePudding user response:

There are some inconsistencies in the tables on that page, from pandas' perspective. Here is one way to get the first table on that page as a DataFrame:

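from io import StringIO

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'

r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
# The attribute filter of the original selector ('table[]') is missing, and
# 'table[]' is not valid CSS. A plain 'table' selector matches every <table>
# element; [0] takes the first one.
spec_table = soup.select('table')[0]
# Newer pandas versions deprecate passing a literal HTML string, so wrap it in StringIO
df = pd.read_html(StringIO(str(spec_table)))[0]
print(df[:5].to_markdown())  # to_markdown() requires the tabulate package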

This will return the dataframe:

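|    | No. | SEHK Code | Underlying Stock Name | HKATS Code | Contract Size (shares) | Number of Board Lots | Tier No.* | Position Limit ## (Effective from 1 April 2022) | Approved by FSC Taiwan |
|---:|----:|----------:|:----------------------|:-----------|-----------------------:|---------------------:|----------:|------------------------------------------------:|:-----------------------|
|  0 |   1 |        16 | Sun Hung Kai Properties Limited | SHK | 1000 | 2 | 1 |  50000 |     |
|  1 |   2 |       175 | Geely Automobile Holdings Ltd. | GAH | 5000 | 5 | 1 | 100000 |     |
|  2 |   3 |       268 | Kingdee International Software Group Co., Ltd. | KDS | 2000 | 2 | 1 |  50000 | nan |
|  3 |   4 |       285 | BYD Electronic International Company Limited | BYE | 1000 | 2 | 1 |  50000 | nan |
|  4 |   5 |       288 | WH Group Ltd. | WHG | 2500 | 5 | 2 | 100000 | nan |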

[...]

If you need other tables from the page, isolate each one with BeautifulSoup and then read it with pandas. BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
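
As a rough sketch of that "isolate, then read" idea, the snippet below loops over every table element and parses each one separately. The helper name tables_to_dataframes and the sample_html string are made up for illustration (sample_html stands in for r.text so the example runs without a network call), and it assumes pandas has an HTML parser such as lxml available. It has not been checked against the live HKEX page.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def tables_to_dataframes(html):
    """Return one DataFrame per <table> element that pandas can parse.

    Hypothetical helper, not part of pandas or BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for table in soup.find_all("table"):
        try:
            # read_html returns a list of DataFrames; take the first one
            # parsed from this isolated snippet
            frames.append(pd.read_html(StringIO(str(table)))[0])
        except ValueError:
            # pandas raises ValueError when it finds no usable table
            continue
    return frames


# Hypothetical stand-in for r.text from the request above
sample_html = """
<table><tr><th>Code</th><th>Size</th></tr><tr><td>SHK</td><td>1000</td></tr></table>
<p>Some text between the tables</p>
<table><tr><th>Tier</th><th>Limit</th></tr><tr><td>1</td><td>50000</td></tr></table>
"""

frames = tables_to_dataframes(sample_html)
for i, frame in enumerate(frames):
    print(f"Table {i}: columns = {list(frame.columns)}")
```

Parsing each table on its own means one malformed table is skipped instead of causing the whole call to fail. With the real page you would pass r.text in place of sample_html and pick the frame you need by index.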

Pandas relevant documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
