Why can't Pandas Webscraping print out certain tables from this website?

Time:10-02

Is there a way to also print the second table on this website (the one that starts with CK Hutchison Holdings Ltd.)?

Website link: https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en

This was my code:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd


headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'

r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
spec_table = soup.select('table[]')[1]
df = pd.read_html(str(spec_table))[0]
print(df[:5].to_markdown())

With spec_table = soup.select('table[]')[0], it prints the first table (the one that starts with Sun Hung Kai Properties Limited). With spec_table = soup.select('table[]')[1], it skips the table in between (the one that starts with CK Hutchison Holdings Ltd.) and prints the contract summary table instead.

Could anyone explain how I can print out the second table?

CodePudding user response:

This is one way to isolate and extract that second table from the page:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd


headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'

r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
second_table = soup.select('table:-soup-contains("CK Hutchison Holdings Ltd.")')[0]
df = pd.read_html(str(second_table))[0]
print(df)

Result:

     No.  SEHK Code  Underlying Stock Name                             HKATS Code  Contract Size (shares)  Tier No.*  Position Limit ## (Effective from 1 April 2022)  Approved by FSC Taiwan
0    1    1          CK Hutchison Holdings Ltd.                        CKH         500                     1          50000                                            ✓
1    2    2          CLP Holdings Limited                              CLP         500                     1          50000                                            NaN
2    3    3          The Hong Kong and China Gas Company Limited       HKG         1000                    2          150000                                           NaN
3    4    4          The Wharf (Holdings) Limited                      WHL         1000                    1          50000                                            NaN
4    5    5          HSBC Holdings Plc.                                HKB         400                     2          150000                                           ✓
...  ...  ...        ...                                               ...         ...                     ...        ...                                              ...
63   64   3323       China National Building Material Company Limited  NBM         2000                    2          100000                                           ✓
64   65   3328       Bank of Communications Co., Ltd.                  BCM         1000                    3          150000                                           ✓
65   66   3968       China Merchants Bank Co., Ltd.                    CMB         500                     1          150000                                           ✓
66   67   3988       Bank of China Limited                             XBC         1000                    3          150000                                           ✓
67   68   6862       Haidilao International Holding Ltd.               HDO         1000                    1          100000                                           NaN

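For documentation, refer to my response to your previous question. Also make sure your Beautiful Soup installation is up to date: the :-soup-contains selector comes from the soupsieve package that Beautiful Soup uses for CSS selectors, and it is missing from older releases. To upgrade both, run pip install -U beautifulsoup4 soupsieve.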
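To see what the text-based selector does without hitting the live site, here is a minimal sketch on made-up markup. The class names a and b and the three tiny tables are assumptions for illustration only, not the real HKEX HTML; the point is just that a class-based selector skips a table with a different class, while :-soup-contains finds it by its text.

```python
from bs4 import BeautifulSoup

# Made-up markup (not the real page): three tables, the middle one
# carrying a different class from the other two.
html = """
<table class="a"><tr><td>Sun Hung Kai Properties Limited</td></tr></table>
<table class="b"><tr><td>CK Hutchison Holdings Ltd.</td></tr></table>
<table class="a"><tr><td>Contract summary</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# A class-based selector only returns the tables that share that class,
# so the middle table is never in this list.
by_class = soup.select("table.a")

# :-soup-contains matches any table whose text includes the given string,
# regardless of its class.
by_text = soup.select('table:-soup-contains("CK Hutchison")')

print(len(by_class))        # number of tables matched by class
print(by_text[0]["class"])  # class list of the table matched by text
```

Selecting by text is handy when the markup is inconsistent, but it breaks if the site renames the first company in the table, so treat the search string as something you may need to update.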

CodePudding user response:

Apparently this is caused by inconsistent HTML in the page: the tables look identical when rendered, but each one is marked up differently, as if the page had been written by several people.

Your problem is that this table has a different class from the other tables, so a selector written for the others skips it. Select it by its own class instead:

spec_table = soup.select('table[]')[0]

and you will get the table you are missing.

As a general tip: if you have problems with web scraping, look directly at the page's HTML source. Search it for a string you can see on the rendered page to find the exact element you are interested in, or inspect the element with the browser's developer tools.
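If you would rather check this from Python than from the browser, a rough, dependency-free sketch along these lines lists every table's class next to the text of its first cell. It uses the standard library's html.parser rather than Beautiful Soup, and the TableLister class and the sample markup are made up for illustration (nested tables are not handled).

```python
from html.parser import HTMLParser


class TableLister(HTMLParser):
    """Record each table's class attribute and the text of its first <td>."""

    def __init__(self):
        super().__init__()
        self.tables = []          # one [class_value, first_cell_text] per table
        self._need_first = False  # still waiting for this table's first <td>
        self._in_cell = False     # currently inside that first <td>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([dict(attrs).get("class"), ""])
            self._need_first = True
        elif tag == "td" and self._need_first:
            self._in_cell = True
            self._need_first = False

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][1] += data.strip()


# Made-up markup standing in for r.text from the real request.
sample = """
<table class="a"><tr><td>Sun Hung Kai Properties Limited</td></tr></table>
<table class="b"><tr><td>CK Hutchison Holdings Ltd.</td></tr></table>
"""

lister = TableLister()
lister.feed(sample)
for index, (css_class, first_cell) in enumerate(lister.tables):
    print(index, css_class, first_cell)
```

Feeding it r.text instead of sample should show at a glance which table has the odd class.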

Alternatively, for more elegant and general code (for example, if you need to iterate over all the tables), you can do this:

start = soup.find('p', {'class': 'spanHeading'})
spec_table = start.find_all_next('table')

And then do what you wanted to do before:

df = pd.read_html(str(spec_table[1]))[0]
print(df[:5].to_markdown())
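For the "iterate over all the tables" case, a small sketch of the same idea on made-up markup might look like this. Only the spanHeading class comes from the snippet above; the other class names and the table contents are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup (not the real page): one table before the heading,
# two tables with different classes after it.
html = """
<table class="nav"><tr><td>Menu</td></tr></table>
<p class="spanHeading">Stock options</p>
<table class="a"><tr><td>Sun Hung Kai Properties Limited</td></tr></table>
<table class="b"><tr><td>CK Hutchison Holdings Ltd.</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
start = soup.find("p", {"class": "spanHeading"})

# find_all_next returns every <table> after `start` in document order,
# whatever its class, so the table before the heading is left out.
first_cells = [
    table.find("td").get_text(strip=True)
    for table in start.find_all_next("table")
]
print(first_cells)
```

On the real page you would pass each table to pd.read_html inside the loop instead of reading its first cell.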