Home > Net >  Using pd.read_html to return a specific table from a webpage of multiple tables
Using pd.read_html to return a specific table from a webpage of multiple tables

Time:02-11

I am trying to return a specific table from this webpage.

I have tried to scrape it using beautifulsoup but that got way too complicated so I an trying to use pd.read_html instead. The table I'm after is the one with "XYZ Substantial Shareholders" (note that this is just one particular page, I would be using this code to return tables from other stocks as well).

This is the code that I'm currently using to search for the tables - my initial idea was to find the word 'Holding' in the header of the title I'm after, but it has also picked up another table which I didn't want because the word 'Holdings' appeared in an earlier table.


# Scrape the Substantial Shareholder table if it exists
sub_table = pd.read_html(current_url, match='Holding')
print(sub_table)

Current output

                    Company   Mkt Cap         2018 A  2019 A   2020 A    2018 A     2019 A    2020 A             2018 A 2019 A 2020 A
0              Altium (ALU)   $4,528M         0.1377  0.3989   0.2353   96.9595    69.3097   56.1095              1.55%  1.28%  1.46%
1  Brainchip Holdings (BRN)   $2,974M         0.6838  0.0000   0.0000    0.0000     0.0000    0.0000              0.00%  0.00%  0.00%
2      Technology One (TNE)   $3,299M         0.1484  0.1679   0.1574   45.3375    38.8213   33.5414              1.36%  1.59%  1.84%
3     Wisetech Global (WTC)  $14,241M         1.2490  0.3624   0.2610  124.5434    91.4118   72.4917              0.15%  0.20%  0.30%
4                Xero (XRO)  $16,610M         2.6697 -0.8109  18.9607  920.3138  4866.8122  243.8197              0.00%  0.00%  0.00%,                 Holding                                        Name
0  176,305,508 (10.82%)                          Peter Van Der Made
1   119,205,377 (7.31%)  Merrill Lynch (Australia) Nominees Pty Ltd
2      114,093,915 (7%)   HSBC Custody Nominees (Australia) Limited]

Process finished with exit code 0

Desired output

Holding                                        Name
0  176,305,508 (10.82%)                          Peter Van Der Made
1   119,205,377 (7.31%)  Merrill Lynch (Australia) Nominees Pty Ltd
2      114,093,915 (7%)   HSBC Custody Nominees (Australia) Limited]

Is there a way for me to specifically match only looking at headers rather than the entire page content?

CodePudding user response:

You simply need to do this:

current_url = 'https://www.intelligentinvestor.com.au/shares/asx-brn/brainchip-holdings-ltd'

sub_table = pd.read_html(current_url, match='Holding')
print(sub_table[1])

Which returns:

          Holding                                        Name
0  176,305,508 (10.82%)                          Peter Van Der Made
1   119,205,377 (7.31%)  Merrill Lynch (Australia) Nominees Pty Ltd
2      114,093,915 (7%)   HSBC Custody Nominees (Australia) Limited
​

while

current_url = 'https://www.intelligentinvestor.com.au/shares/asx-brn/brainchip-holdings-ltd'

sub_table = pd.read_html(current_url, match='Holding')
print(sub_table[0])

gives

               Company   Mkt Cap EPS Growth (%)                   \
                    Company   Mkt Cap         2018 A  2019 A   2020 A   
0              Altium (ALU)   $4,536M         0.1377  0.4001   0.2353   
1  Brainchip Holdings (BRN)   $2,794M         0.6838  0.0000   0.0000   
2      Technology One (TNE)   $3,296M         0.1484  0.1679   0.1574   
3     Wisetech Global (WTC)  $14,456M         1.2490  0.3624   0.2610   
4                Xero (XRO)  $16,684M         2.6697 -0.8101  18.8826   

    P/E (%)                      Dividend Yield (%)                
     2018 A     2019 A    2020 A             2018 A 2019 A 2020 A  
0   97.1284    69.3746   56.1615              1.54%  1.28%  1.46%  
1    0.0000     0.0000    0.0000              0.00%  0.00%  0.00%  
2   45.2931    38.7833   33.5085              1.36%  1.60%  1.84%  
3  126.4269    92.7943   73.5880              0.15%  0.20%  0.29%  
4  924.4426  4867.3913  244.8065              0.00%  0.00%  0.00%  
​
  • Related