How to scrape a specific table with pandas without using its dataframe index?

Time:06-28

I am trying to scrape an HTML table with pandas (and have also tried BeautifulSoup), but I'm running into an issue.

Here is the url: https://ciffc.net/en/ciffc/ext/member/sitrep/

Since the page is dynamic and tables get added or removed daily, relying on the table's position in the list of dataframes returned by pandas is not an option. That said, here is the output I am looking to pull, using today's table index of 7 as an example:

display(df[7].iloc[1,2])

>> 'Yukon is at a level 3 prep level - but will trend upwards with the forecasted hot and dry weather.'

I don't have this issue when scraping tables that have a caption, since I can use the match parameter of pandas.read_html, but this table has no caption. The data in the table is also very dynamic; the only unique element I have been able to identify is the "Comments" column header. Here is my attempt at matching on it:

APLtable = pd.read_html(url, match='Comments')[0].head(14)
display(APLtable)  

Unfortunately this hasn't worked; it raises the following error:

ValueError: No tables found matching pattern 'Comments'

I have also tried BeautifulSoup without success, and was wondering if anyone knows a way to refer to that specific table, given the particularities of the webpage.

Here is the html table in question:

</div></div><div id="section-apl"  data-title="E: Preparedness Levels"><div id="apl_table_wrapper"><table >
 <thead><tr><th >Agency</th><th title="Agency Preparedness Level" >APL</th><th >Comments</th> </tr></thead>
<tbody>
 <tr id="apl-table-row-0" ><td>BC</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-1" ><td>YT</td><td>3</td><td>Yukon is at a level 3 prep level - but will trend upwards with the forecasted hot and dry weather.</td> </tr>
 <tr id="apl-table-row-2" ><td>AB</td><td>2</td><td></td> </tr>
 <tr id="apl-table-row-3" ><td>SK</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-4" ><td>MB</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-5" ><td>ON</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-6" ><td>QC</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-7" ><td>NL</td><td>2</td><td></td> </tr>
 <tr id="apl-table-row-8" ><td>NB</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-9" ><td>NS</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-10" ><td>PE</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-11" ><td>PC</td><td>1</td><td></td> </tr>
</tbody>
</table>

CodePudding user response:

The tables are, IMHO, actually static (the data is present in the initial HTML), and I'd locate the wrapper div by its stable data-title attribute instead of matching on table text:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent, in case the site rejects the default requests one
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}

response = requests.get(
    "https://ciffc.net/en/ciffc/ext/member/sitrep/",
    headers=headers,
)

# Find the wrapper <div> by its data-title attribute, which identifies the
# Preparedness Levels section regardless of how many tables the page has
soup = BeautifulSoup(response.text, "lxml").find(
    "div", {"data-title": "E: Preparedness Levels"}
)

# Hand only that fragment to pandas, so index [0] is always the APL table
df = pd.read_html(str(soup), flavor="lxml")[0]
print(df)

This should consistently output:

   Agency  APL                                           Comments
0      BC    1                                                NaN
1      YT    3  Yukon is at a level 3 prep level - but will tr...
2      AB    2                                                NaN
3      SK    1                                                NaN
4      MB    1                                                NaN
5      ON    1                                                NaN
6      QC    1                                                NaN
7      NL    2                                                NaN
8      NB    1                                                NaN
9      NS    1                                                NaN
10     PE    1                                                NaN
11     PC    1                                                NaN
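Once you have that dataframe, you can also drop the positional lookup (`df[7].iloc[1,2]`) entirely and select the comment by label, which survives rows being added or removed. A minimal sketch, using a two-row reproduction of the table from the question instead of a live request:

```python
import pandas as pd
from io import StringIO

# Minimal reproduction of the APL table from the question (two rows shown)
html = """
<table>
 <thead><tr><th>Agency</th><th>APL</th><th>Comments</th></tr></thead>
 <tbody>
  <tr><td>BC</td><td>1</td><td></td></tr>
  <tr><td>YT</td><td>3</td><td>Yukon is at a level 3 prep level - but will trend upwards with the forecasted hot and dry weather.</td></tr>
 </tbody>
</table>
"""

df = pd.read_html(StringIO(html))[0]

# Label-based lookup: select the Comments cell for the YT row, independent
# of where that row (or the table itself) lands on the page
comment = df.loc[df["Agency"] == "YT", "Comments"].iloc[0]
print(comment)
```

The same `df.loc[...]` pattern works on the dataframe scraped from the live page.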