Home > Software design >  Table data not returned then using beautiful soup
Table data not returned then using beautiful soup

Time:10-14

I am trying to scrape all the table data reportTable on this website https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD (the website works only in the US) using beautiful soup and it only returns table head and empty body.

I have tried the following code:

url ='https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
usock = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
data = urlopen(usock).read()
soup = BS(data,"html.parser")
report_table=soup.find_all(id='reportTable')

I also tried using selenium and opening the url in chrome like this but still an empty body

chrome.get(url)
data = chrome.page_source
soup = BS(data,"html.parser")
report_table=soup.find_all(id='reportTable')

Does anyone have a workaround for this?

Thanks, R

CodePudding user response:

Data in that table is being hydrated via an XHR call to an API. You can scrape that API directly:

import requests
import pandas as pd

r = requests.get('https://www.ercot.com/misapp/servlets/IceDocListJsonWS?reportTypeId=13103&_=1665672400202')
df = pd.json_normalize(r.json(), record_path=['ListDocsByRptTypeRes', 'DocumentList'])
df['DocUrl'] = df.apply(lambda row: 'https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId='   str(row['Document.DocID']), axis=1)
display(df)

As an extra, I included the url to download the actual document, if you want to.

Result in terminal:

Document.ExpiredDate    Document.ILMStatus  Document.SecurityStatus Document.ContentSize    Document.Extension  Document.ReportTypeID   Document.Prefix Document.FriendlyName   Document.ConstructedName    Document.DocID  Document.PublishDate    Document.ReportName Document.DUNS   Document.DocCount   DocUrl
0   2022-11-13T23:59:59-06:00   EXT P   1335    zip 13103   cdr HRLYRESOUTCAPNP3233_csv cdr.00013103.0000000000000000.20221013.090214639.HRLYRESOUTCAPNP3233_csv.zip    870085714   2022-10-13T09:02:14-05:00   Hourly Resource Outage Capacity 0000000000000000    0   https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=870085714
1   2022-11-13T23:59:59-06:00   EXT P   2444    zip 13103   cdr HRLYRESOUTCAPNP3233_xml cdr.00013103.0000000000000000.20221013.090214585.HRLYRESOUTCAPNP3233_xml.zip    870085713   2022-10-13T09:02:14-05:00   Hourly Resource Outage Capacity 0000000000000000    0   https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=870085713
2   2022-11-13T23:59:59-06:00   EXT P   1337    zip 13103   cdr HRLYRESOUTCAPNP3233_csv cdr.00013103.0000000000000000.20221013.080215293.HRLYRESOUTCAPNP3233_csv.zip    870074851   2022-10-13T08:02:15-05:00   Hourly Resource Outage Capacity 0000000000000000    0   https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=870074851
3   2022-11-13T23:59:59-06:00   EXT P   2455    zip 13103   cdr HRLYRESOUTCAPNP3233_xml cdr.00013103.0000000000000000.20221013.080215212.HRLYRESOUTCAPNP3233_xml.zip    870074731   2022-10-13T08:02:15-05:00   Hourly Resource Outage Capacity 0000000000000000    0   https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=870074731
4   2022-11-13T23:59:59-06:00   EXT P   1331    zip 13103   cdr HRLYRESOUTCAPNP3233_csv cdr.00013103.0000000000000000.20221013.070239337.HRLYRESOUTCAPNP3233_csv.zip    870066327   2022-10-13T07:02:39-05:00   Hourly Resource Outage Capacity 0000000000000000    0   https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=870066327
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1503    2022-10-13T23:59:59-05:00   EXT P   2334    zip 13103   cdr HRLYRESOUTCAPNP3233_xml cdr.00013103.0000000000000000.20220912.020252183.HRLYRESOUTCAPNP3233_xml.zip    863677980   2022-09-12T02:02:52-05:00   Hourly Resource Outage Capacity 0000000000000000    0   https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=863677980
1504    2022-10-13T23:59:59-05:00   EXT P   1187    zip 13103   cdr HRLYRESOUTCAPNP3233_csv cdr.00013103.0000000000000000.20220912.010206931.HRLYRESOUTCAPNP3233_csv.zip    863670249   2022-09-12T01:02:06-05:00   Hourly Resource Outage Capacity 0000000000000000    0   https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=863670249
1505    2022-10-13T23:59:59-05:00   EXT P   2334    zip 13103   cdr HRLYRESOUTCAPNP3233_xml cdr.00013103.0000000000000000.20220912.010206880.HRLYRESOUTCAPNP3233_xml.zip    863670162   2022-09-12T01:02:06-05:00   Hourly Resource Outage Capacity 0000000000000000    0   https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=863670162
1506    2022-10-13T23:59:59-05:00   EXT P   1181    zip 13103   cdr HRLYRESOUTCAPNP3233_csv cdr.00013103.0000000000000000.20220912.000148740.HRLYRESOUTCAPNP3233_csv.zip    863662192   2022-09-12T00:01:48-05:00   Hourly Resource Outage Capacity 0000000000000000    0   https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=863662192
1507    2022-10-13T23:59:59-05:00   EXT P   2332    zip 13103   cdr HRLYRESOUTCAPNP3233_xml cdr.00013103.0000000000000000.20220912.000148691.HRLYRESOUTCAPNP3233_xml.zip    863662191   2022-09-12T00:01:48-05:00   Hourly Resource Outage Capacity 0000000000000000    0   https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=863662191
1508 rows × 15 columns

CodePudding user response:

The problem is that the table is dynamically loaded from java script. Here's version with selenium:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

chrome = webdriver.Chrome()
chrome.get(url)

time.sleep(5) # wait some time to load, you can also use WebDriverWait().until or something 

# in the selenium you don't have to use bs to parse data, it's just 
report_table = chrome.find_element(By.ID, "reportTable")

print(report_table)

Haven't tested this, but it should work.

  • Related