I am trying to scrape all the table data reportTable on this website https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD (the website works only in the US) using beautiful soup and it only returns table head and empty body.
I have tried the following code:
url ='https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
usock = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
data = urlopen(usock).read()
soup = BS(data,"html.parser")
report_table=soup.find_all(id='reportTable')
I also tried using selenium and opening the url in chrome like this but still an empty body
chrome.get(url)
data = chrome.page_source
soup = BS(data,"html.parser")
report_table=soup.find_all(id='reportTable')
Does anyone have a workaround for this?
Thanks, R
CodePudding user response:
Data in that table is being hydrated via an XHR call to an API. You can scrape that API directly:
import requests
import pandas as pd
r = requests.get('https://www.ercot.com/misapp/servlets/IceDocListJsonWS?reportTypeId=13103&_=1665672400202')
df = pd.json_normalize(r.json(), record_path=['ListDocsByRptTypeRes', 'DocumentList'])
df['DocUrl'] = df.apply(lambda row: 'https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=' str(row['Document.DocID']), axis=1)
display(df)
As an extra, I included the url to download the actual document, if you want to.
Result in terminal:
Document.ExpiredDate Document.ILMStatus Document.SecurityStatus Document.ContentSize Document.Extension Document.ReportTypeID Document.Prefix Document.FriendlyName Document.ConstructedName Document.DocID Document.PublishDate Document.ReportName Document.DUNS Document.DocCount DocUrl
0 2022-11-13T23:59:59-06:00 EXT P 1335 zip 13103 cdr HRLYRESOUTCAPNP3233_csv cdr.00013103.0000000000000000.20221013.090214639.HRLYRESOUTCAPNP3233_csv.zip 870085714 2022-10-13T09:02:14-05:00 Hourly Resource Outage Capacity 0000000000000000 0 https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=870085714
1 2022-11-13T23:59:59-06:00 EXT P 2444 zip 13103 cdr HRLYRESOUTCAPNP3233_xml cdr.00013103.0000000000000000.20221013.090214585.HRLYRESOUTCAPNP3233_xml.zip 870085713 2022-10-13T09:02:14-05:00 Hourly Resource Outage Capacity 0000000000000000 0 https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=870085713
2 2022-11-13T23:59:59-06:00 EXT P 1337 zip 13103 cdr HRLYRESOUTCAPNP3233_csv cdr.00013103.0000000000000000.20221013.080215293.HRLYRESOUTCAPNP3233_csv.zip 870074851 2022-10-13T08:02:15-05:00 Hourly Resource Outage Capacity 0000000000000000 0 https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=870074851
3 2022-11-13T23:59:59-06:00 EXT P 2455 zip 13103 cdr HRLYRESOUTCAPNP3233_xml cdr.00013103.0000000000000000.20221013.080215212.HRLYRESOUTCAPNP3233_xml.zip 870074731 2022-10-13T08:02:15-05:00 Hourly Resource Outage Capacity 0000000000000000 0 https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=870074731
4 2022-11-13T23:59:59-06:00 EXT P 1331 zip 13103 cdr HRLYRESOUTCAPNP3233_csv cdr.00013103.0000000000000000.20221013.070239337.HRLYRESOUTCAPNP3233_csv.zip 870066327 2022-10-13T07:02:39-05:00 Hourly Resource Outage Capacity 0000000000000000 0 https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=870066327
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1503 2022-10-13T23:59:59-05:00 EXT P 2334 zip 13103 cdr HRLYRESOUTCAPNP3233_xml cdr.00013103.0000000000000000.20220912.020252183.HRLYRESOUTCAPNP3233_xml.zip 863677980 2022-09-12T02:02:52-05:00 Hourly Resource Outage Capacity 0000000000000000 0 https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=863677980
1504 2022-10-13T23:59:59-05:00 EXT P 1187 zip 13103 cdr HRLYRESOUTCAPNP3233_csv cdr.00013103.0000000000000000.20220912.010206931.HRLYRESOUTCAPNP3233_csv.zip 863670249 2022-09-12T01:02:06-05:00 Hourly Resource Outage Capacity 0000000000000000 0 https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=863670249
1505 2022-10-13T23:59:59-05:00 EXT P 2334 zip 13103 cdr HRLYRESOUTCAPNP3233_xml cdr.00013103.0000000000000000.20220912.010206880.HRLYRESOUTCAPNP3233_xml.zip 863670162 2022-09-12T01:02:06-05:00 Hourly Resource Outage Capacity 0000000000000000 0 https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=863670162
1506 2022-10-13T23:59:59-05:00 EXT P 1181 zip 13103 cdr HRLYRESOUTCAPNP3233_csv cdr.00013103.0000000000000000.20220912.000148740.HRLYRESOUTCAPNP3233_csv.zip 863662192 2022-09-12T00:01:48-05:00 Hourly Resource Outage Capacity 0000000000000000 0 https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=863662192
1507 2022-10-13T23:59:59-05:00 EXT P 2332 zip 13103 cdr HRLYRESOUTCAPNP3233_xml cdr.00013103.0000000000000000.20220912.000148691.HRLYRESOUTCAPNP3233_xml.zip 863662191 2022-09-12T00:01:48-05:00 Hourly Resource Outage Capacity 0000000000000000 0 https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=863662191
1508 rows × 15 columns
CodePudding user response:
The problem is that the table is dynamically loaded from java script. Here's version with selenium:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
chrome = webdriver.Chrome()
chrome.get(url)
time.sleep(5) # wait some time to load, you can also use WebDriverWait().until or something
# in the selenium you don't have to use bs to parse data, it's just
report_table = chrome.find_element(By.ID, "reportTable")
print(report_table)
Haven't tested this, but it should work.