Scraping data from webpage with data download delay

Time:09-24

I have looked at a couple of questions on this site about this problem, but I can't get their solutions working. I am using Python and Selenium with a headless Chrome browser to scrape bond data from Vanguard. Vanguard loads the data onto the page after a delay, and I can't figure out how to wait for it before reading the page.

I am trying to load data from this webpage, specifically the data from the fund facts table.

When I try this the way I typically do, I get:

<iframe data-delayed-src="https://fls.doubleclick.net/activityi;src=844392;u7=vgmf;type=remar743;cat=mutua911;u1=prd;ord=1632433243910?" id="floodIframe" src="https://fls.doubleclick.net/activityi;src=844392;u7=vgmf;type=remar743;cat=mutua911;u1=prd;ord=1632433243910?"></iframe>

So I tried using this line of code to get the browser to wait until the data is loaded.

WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "data-ng-class")))

I am sure this is on the right track, but I don't know how to tell which element I should be waiting for, or whether I am doing it correctly. Is there a way to wait until the iframe data-delayed-src element goes away before getting the data?

I have seen it used with By.ID, but none of the elements in the HTML I want have an id.

Here is the code I am using

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import os

dirname = os.path.dirname(__file__)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options, executable_path=os.path.join(dirname, 'chromedriver'))
symbol = 'vbirx'
url_vanguard = 'https://investor.vanguard.com/mutual-funds/profile/overview/{}'
browser.get(url_vanguard.format(symbol))
# WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "data-ng-class")))

html = browser.page_source
mySoup = BeautifulSoup(html, 'html.parser')
htmlData = mySoup.find('table',{'role':'presentation'})
table = htmlData.find('tbody')
print('table: \n',table)

The table prints without any of the data I want, like this:

 <tbody>
<!-- ngRepeat: item in genericTableData.items -->
</tbody>

CodePudding user response:

I used the XPath of the Fund facts table in the WebDriverWait statement to get it working.

Code snippet:

symbol = 'vbirx'
url_vanguard = 'https://investor.vanguard.com/mutual-funds/profile/overview/{}'
browser.get(url_vanguard.format(symbol))

# Wait for the fund facts table to load
WebDriverWait(browser, 15).until(EC.presence_of_element_located((By.XPATH,'//*[@class="summary-table historical-table col2Wide"]')))

html = browser.page_source
mySoup = BeautifulSoup(html, 'html.parser')
htmlData = mySoup.find('table',{'role':'presentation'})
table = htmlData.find('tbody')
print('table: \n',table)
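
As a side note on the original attempt: the string "data-ng-class" passed with By.CSS_SELECTOR is read as a tag name, so it never matches; an attribute selector would be written "[data-ng-class]". Also, waiting for the table element alone may not guarantee that Angular has already filled in its rows. A minimal sketch of a stricter wait is below. It relies on the fact that WebDriverWait.until accepts any callable that takes the driver and returns a truthy value when ready. The helper name table_has_rows and the selector in the usage note are assumptions for illustration, not something taken from the Vanguard page.

```python
def table_has_rows(css_selector, min_rows=1):
    """Build a wait condition for WebDriverWait.until (hypothetical helper).

    The returned callable yields the matched elements once at least
    `min_rows` of them exist, and False until then.
    """
    def condition(driver):
        # "css selector" is the string value behind By.CSS_SELECTOR
        rows = driver.find_elements("css selector", css_selector)
        return rows if len(rows) >= min_rows else False

    return condition
```

With the imports from the question, it could be used as WebDriverWait(browser, 15).until(table_has_rows("table[role='presentation'] tbody tr")) before reading browser.page_source. The selector here is guessed from the role="presentation" lookup in the BeautifulSoup code, so it may need adjusting to the actual markup.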