I have tried the solutions from a couple of questions on this site about this problem, but I can't get them working. I am using Python and Selenium with a headless Chrome browser to scrape bond data from Vanguard. Vanguard loads the data onto the page after a delay, and I can't figure out how to wait for it so that I can read it properly.
I am trying to load data from this webpage, specifically the data in the fund facts table.
When I load the page the way I typically do, I get:
<iframe data-delayed-src="https://fls.doubleclick.net/activityi;src=844392;u7=vgmf;type=remar743;cat=mutua911;u1=prd;ord=1632433243910?" id="floodIframe" src="https://fls.doubleclick.net/activityi;src=844392;u7=vgmf;type=remar743;cat=mutua911;u1=prd;ord=1632433243910?"></iframe>
So I tried using this line of code to make the browser wait until the data is loaded:
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "data-ng-class")))
I think this is on the right track, but I don't know how to tell which element I should be waiting for, or whether I am doing it correctly. Is there a way to wait until the iframe data-delayed-src element goes away before getting the data?
I have seen it used with By.ID, but none of the elements in the HTML that I want has an id.
Here is the code I am using:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import os
dirname = os.path.dirname(__file__)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options, executable_path=os.path.join(dirname, 'chromedriver'))
symbol = 'vbirx'
url_vanguard = 'https://investor.vanguard.com/mutual-funds/profile/overview/{}'
browser.get(url_vanguard.format(symbol))
# WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "data-ng-class")))
html = browser.page_source
mySoup = BeautifulSoup(html, 'html.parser')
htmlData = mySoup.find('table',{'role':'presentation'})
table = htmlData.find('tbody')
print('table: \n',table)
The table prints without any of the data I want, like this:
<tbody>
<!-- ngRepeat: item in genericTableData.items -->
</tbody>
CodePudding user response:
I used the XPath of the Fund facts table in the WebDriverWait statement to get it working.
Code snippet:
symbol = 'vbirx'
url_vanguard = 'https://investor.vanguard.com/mutual-funds/profile/overview/{}'
browser.get(url_vanguard.format(symbol))
# Wait for the fund facts table to load
WebDriverWait(browser, 15).until(EC.presence_of_element_located((By.XPATH,'//*[@class="summary-table historical-table col2Wide"]')))
html = browser.page_source
mySoup = BeautifulSoup(html, 'html.parser')
htmlData = mySoup.find('table',{'role':'presentation'})
table = htmlData.find('tbody')
print('table: \n',table)