Dealing with run inconsistencies with web scraping

Time: 10-01

I am scraping data on mutual funds from the Vanguard website, and my code is giving me inconsistencies in my data between runs. How can I make my scraping code more robust to avoid these?

I am scraping data from this page and trying to get the average duration from the characteristics table.

Sometimes all the tickers go through with no problem, and other times some of the data on the page is missed. I assume this happens because the scraping runs before the data has fully loaded, but it only happens sometimes.

Here is the output for two back-to-back runs, showing a ticker being scraped successfully and then its data being missed on the following run.

VBIRX
Fund total net assets $74.7 billion
Number of bonds 2654
Average effective maturity 2.9 years
Average duration 2.8 years
Yield to maturity 0.5%
VSGBX
Fund total net assets $8.5 billion
Number of bonds 195
Average effective maturity 3.4 years
Average duration 1.7 years
Yield to maturity 0.4%
VFSTX # Here the data for VFSTX is successfully scraped
Fund total net assets $79.3 billion
Number of bonds 2519
Average effective maturity 2.8 years
Average duration 2.7 years
Yield to maturity 1.0%
VFISX
Fund total net assets $7.8 billion
Number of bonds 75
 2.2 years
 2.2 years
Yield to maturity 0.3%

# Here the data is missing for VFISX

Second run:

VBIRX
Fund total net assets $74.7 billion
Number of bonds 2654
Average effective maturity 2.9 years
Average duration 2.8 years
Yield to maturity 0.5%
VSGBX
Fund total net assets $8.5 billion
Number of bonds 195
Average effective maturity 3.4 years
Average duration 1.7 years
Yield to maturity 0.4%
VFSTX
Fund total net assets $79.3 billion
Number of bonds 2519
 2.8 years
 2.7 years
Yield to maturity 1.0%
# Here the data is missing for VFSTX even though it was scraped successfully in the previous run

The main issue is that for certain tickers the table has a different length, so I am using a dictionary to store the data, with the relevant label as the key. On some runs, the 'Average effective maturity' and 'Average duration' labels go missing, which breaks how I access the data.

As you can see from my output, the code works only sometimes, and I am not sure whether waiting for a different element to load on the page would fix it. How should I go about identifying my problem?

Here is the relevant code I am using:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import os
import csv


def extractOverviewTable(htmlTable):
    """Extract each row of the table body into a label -> value dictionary."""
    table = htmlTable.find('tbody')
    rows = table.findAll('tr')
    returnDict = {}
    for row in rows:
        cols = row.findAll('td')
        key = cols[0].find('span').text.replace('\n', '')
        value = cols[1].text.replace('\n', '')
        if 'Layer' in key:
            key = key[:key.index('Layer')]
        print(key, value)
        returnDict[key] = value

    return returnDict
    


def main():

    dirname = os.path.dirname(__file__)
    symbols = []
    with open(os.path.join(dirname, 'symbols.csv')) as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if row:
                symbols.append(row[0])
    symbols = [s.strip() for s in symbols if s.startswith('V')]    
    
    options = webdriver.ChromeOptions()
    options.page_load_strategy = 'normal'
    options.add_argument('--headless')
    browser = webdriver.Chrome(options=options, executable_path=os.path.join(dirname, 'chromedriver'))
    url_vanguard = 'https://investor.vanguard.com/mutual-funds/profile/overview/{}'
    
    for symbol in symbols:   
        browser.get(url_vanguard.format(symbol))
        print(symbol)
        WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH,'/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]/div/div[2]/div[4]/div[2]/div[2]/div[1]/div/table/tbody/tr[4]')))
        html = browser.page_source
        mySoup = BeautifulSoup(html, 'html.parser')
        htmlData = mySoup.findAll('table',{'role':'presentation'})
        overviewDataList = extractOverviewTable(htmlData[2])

Here is a subset of the symbols.csv file I am using:

VBIRX
VSGBX
VFSTX
VFISX
VMLTX
VWSTX
VFIIX
VWEHX
VBILX
VFICX
VFITX

CodePudding user response:

Try EC.visibility_of_element_located instead of EC.presence_of_element_located; presence only checks that the element exists in the DOM, not that it has actually rendered. If that doesn't work, try adding a time.sleep() of 1-2 seconds after the WebDriverWait statement.
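Another option, regardless of which wait condition you use, is to validate the scraped dictionary and retry the page when the expected labels are missing. A minimal sketch (the helper name, the required-key set, and the fake fetcher below are illustrative, not from your code; in practice `fetch` would wrap `browser.get` plus your `extractOverviewTable`):

```python
import time

REQUIRED_KEYS = {'Average effective maturity', 'Average duration'}

def scrape_with_retry(fetch, required_keys, retries=3, delay=1.0):
    """Call fetch() until the returned dict contains every required key.

    fetch: a zero-argument callable that loads the page and returns the
    label -> value dictionary. Returns the last result even if it is
    still incomplete after all retries.
    """
    result = {}
    for attempt in range(retries):
        result = fetch()
        if required_keys <= result.keys():  # all labels present
            return result
        time.sleep(delay)  # give the page a moment before retrying
    return result

# Usage with a fake fetcher that misses the labels once, then succeeds:
calls = []
def fake_fetch():
    calls.append(1)
    if len(calls) < 2:
        return {'Number of bonds': '75'}  # labels missing, as in your output
    return {'Number of bonds': '75',
            'Average effective maturity': '2.2 years',
            'Average duration': '2.2 years'}

data = scrape_with_retry(fake_fetch, REQUIRED_KEYS, delay=0)
```

This way a single slow render costs you one retry instead of a hole in the data, and after the retries are exhausted you can log the ticker as incomplete instead of silently mis-keying it.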
