Problem with scraping JSON data from website

Time:03-27

I am trying to scrape this website for the data in the table: https://investor.vanguard.com/etf/profile/overview/ESGV/portfolio-holdings

I inspected the website and found that the table data is loaded as JSON from a separate API endpoint. This is my code, which tries to request that endpoint using headers and a payload:

import pandas as pd
import requests
import scraper_helper

headers = """ XXX """
headers = scraper_helper.get_dict(headers,strip_cookie=False)

url = 'https://api.vanguard.com/rs/ire/01/ind/fund/4393/portfolio-holding/stock.jsonp'
payload = {
    'callback': 'angular.callbacks._m',
    'planId': 'null',
    'asOfType': 'daily',
    'start': '1',
    'count': '1527',
}

jsonData = requests.get(url, params=payload).json()
results = jsonData['fund']['entity']

df2 = pd.json_normalize(results, record_path=['portfolioHolding'])
df2 = pd.DataFrame(df2,index=list(range(len(df2))))
print(df2)

When I open that link manually in a browser, I get an error: "We're sorry. The page you requested could not be found." That is usually not a problem: I have scraped several websites where the JSON data link shows an error in the browser but still works in Python. This time, however, the same error also comes up in Python, and I can't get around it.

How can I fix this? Thanks!

CodePudding user response:

It seems their endpoint requires the Referer header to be set to https://investor.vanguard.com/.

Try this:

requests.get(url, params=payload, headers={ 'Referer': 'https://investor.vanguard.com/' }).text

Note that the response isn't quite JSON: it is JSONP, with the JSON wrapped in angular.callbacks._m( … ), so calling .json() on it will fail until the wrapper is stripped.
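One way to handle that wrapper is to cut out everything between the first "(" and the last ")" and parse only that part. The sketch below is a rough illustration, not a tested scraper for this site: strip_jsonp and fetch_holdings are hypothetical helper names, the timeout value is arbitrary, and the sample body is made up to mimic the wrapper shape (only the 'fund' and 'entity' keys come from the question's code).

```python
import json


def strip_jsonp(text):
    """Return the parsed JSON payload from a JSONP body such as callback({...});"""
    start = text.index("(")   # first "(" opens the callback call
    end = text.rindex(")")    # last ")" closes it
    return json.loads(text[start + 1:end])


def fetch_holdings(url, params):
    """Request the endpoint with the Referer header set, then unwrap the JSONP."""
    import requests  # imported here so strip_jsonp can be used without requests

    response = requests.get(
        url,
        params=params,
        headers={"Referer": "https://investor.vanguard.com/"},
        timeout=30,  # arbitrary value, chosen for illustration
    )
    response.raise_for_status()
    return strip_jsonp(response.text)


# Offline check with a made-up body shaped like the wrapper described above.
sample = 'angular.callbacks._m({"fund": {"entity": [{"ticker": "AAA"}]}});'
parsed = strip_jsonp(sample)
print(parsed["fund"]["entity"])
```

With the url and payload from the question, jsonData = fetch_holdings(url, payload) would take the place of the requests.get(...).json() line, assuming the endpoint still behaves as described in this answer.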
