I want to scrape the URLs of all the items in the table but when I try, nothing comes up. The code is quite basic so I can see why it might not work. However, even trying to scrape the title of this website, nothing comes up. I at least expected the h1 tag as it's outside the table...
Website: https://www.vanguard.com.au/personal/products/en/overview
import requests
from bs4 import BeautifulSoup

lists = []
url = 'https://www.vanguard.com.au/personal/products/en/overview'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

title = soup.find_all('h1', class_='heading2 gbs-font-vanguard-red')
for links in soup.find_all('a', style='padding-bottom: 1px;'):
    link_text = links['href']
    lists.append(link_text)
print(title)
print(lists)
CodePudding user response:
If the problem is caused by a JavaScript event listener, I would suggest using BeautifulSoup together with Selenium to scrape this website: let Selenium send the request and hand back the rendered page source, then have BeautifulSoup parse it.
In addition, you should use title = soup.find() instead of title = soup.find_all() in order to get only one title.
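As a side note, the difference between find() and find_all() is easy to see on a small static snippet. The HTML below is purely illustrative (it reuses the class and style values from the question, not the live Vanguard page):

```python
from bs4 import BeautifulSoup

# A tiny static snippet, purely for illustration
html = '''
<h1 class="heading2 gbs-font-vanguard-red">Investment products</h1>
<a style="padding-bottom: 1px;" href="/detail/8132">Fund A</a>
<a style="padding-bottom: 1px;" href="/detail/8219">Fund B</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching Tag (or None if nothing matches)
title = soup.find('h1', class_='heading2 gbs-font-vanguard-red')
print(title.get_text())  # Investment products

# find_all() always returns a list, even when only one tag matches
links = [a['href'] for a in soup.find_all('a', style='padding-bottom: 1px;')]
print(links)  # ['/detail/8132', '/detail/8219']
```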
An example of the code using Firefox:
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup

url = 'https://www.vanguard.com.au/personal/products/en/overview'

# Selenium 3 style; in Selenium 4, pass the driver path via a Service object instead
browser = webdriver.Firefox(executable_path=GeckoDriverManager().install())
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()

lists = []
title = soup.find('h1', class_='heading2 gbs-font-vanguard-red')
for links in soup.find_all('a', style='padding-bottom: 1px;'):
    link_text = links['href']
    lists.append(link_text)
print(title)
print(lists)
Output:
<h1 class="heading2 gbs-font-vanguard-red">Investment products</h1>
['/personal/products/en/detail/8132', '/personal/products/en/detail/8219', '/personal/products/en/detail/8121',...,'/personal/products/en/detail/8217']
CodePudding user response:
The most common problem (with many modern pages): this page uses JavaScript to add elements, but requests/BeautifulSoup can't run JavaScript.
You may need to use Selenium to control a real web browser, which can run JavaScript.
This example uses only Selenium, without BeautifulSoup. I use XPath, but you may also use a CSS selector.
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.vanguard.com.au/personal/products/en/overview'
lists = []

#driver = webdriver.Chrome(executable_path="/path/to/chromedriver.exe")
driver = webdriver.Firefox(executable_path="/path/to/geckodriver.exe")
driver.get(url)

title = driver.find_element(By.XPATH, '//h1[@class="heading2 gbs-font-vanguard-red"]')
print(title.text)

all_items = driver.find_elements(By.XPATH, '//a[@style="padding-bottom: 1px;"]')
for links in all_items:
    link_text = links.get_attribute('href')
    print(link_text)
    lists.append(link_text)
You will also need to download the matching driver for your browser:
- ChromeDriver (for Chrome)
- GeckoDriver (for Firefox)
CodePudding user response:
It's always more efficient to get the data from the source than to go through Selenium. It looks like the links are built from the portId.
import pandas as pd
import requests

url = 'https://www3.vanguard.com.au/personal/products/funds.json'
payload = {
    'context': '/personal/products/',
    'countryCode': 'au.ret',
    'paths': "[[['funds','legacyFunds'],'AU']]",
    'method': 'get'}

jsonData = requests.get(url, params=payload).json()
results = jsonData['jsonGraph']['funds']['AU']['value']

df1 = pd.json_normalize(results, record_path=['children'])
df2 = pd.json_normalize(results, record_path=['listings'])
df = pd.concat([df1, df2], axis=0)
df['url_link'] = 'https://www.vanguard.com.au/personal/products/en/detail/' + df['portId'] + '/Overview'
Output:
print(df[['fundName', 'url_link']])
fundName url_link
0 Vanguard Active Emerging Market Equity Fund https://www.vanguard.com.au/personal/products/...
1 Vanguard Active Global Credit Bond Fund https://www.vanguard.com.au/personal/products/...
2 Vanguard Active Global Growth Fund https://www.vanguard.com.au/personal/products/...
3 Vanguard Australian Corporate Fixed Interest I... https://www.vanguard.com.au/personal/products/...
4 Vanguard Australian Fixed Interest Index Fund https://www.vanguard.com.au/personal/products/...
.. ... ...
23 Vanguard MSCI Australian Small Companies Index... https://www.vanguard.com.au/personal/products/...
24 Vanguard MSCI Index International Shares (Hedg... https://www.vanguard.com.au/personal/products/...
25 Vanguard MSCI Index International Shares ETF https://www.vanguard.com.au/personal/products/...
26 Vanguard MSCI International Small Companies In... https://www.vanguard.com.au/personal/products/...
27 Vanguard International Credit Securities Hedge... https://www.vanguard.com.au/personal/products/...
[66 rows x 2 columns]
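If you don't need pandas, the same URL construction is a plain loop over the two record lists. The dict below only mimics the shape of the funds.json response (field names taken from the answer above; the two sample records are invented for illustration):

```python
# Minimal sketch: build detail-page URLs from portId values, no pandas needed.
# `results` mirrors jsonData['jsonGraph']['funds']['AU']['value'] from the
# answer above; the sample records themselves are made up.
results = {
    'children': [
        {'fundName': 'Vanguard Active Global Growth Fund', 'portId': '8121'},
    ],
    'listings': [
        {'fundName': 'Vanguard MSCI Index International Shares ETF', 'portId': '8212'},
    ],
}

base = 'https://www.vanguard.com.au/personal/products/en/detail/'
records = results['children'] + results['listings']
urls = {r['fundName']: base + r['portId'] + '/Overview' for r in records}

for name, link in urls.items():
    print(name, '->', link)
```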