I am trying to scrape some rugby statistics from pages that all look like this one (one per player): https://www.unitedrugby.com/clubs/benetton/filippo-alongi
This is just an example one.
First I set up a driver with selenium and then pass the content to BeautifulSoup for html exploration.
from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.unitedrugby.com/clubs/benetton/filippo-alongi"
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
driver.quit()
At this point, I want to fetch the elements with the class player-hero__info-wrap. I do that with find_all(), which finds most things on the page, but not this one.
If you open the link above and inspect the weight value (118KG), you will land very near this tag in the inspector, so you can see that it exists.
However, when scraping, I can't see it. I am using Selenium because this page seems to need JavaScript rendering before it can be read, but I still can't find all the classes.
I tried adding the following lines to execute javascript:
driver.execute_script("return document.documentElement.outerHTML;")
or even:
driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
But nothing.
Can anybody help me fetch this class?
CodePudding user response:
This is one way to obtain that info, with selenium only (why parse the page twice?):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--window-size=1280,720")

webdriver_service = Service("chromedriver/chromedriver")  ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(browser, 20)

url = 'https://www.unitedrugby.com/clubs/benetton/filippo-alongi'
browser.get(url)

try:
    wait.until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print('accepted cookies')
except Exception:
    print('no cookie button!')

player_stats = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div.player-hero__info-wrap')))
print(player_stats.text)

### do other stuff, get other info, etc etc ###
browser.quit()
This will click away the annoying cookie popup (probably unnecessary in your scenario, but useful in case you want to interact with the page), and print in the terminal:
accepted cookies
AGE
22
HEIGHT
6'0''
WEIGHT
118KG
You don't really need BeautifulSoup when using Selenium, as it has powerful locators and finding methods. For documentation, please visit https://www.selenium.dev/documentation/
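Since player_stats.text returns the block's visible text with one field per line (as shown above), you can pair alternate lines to get a label-to-value mapping. A minimal sketch, using the output printed above as a stand-in for the live value:

```python
# Stand-in for the string player_stats.text returned above;
# in your script you would use player_stats.text directly.
raw = "AGE\n22\nHEIGHT\n6'0''\nWEIGHT\n118KG"

# Drop blank lines, then zip even-indexed lines (labels) with odd-indexed ones (values)
lines = [line.strip() for line in raw.splitlines() if line.strip()]
stats = dict(zip(lines[::2], lines[1::2]))
print(stats)  # {'AGE': '22', 'HEIGHT': "6'0''", 'WEIGHT': '118KG'}
```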
EDIT: And here is another solution based on requests/BeautifulSoup:
import requests
from bs4 import BeautifulSoup as bs
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.unitedrugby.com/clubs/benetton/filippo-alongi'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
player_data = soup.select_one('div.player-hero__info-wrap')
print(player_data.text.strip())
Result:
Age
22
Height
6'0''
Weight
118KG
Relevant documentation: https://beautiful-soup-4.readthedocs.io/en/latest/index.html for BeautifulSoup, and https://requests.readthedocs.io/en/latest/ for requests.
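If you want structured data rather than a text blob, BeautifulSoup's stripped_strings generator yields each text node with whitespace removed, which pairs up neatly. A sketch with a hypothetical HTML snippet mimicking the info-wrap block (the real page may nest the labels and values differently):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the live page's info-wrap block
html = """
<div class="player-hero__info-wrap">
  <span>Age</span><span>22</span>
  <span>Height</span><span>6'0''</span>
  <span>Weight</span><span>118KG</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
info = soup.select_one("div.player-hero__info-wrap")

# stripped_strings yields "Age", "22", "Height", ... in document order
items = list(info.stripped_strings)
stats = dict(zip(items[::2], items[1::2]))
print(stats)  # {'Age': '22', 'Height': "6'0''", 'Weight': '118KG'}
```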