How to properly render a scraped HTML page with JavaScript and Selenium, to get a missing class

I am trying to scrape some rugby statistics from pages that all look like this one (one per player): https://www.unitedrugby.com/clubs/benetton/filippo-alongi

This is just one example.

First I set up a driver with Selenium and then pass the page source to BeautifulSoup for HTML exploration.

url = "https://www.unitedrugby.com/clubs/benetton/filippo-alongi"
driver = webdriver.Chrome( options=chrome_options)
driver.get(url)
soup = driver.page_source
soup = BeautifulSoup(soup, 'html.parser')
driver.quit()

At this point I want to fetch the elements with the class player-hero__info-wrap. I do that with find_all(), which finds most elements on the page, but not this one. The lookup is roughly this:
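
# a sketch of the find_all() lookup described above;
# class_ is BeautifulSoup's keyword argument for the class attribute
info_wraps = soup.find_all('div', class_='player-hero__info-wrap')
print(info_wraps)  # empty for this class, even though it exists on the rendered page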

If you open the link above and inspect the weight value (118KG), the inspector lands right next to this tag, so you can see that the element exists on the rendered page.

However, when I scrape the page, I can't see it. I am using Selenium because the page seems to need to be rendered with JavaScript before it can be read, but I still can't see all of the classes.

I tried adding the following lines to execute JavaScript:

driver.execute_script("return document.documentElement.outerHTML;")

or even:

driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

But nothing.
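
As far as I understand, those calls just return the markup as a string, so the return value would need to be captured and parsed for them to make any difference; roughly:

# a sketch of capturing the script's return value and parsing that instead of driver.page_source
rendered_html = driver.execute_script("return document.documentElement.outerHTML;")
soup = BeautifulSoup(rendered_html, 'html.parser')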

Can anybody help me fetch this class?

CodePudding user response:

This is one way to obtain that info, with Selenium only (why parse the page twice?):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--window-size=1280,720")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(browser, 20)

url = 'https://www.unitedrugby.com/clubs/benetton/filippo-alongi'
browser.get(url)
try:
    wait.until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print('accepted cookies')
except Exception as e:
    print('no cookie button!')
player_stats = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div[class="player-hero__info-wrap"]')))
print(player_stats.text)
### do other stuff, get other info, etc etc ###
browser.quit()

This will click away the annoying cookie popup (probably unnecessary in your scenario, but useful in case you want to interact with the page), and print in the terminal:

accepted cookies
AGE
22
HEIGHT
6'0''
WEIGHT
118KG

You don't really need BeautifulSoup when using Selenium, as Selenium has powerful locators and finding methods of its own. For documentation, please visit https://www.selenium.dev/documentation/
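
If you want the individual values rather than one text block, you can post-process the element's text; a rough sketch, assuming the label/value lines alternate as in the output above:

# split the element's text into non-empty lines and pair them up label -> value
lines = [line.strip() for line in player_stats.text.splitlines() if line.strip()]
stats = dict(zip(lines[0::2], lines[1::2]))
print(stats)  # e.g. {'AGE': '22', 'HEIGHT': "6'0''", 'WEIGHT': '118KG'}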

EDIT: And here is another solution based on requests/BeautifulSoup:

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

url = 'https://www.unitedrugby.com/clubs/benetton/filippo-alongi'

r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
player_data = soup.select_one('div.player-hero__info-wrap')
print(player_data.text.strip())

Result:

Age
22


Height
6'0''


Weight
 118KG

Relevant documentation: https://beautiful-soup-4.readthedocs.io/en/latest/index.html for BeautifulSoup, and https://requests.readthedocs.io/en/latest/ for requests.
