Python web scraping with data rendered from JavaScript


I want to scrape data from a website (https://nextgenftl.com/leagues/ftl-main-2022/game-weeks/week-30/players) whose content is rendered with JavaScript. I want to get all the players, along with each player's badge, price, and price change. How do I get all the data from the website after it has been rendered?

I'm trying to render the full page (including the scripts) before I scrape.

from requests_html import HTMLSession
from bs4 import BeautifulSoup

# Assign the URL,
# create the HTMLSession object,
# and run the "get" method to retrieve information from the URL
week = 30
url = f'https://nextgenftl.com/leagues/ftl-main-2022/game-weeks/week-{week}/players'
session = HTMLSession()
response = session.get(url)

# Check that the response status code was 200
# (the page was retrieved successfully)
res_code = response.status_code
print(res_code)
if res_code == 200:
    response.html.render()  # Critical line: render() executes the page's JavaScript so the dynamic content appears in the HTML

    # Parse the rendered HTML (response.content is the raw, un-rendered
    # page; the rendered markup lives in response.html.html)
    soup = BeautifulSoup(response.html.html, 'lxml')
    print(soup.prettify())
    
else:
    print("Could not reach web page!")

I couldn't use BS4 on its own because the page source does not contain the body (the body is entirely rendered by JavaScript). I've also looked through the network tab to see which APIs serve the data, but that didn't work. I tried Selenium as well, but I still don't know how to scrape the data from the website.
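
To illustrate what I mean by the network-tab approach: if a JSON endpoint were visible, the idea would be to call it directly with requests and skip rendering entirely. The endpoint below is hypothetical, just to show the pattern; I couldn't find one that actually returns the player data.

import requests

# Hypothetical endpoint -- replace with whatever request the network tab
# actually shows when the player list loads
api_url = 'https://nextgenftl.com/api/example-players-endpoint'

resp = requests.get(api_url, headers={'Accept': 'application/json'})
if resp.ok:
    data = resp.json()  # parse the JSON payload
    print(data)
else:
    print('Request failed:', resp.status_code)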

CodePudding user response:

Here is one way to get that info with Selenium. It's not fast, but it's reliable and returns all 725 players. The Selenium setup below is chromedriver on Linux; adapt it to your own setup, just keep the imports and the code after the driver is defined.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--window-size=1280,720")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)
url = 'https://nextgenftl.com/leagues/ftl-main-2022/game-weeks/week-30/players'
big_list = []
driver.get(url)

# Scroll through the list repeatedly so the ion-infinite-scroll element
# keeps loading more players until the full list is present
for _ in range(10):
    players = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//ion-list[not(@id="menu-list")]//ion-item')))
    for p in players:
        # accessing this property scrolls the element into view as a side effect
        p.location_once_scrolled_into_view
    wait.until(EC.presence_of_element_located((By.TAG_NAME, 'ion-infinite-scroll'))).location_once_scrolled_into_view

    t.sleep(1)
# All players are now loaded; extract the data from each ion-item
players = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//ion-list[not(@id="menu-list")]//ion-item')))
for p in players:
    try:
        p.location_once_scrolled_into_view
        badge = p.find_element(By.XPATH, './/ion-badge').text
        name = p.find_element(By.XPATH, './/ion-label').text
        current_price = p.find_element(By.XPATH, './/div[@title="Current Price"]').text
        price_change = p.find_element(By.XPATH, './/div[@title="Price Change"]').text
        average_points = p.find_element(By.XPATH, './/div[@title="3-Week Average Points"]').text
        events_played = p.find_element(By.XPATH, './/div[@title="Events Played"]').text

        big_list.append((badge, name, current_price, price_change, average_points, events_played))
    except Exception as e:
        print('error:', e)
        continue
t.sleep(2)
print(len(big_list))
df = pd.DataFrame(big_list, columns = ['badge', 'name', 'current_price', 'price_change', 'average_points', 'events_played'])
print(df)
df.to_csv('fantasy_tennis.csv')

This will display the dataframe in the terminal and also save it as a CSV file:

725
badge   name    current_price   price_change    average_points  events_played
0   ATP Novak Djokovic  $19.864m    --  116.97  7
1   ATP Rafael Nadal    $19.295m    ↓ 1.137 53.92   9
2   WTA Iga Swiatek $17.835m    ↓ 0.074 72.70   13
3   WTA Ashleigh Barty  $16.800m    --  169.50  1
4   ATP Carlos Alcaraz  $15.587m    ↑ 0.494 74.14   14
... ... ... ... ... ... ...
720 WTA Dayana Yastremska   $1.450m ↓ 0.068 3.75    14
721 WTA Xiaodi You  $1.450m --  3.77    1
722 WTA Eleana Yu   $1.450m --  2.90    1
723 WTA Anastasia Zakharova $1.450m --  1.77    1
724 ATP Kacper Zuk  $1.450m --  4.16    1
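
If you'd rather have numeric columns than the raw strings, a bit of post-processing on the dataframe will do it. This is just a sketch based on the formats shown above ('$19.864m' prices and price changes like '↑ 0.494', '↓ 1.137' or '--'); adjust it if the site changes its formatting.

# Convert '$19.864m' -> 19.864
df['current_price'] = (df['current_price']
                       .str.replace('$', '', regex=False)
                       .str.replace('m', '', regex=False)
                       .astype(float))

# Convert '↑ 0.494' -> 0.494, '↓ 1.137' -> -1.137, '--' -> 0.0
def parse_change(value):
    if value.startswith('↑'):
        return float(value.split()[1])
    if value.startswith('↓'):
        return -float(value.split()[1])
    return 0.0

df['price_change'] = df['price_change'].apply(parse_change)
df[['average_points', 'events_played']] = df[['average_points', 'events_played']].astype(float)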

See the Selenium documentation at https://www.selenium.dev/documentation/
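
One more note on the driver setup: if you're on Selenium 4.6 or newer, Selenium Manager resolves a matching chromedriver for you, so the explicit Service path above is optional. A minimal sketch:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")

# Selenium 4.6+ locates/downloads a matching chromedriver automatically,
# so no Service("path/to/chromedriver") is required
driver = webdriver.Chrome(options=chrome_options)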
