Rvest: using css selector pulls data from different tab in URL

Time:03-22

I am very new to scraping, and am trying to pull data from a section of this website - https://projects.fivethirtyeight.com/soccer-predictions/premier-league/. The data I'm trying to get is in the second tab, "Matches," and is the section titled "Upcoming Matches."

I tried to do this with SelectorGadget and rvest, as follows:

library(rvest)
url <- "https://projects.fivethirtyeight.com/soccer-predictions/premier-league/"
url %>%
   read_html() %>%   # download and parse the page before selecting nodes
   html_nodes(".prob, .name") %>%
   html_text()

This returns values, but they correspond to the first tab on the page, "Standings." How can I target the section that I am actually trying to pull?

CodePudding user response:

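First, a caveat: I don't know R, only Python.

When you click Matches, the page uses JavaScript to generate the match list, which is why a selector applied to the downloaded HTML only finds the Standings tab. The page loads its JSON data from: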

https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_forecast.json

https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json

https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_clinches.json

I checked only one of them, 2021_premier-league_matches.json, and I see that it has data for Completed Matches.

I made an example in Python:

import requests

url = 'https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json'

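response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
data = response.json()  # a list of match dictionaries

for item in data:
    # keep only matches played on the chosen date
    if item['datetime'].startswith('2022-03-16'):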

        print('team1:', item['team1_code'], '|', item['team1'])
        print('prob1:', item['prob1'])
        print('score1:', item['score1'])
        print('adj_score1:', item['adj_score1'])
        print('chances1:', item['chances1'])
        print('moves1:', item['moves1'])
        print('---')

        print('team2:', item['team2_code'], '|', item['team2'])
        print('prob2:', item['prob2'])
        print('score2:', item['score2'])
        print('adj_score2:', item['adj_score2'])
        print('chances2:', item['chances2'])
        print('moves2:', item['moves2'])

        print('----------------------------------------')

Result:

team1: BHA | Brighton and Hove Albion
prob1: 0.30435
score1: 0
adj_score1: 0.0
chances1: 1.244
moves1: 1.682
---
team2: TOT | Tottenham Hotspur
prob2: 0.43627
score2: 2
adj_score2: 2.1
chances2: 1.924
moves2: 1.056
----------------------------------------
team1: ARS | Arsenal
prob1: 0.22114
score1: 0
adj_score1: 0.0
chances1: 0.569
moves1: 0.514
---
team2: LIV | Liverpool
prob2: 0.55306
score2: 2
adj_score2: 2.1
chances2: 1.243
moves2: 0.813
----------------------------------------
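
The question asked for "Upcoming Matches" rather than one fixed date, so here is a rough sketch of how the same list could be filtered for matches that have not been played yet. I did not verify how the file marks unplayed matches: the code assumes that 'score1' and 'score2' are missing or null for them, and that the 'datetime' strings sort chronologically. The function name upcoming_matches and the sample rows are made up for illustration (only the first row reuses values from the result above); with real data you would pass in the data list from the example.

```python
def upcoming_matches(items):
    """Return matches without a score, ordered by kickoff time.

    Assumption (not verified against the real file): unplayed matches
    have no 'score1'/'score2' key, or the value is null (None in Python).
    """
    pending = [
        m for m in items
        if m.get('score1') is None and m.get('score2') is None
    ]
    # Assumption: 'datetime' strings start with YYYY-MM-DD, so plain
    # string sorting puts them in chronological order.
    return sorted(pending, key=lambda m: m['datetime'])


# Hypothetical sample rows, only for demonstration.
sample = [
    {'datetime': '2022-03-16', 'team1': 'Brighton and Hove Albion',
     'team2': 'Tottenham Hotspur', 'prob1': 0.30435, 'prob2': 0.43627,
     'score1': 0, 'score2': 2},
    {'datetime': '2099-01-02', 'team1': 'Team C', 'team2': 'Team D',
     'prob1': 0.5, 'prob2': 0.25, 'score1': None, 'score2': None},
    {'datetime': '2099-01-01', 'team1': 'Team A', 'team2': 'Team B',
     'prob1': 0.4, 'prob2': 0.35},
]

for m in upcoming_matches(sample):
    print(m['datetime'], '|', m['team1'], 'vs', m['team2'],
          '|', m['prob1'], m['prob2'])
```

If the real file turns out to mark unplayed matches differently, only the condition inside the list comprehension needs to change.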