Web scraping after a filter has been applied to data

Time:12-27

I'm trying to scrape data from the Premier League website: https://www.premierleague.com/clubs/4/club/stats?se=15

My problem is that when I request the URL above, I get the data from this page instead: https://www.premierleague.com/clubs/4/club/stats

On the website, both the data and the URL change when I filter to a different season, but the data I scrape does not change.

My code:

from bs4 import BeautifulSoup
import requests
import numpy as np

ChelseaReq  = requests.get("https://www.premierleague.com/clubs/4/club/stats?se=15")
ChelseaData = ChelseaReq.text

soup = BeautifulSoup(ChelseaData, "html.parser")
dataSet = np.array([])
dataSet1 = np.array([])
chelsea_db = {}
for stattext in soup.find_all("div",class_ ="normalStat"):

    chelsea_stat_numbers = stattext.span.text.split()[-1]
    chelsea_stat_numbers = chelsea_stat_numbers.replace(',','')
    chelsea_stat_numbers = chelsea_stat_numbers.replace('%','')
    dataSet = np.append(dataSet,float(chelsea_stat_numbers))

    chelsea_stat_attributes = ','.join(stattext.span.text.split()[0:-1])
    chelsea_stat_attributes = chelsea_stat_attributes.replace(',',' ')
    dataSet1 = np.append(dataSet1,chelsea_stat_attributes)

for A,B in zip(dataSet1,dataSet):
    chelsea_db[A] = B

chelsea_db

This prints the total data instead of the filtered data. How would I change it to return the filtered data instead?

For example:

Current output:
'Goals': 1936.0,
'Goals per match': 1.71,
'Shots': 9954.0,  ... etc

My goal (after using the website's filter button to select a single season):
'Goals': 36,
'Goals per match': 1.71,
'Shots': 160,  ... etc

CodePudding user response:

You don't get the filtered data because it is loaded by JavaScript through an XHR request after the page loads, so it is not in the HTML that requests.get returns. You can send that request directly and get all the data you need in JSON format, so you don't even need BeautifulSoup. Here is a code sample:

import requests

headers = {
    'origin': 'https://www.premierleague.com',  # you get 403 Forbidden without this header
}
params = {
    "comps": 1,
    "compSeasons": 15  # season ID; 15 here, as in se=15 in the question's URL
}
chelsea_season_data = requests.get("https://footballapi.pulselive.com/football/stats/team/4",
                                   params=params, headers=headers)
chelsea_season_data.raise_for_status()  # fail early on an HTTP error such as 403
data = chelsea_season_data.json()  # parse the JSON body of the response

# API stat names mapped to the labels to print
labels = {
    'wins': 'Wins',
    'losses': 'Losses',
    'goals': 'Goals',
    'goals_conceded': 'Goals conceded',
    'clean_sheet': 'Clean sheets',
}
for stat in data['stats']:
    if stat['name'] in labels:
        print(f"{labels[stat['name']]}: {stat['value']}")
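If you want a dictionary like chelsea_db from the question rather than printed lines, the stats list in the JSON response can be collapsed into one. This is only a minimal sketch: stats_to_dict is a hypothetical helper name, the sample payload is made up for illustration, and it assumes each entry in 'stats' has the 'name' and 'value' keys used in the code above.

```python
def stats_to_dict(payload):
    """Map each stat name to its value.

    Assumes `payload` has the shape used in the answer above:
    {"stats": [{"name": ..., "value": ...}, ...]}.
    """
    return {stat["name"]: stat["value"] for stat in payload.get("stats", [])}


if __name__ == "__main__":
    # Made-up sample payload for illustration only, not real API output.
    sample = {
        "stats": [
            {"name": "wins", "value": 3.0},
            {"name": "goals", "value": 7.0},
        ]
    }
    chelsea_db = stats_to_dict(sample)
    print(chelsea_db)
```

With the real response you would call stats_to_dict(data) on the parsed JSON and look up individual stats by their API name, for example chelsea_db['goals'].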