Is there a way to delay my web scraper before it scrapes the page?

Here is my function:

def clubList(url,yearCode):
    print(url + "/clubs" + yearCode)
    response = requests.get(url + "/clubs" + yearCode)
    time.sleep(10)
    content = response.content
    soup = BeautifulSoup(content, "html.parser")
    cluburl = []
    clubs = []
    ul = soup.find_all(
        "ul",
        attrs={
            "class": "block-list-5 block-list-3-m block-list-1-s block-list-1-xs block-list-padding dataContainer"
        },
    )
    u = str(ul)
    soup2 = BeautifulSoup(u, "html.parser")
    for i, tags in enumerate(soup2.find_all("a")):
        cluburl.append(url + str(tags.get("href")))
    for i in range(0, len(cluburl)):
        cluburl[i] = cluburl[i].replace("overview", "squad")
    return cluburl

I'm trying to scrape the Premier League website to build a stats database for a data analysis project.

My current link tree looks like:

https://www.premierleague.com -> https://www.premierleague.com/clubs -> https://www.premierleague.com/clubs?se=418

The "?se=418" is the access code that I add to the link to specify which season's stats I would like to view, with each season having its own unique code.

I pass "https://www.premierleague.com" as the url and "?se=418" as the yearCode to my function, and it should return the list of links to the individual clubs' pages for that particular season. However, it always returns the club link list for the current season.

I've noticed that when I open https://www.premierleague.com/clubs?se=418 directly in a browser, it first loads the current season's clubs and then dynamically swaps in the clubs for the requested season.

So I thought adding a time delay might do the trick, but I guess the page contents are already fetched by the time requests.get returns, and I'm not sure where I should add my delay to make this work.

Also, here are all the modules you'll have to import to run the function:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import locale
import time

locale.setlocale(locale.LC_ALL, "en_US.UTF8")

CodePudding user response:

When you filter by season, the page fetches the club list from the following API:

GET https://footballapi.pulselive.com/football/teams

It needs the following HTTP headers to return the data: "account: premierleague" and "origin: https://www.premierleague.com".

The following example uses the API to get the club list, then extracts each club's id and name to generate the club URL:

import requests

season = 418

r = requests.get(
    "https://footballapi.pulselive.com/football/teams",
    params={
        "pageSize": 100,
        "compSeasons": season,
        "compCodeForActivePlayer": "null",
        "comps": 1,
        "altIds": "true",
        "page": 0,
    },
    headers={
        "account": "premierleague",
        "origin": "https://www.premierleague.com",
    },
)

data = r.json()
print([
    f'https://www.premierleague.com/clubs/{int(t["club"]["id"])}/{t["club"]["name"].replace(" ","-")}/squad'
    for t in data["content"]
])
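
If you want this in the same shape as the original clubList function, one possible arrangement is sketched below. The names build_squad_urls and club_list, and the timeout argument, are made up for illustration. The payload layout (a "content" list whose items carry "club" with "id" and "name") is assumed from the example above rather than from any API documentation. Splitting the URL building from the HTTP call means the first part can be checked without touching the network.

```python
import requests

API_URL = "https://footballapi.pulselive.com/football/teams"
SITE_URL = "https://www.premierleague.com"


def build_squad_urls(payload):
    """Turn a decoded API payload into squad-page URLs (no network access)."""
    urls = []
    for team in payload.get("content", []):
        club = team["club"]
        # Same slug rule as the example above: spaces become hyphens.
        slug = club["name"].replace(" ", "-")
        urls.append(f"{SITE_URL}/clubs/{int(club['id'])}/{slug}/squad")
    return urls


def club_list(season, timeout=10):
    """Fetch the clubs for one season code (e.g. 418) and return squad URLs."""
    response = requests.get(
        API_URL,
        params={
            "pageSize": 100,
            "compSeasons": season,
            "compCodeForActivePlayer": "null",
            "comps": 1,
            "altIds": "true",
            "page": 0,
        },
        headers={
            "account": "premierleague",
            "origin": "https://www.premierleague.com",
        },
        timeout=timeout,
    )
    # Raise on 4xx/5xx so an error body is never parsed as club data.
    response.raise_for_status()
    return build_squad_urls(response.json())


# Offline check with a hand-made payload (a made-up club, not real API output).
sample = {"content": [{"club": {"id": 1.0, "name": "Example Town"}}]}
print(build_squad_urls(sample))
```

Calling club_list(418) would then play the role of clubList(url, "?se=418"), with no time.sleep needed: the delay never helped because requests only downloads the initial HTML and does not run the JavaScript that swaps in the other season.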