How to web-scrape data that may move indexes in the future

Time:10-22

I am trying to web-scrape NFL standings data and am interested in the "PCT" and "Net Pts" columns of the table at this URL: https://www.nfl.com/standings/league/2021/REG. I have set up BeautifulSoup and printed all of the 'td' elements on the page. The problem is that the teams come back ordered from worst record to best. This will cause problems later if I hard-code a specific index as, for example, the Lions' PCT, because when their record changes that data will sit at a different index. In fact, the order of the teams on the website will change every week as more games are played.

Is there any way to say something like "if the name of the team is X, do something", such as using the table data 4 indexes lower? I haven't seen this problem addressed in any YouTube tutorial or book, so I am wondering what the thought process is. I need a way to look up each team and its PCT and Net Pts directly, as this info will be passed into another function.

Here is what I have so far. When you do something like this...

import requests
from bs4 import BeautifulSoup

url = 'https://www.nfl.com/standings/league/2021/REG'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
data = soup.find_all('td')[0:10]
print(data)
# I am using just the first 10 elements to keep it short here

...you get the table data info for the Detroit Lions as they are the worst team in the league at the time of posting this question. I have identified that their "PCT" data point would be

win_pct = soup.find_all('td')[4]
print(float(win_pct.text.strip()))

However, if another team becomes the worst team in the league this index would belong to them and not the Lions. How would I work around this? Thanks

CodePudding user response:

You can use a dictionary to store the data for each club, then use the club name as the key to get that data, independent of the club's position in the table. For example:

import requests
from bs4 import BeautifulSoup

url = "https://www.nfl.com/standings/league/2021/REG"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")

data = {}
for row in soup.select("tr:has(td)"):
    cells = [td.get_text(strip=True) for td in row.select("td")[1:]]
    club_name = row.select_one(".d3-o-club-fullname").get_text(strip=True)
    data[club_name] = cells

# print PCT/Net Pts of Detroit Lions:
print(data["Detroit Lions"][3], data["Detroit Lions"][6])

Prints:

0.000 -63
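
The lookup above still depends on the positions 3 and 6 within each row, which would shift if the site added or reordered columns. One possible refinement, sketched here under assumptions, is to key each value by its column title as well as by club name. The function name build_standings is hypothetical, as are the header titles other than "PCT" and "Net Pts" and the sample values other than the Lions' 0.000 and -63. The sketch assumes the header and cell text have already been extracted, for instance with get_text(strip=True) as in the loop above.

```python
def build_standings(headers, rows):
    """Map club name -> {column title -> cell text}.

    headers: column titles in page order; the first one is the club column.
    rows: one list of cell strings per club, in the same column order.
    """
    standings = {}
    for cells in rows:
        club, *values = cells
        # Pair every remaining cell with the title of its column.
        standings[club] = dict(zip(headers[1:], values))
    return standings


# Hypothetical sample data standing in for text scraped from the page.
headers = ["Club", "W", "L", "T", "PCT", "PF", "PA", "Net Pts"]
rows = [
    ["Detroit Lions", "w", "l", "t", "0.000", "pf", "pa", "-63"],
    ["Some Other Club", "w", "l", "t", "1.000", "pf", "pa", "50"],
]

standings = build_standings(headers, rows)
lions = standings["Detroit Lions"]
win_pct = float(lions["PCT"])
net_pts = int(lions["Net Pts"])
print(win_pct, net_pts)
```

With this shape, neither the club's row position nor the column's position appears in the lookup, so a weekly reordering of the table does not change the calling code.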