I am trying to scrape a site that lists games and their schedules. At first I could import all the relevant data, but that changed once the games started being played. For games that have started, the site drops the "time" column from its display, so those rows come in with one column fewer than before. The lists I collect therefore end up with different lengths, and building a DataFrame from them fails. I would like to import only the games that have yet to be played.
import requests
from bs4 import BeautifulSoup

link = "https://www.espn.com/nfl/schedule/_/week/1/year/2022/seasontype/3"
page = requests.get(link)
soup = BeautifulSoup(page.content, "html.parser")
nfl_resp = soup.find_all('div', class_='ResponsiveTable')

nfl_list = []
nfl_time_list = []
nfl_location_list = []
visit_list = []

for i in nfl_resp:
    location = i.find_all(class_='location__col Table__TD')
    for team in location:
        nfl_location_list.append(team.text)
# I get all the correct stadiums

for i in nfl_resp:
    time = i.find_all(class_='date__col Table__TD')
    for hour in time:
        nfl_time_list.append(hour.text)
# I get all the correct times

for i in nfl_resp:
    location = i.find_all(class_='location__col Table__TD')
    for team in location:
        nfl_location_list.append(team.text)
# I get all dates correctly

for i in nfl_resp:
    visit = i.find_all(class_="events__col Table__TD")
    for team in visit:
        visit_list.append(team.text)
# Here's the problem: I get all the games regardless of whether they have started or not.
# It only works if the games are yet to start; I need to run it when the games are in progress or over too.
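
To illustrate the failure described at the top, here is a minimal sketch that does not touch the site at all. The team names, stadium names, and times are made-up stand-ins for the scraped lists; the only point is that one list ends up shorter once a game has lost its "time" cell.

```python
import pandas as pd

# Made-up stand-ins for the scraped lists: three games, one of which has
# already been played, so the "time" list is one entry short.
visitors = ["Team A", "Team B", "Team C"]
stadiums = ["Stadium 1", "Stadium 2", "Stadium 3"]
times = ["4:30 PM", "8:15 PM"]

try:
    # Column lists of different lengths cannot be combined into one frame.
    pd.DataFrame({"Visitor": visitors, "Time": times, "Stadium": stadiums})
    built = True
except ValueError as err:
    built = False
    print("DataFrame construction failed:", err)
```

Because each column is collected in its own loop, nothing ties a time back to the game it belongs to, which is why a single missing cell breaks the whole table.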
CodePudding user response:
You can use this example that parses various information from the ESPN site:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.espn.com/nfl/schedule/_/week/1/year/2022/seasontype/3"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for row in soup.select(".Table__TR:has(.AnchorLink)"):
    data = [t.text for t in row.select(".AnchorLink:not(:has(img))")]
    networks = [
        n["alt"] if n.name == "img" else n.text
        for n in row.select(".network-container img, .network-container .network-name")
    ]
    date = row.find_previous(class_="Table__Title").text.strip()
    all_data.append([*data, networks, date])

df = pd.DataFrame(
    all_data,
    columns=["Team 1", "Team 2", "Time", "Tickets", "Stadium", "Networks", "Date"],
)
print(df)
Prints:
   Team 1       Team 2         Time     Tickets                 Stadium                             Networks            Date
0  Seattle      San Francisco  4:30 PM  Tickets as low as $138  Levi's Stadium, Santa Clara, CA     [FOX]               Saturday, January 14, 2023
1  Los Angeles  Jacksonville   8:15 PM  Tickets as low as $138  TIAA Bank Field, Jacksonville, FL   [NBC]               Saturday, January 14, 2023
2  Miami        Buffalo        1:00 PM  Tickets as low as $114  Highmark Stadium, Orchard Park, NY  [CBS]               Sunday, January 15, 2023
3  New York     Minnesota      4:30 PM  Tickets as low as $116  U.S. Bank Stadium, Minneapolis, MN  [FOX]               Sunday, January 15, 2023
4  Baltimore    Cincinnati     8:15 PM  Tickets as low as $171  Paycor Stadium, Cincinnati, OH      [NBC]               Sunday, January 15, 2023
5  Dallas       Tampa Bay      8:15 PM  Tickets as low as $163  Raymond James Stadium, Tampa, FL    [ESPN, ABC, ESPN ]  Monday, January 16, 2023
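
If the goal is to keep only the games that have not been played yet, one possible approach is to walk the table row by row and skip any row without a time cell. The sketch below runs on a small inline HTML sample rather than the live page, so the markup is an assumption. The `date__col`, `events__col` and `location__col` classes come from the question, while `home__col`, `result__col`, the "Team A"/"Team B" row and the helper names `cell_text` and `upcoming_games` are hypothetical and only there for illustration. The real page may be structured differently, so the selectors would need checking against it.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the schedule page (not ESPN's real HTML).
# Assumption: an upcoming game has a "date__col" cell with the kickoff time,
# while a game that has started has some other cell in its place.
SAMPLE_HTML = """
<table>
  <tr class="Table__TR">
    <td class="events__col Table__TD">Seattle</td>
    <td class="home__col Table__TD">San Francisco</td>
    <td class="date__col Table__TD">4:30 PM</td>
    <td class="location__col Table__TD">Levi's Stadium, Santa Clara, CA</td>
  </tr>
  <tr class="Table__TR">
    <td class="events__col Table__TD">Team A</td>
    <td class="home__col Table__TD">Team B</td>
    <td class="result__col Table__TD">Final</td>
    <td class="location__col Table__TD">Stadium X</td>
  </tr>
  <tr class="Table__TR">
    <td class="events__col Table__TD">Miami</td>
    <td class="home__col Table__TD">Buffalo</td>
    <td class="date__col Table__TD">1:00 PM</td>
    <td class="location__col Table__TD">Highmark Stadium, Orchard Park, NY</td>
  </tr>
</table>
"""

COLUMNS = ["Visitor", "Home", "Time", "Stadium"]


def cell_text(row, selector):
    """Return the stripped text of the first matching cell, or None if absent."""
    cell = row.select_one(selector)
    return cell.get_text(strip=True) if cell is not None else None


def upcoming_games(html):
    """Build a DataFrame from the rows that still have a kickoff-time cell."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select("tr.Table__TR"):
        kickoff = cell_text(row, ".date__col")
        if kickoff is None:
            continue  # no time cell: treat the game as started or finished
        records.append(
            {
                "Visitor": cell_text(row, ".events__col"),
                "Home": cell_text(row, ".home__col"),
                "Time": kickoff,
                "Stadium": cell_text(row, ".location__col"),
            }
        )
    return pd.DataFrame(records, columns=COLUMNS)


df = upcoming_games(SAMPLE_HTML)
print(df)
```

Collecting one record per row keeps every value attached to its own game, so a missing cell affects only that row instead of shifting the rest of the table.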