Is there a way to filter or remove data from BeautifulSoup?

Time:01-15

I was trying to scrape a site that lists games and their schedules. At first I could import all the relevant data, but that changed once the games started: the website drops the "time" column for those games, so my program receives one column fewer than before. Building a DataFrame from the collected data then fails, because the rows no longer have the same number of entries. I would like to import only the games that have not been played yet.

import requests
from bs4 import BeautifulSoup

link = "https://www.espn.com/nfl/schedule/_/week/1/year/2022/seasontype/3"
page = requests.get(link)
soup = BeautifulSoup(page.content,"html.parser")

nfl_resp = soup.find_all('div',class_='ResponsiveTable')
# Search the whole document here: the loop variable `i` does not exist yet at this point
visit = soup.find_all(class_="events__col Table__TD")

nfl_list = []
nfl_time_list = []
nfl_location_list = []
visit_list = []

for i in nfl_resp:
    location = i.find_all(class_='location__col Table__TD')
    for team in location:
        nfl_location_list.append(team.text)

#I get all the correct stadiums 

for i in nfl_resp:
    time = i.find_all(class_='date__col Table__TD')
    for hour in time:
        nfl_time_list.append(hour.text)

#I get all the correct times

for i in nfl_resp:
    location = i.find_all(class_='location__col Table__TD')
    for team in location:
        nfl_location_list.append(team.text)

#I get all dates correctly

for team in visit:
    visit_list.append(team.text)

# Here's the problem: I get all the games, whether or not they have started.
# The script only works while the games are yet to start; I need it to work when games are in progress or over, too.

CodePudding user response:

You can use this example that parses various information from the ESPN site:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.espn.com/nfl/schedule/_/week/1/year/2022/seasontype/3"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for row in soup.select(".Table__TR:has(.AnchorLink)"):
    data = [t.text for t in row.select(".AnchorLink:not(:has(img))")]
    networks = [
        n["alt"] if n.name == "img" else n.text
        for n in row.select(".network-container img, .network-container .network-name")
    ]
    date = row.find_previous(class_="Table__Title").text.strip()
    all_data.append([*data, networks, date])

df = pd.DataFrame(
    all_data,
    columns=["Team 1", "Team 2", "Time", "Tickets", "Stadium", "Networks", "Date"],
)
print(df)

Prints:

        Team 1         Team 2     Time                 Tickets                             Stadium            Networks                        Date
0      Seattle  San Francisco  4:30 PM  Tickets as low as $138     Levi's Stadium, Santa Clara, CA               [FOX]  Saturday, January 14, 2023
1  Los Angeles   Jacksonville  8:15 PM  Tickets as low as $138   TIAA Bank Field, Jacksonville, FL               [NBC]  Saturday, January 14, 2023
2        Miami        Buffalo  1:00 PM  Tickets as low as $114  Highmark Stadium, Orchard Park, NY               [CBS]    Sunday, January 15, 2023
3     New York      Minnesota  4:30 PM  Tickets as low as $116  U.S. Bank Stadium, Minneapolis, MN               [FOX]    Sunday, January 15, 2023
4    Baltimore     Cincinnati  8:15 PM  Tickets as low as $171      Paycor Stadium, Cincinnati, OH               [NBC]    Sunday, January 15, 2023
5       Dallas      Tampa Bay  8:15 PM  Tickets as low as $163    Raymond James Stadium, Tampa, FL  [ESPN, ABC, ESPN ]    Monday, January 16, 2023
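
To import only the games that have not been played yet, one option is to remove the unwanted rows from the parsed tree before collecting any columns. The sketch below assumes, as described in the question, that a game which has started no longer has a time cell (class date__col). It runs against a small hand-written HTML sample rather than the live page; the sample markup, the home__col class and the drop_started_games helper are made up for illustration, so the selectors may need adjusting for the real site.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the schedule markup (an assumption, not ESPN's real HTML).
# The second row has no "date__col" cell, mimicking a game that has already started.
SAMPLE_HTML = """
<div class="ResponsiveTable">
  <table>
    <tr class="Table__TR">
      <td class="events__col Table__TD">Seattle</td>
      <td class="home__col Table__TD">San Francisco</td>
      <td class="date__col Table__TD">4:30 PM</td>
      <td class="location__col Table__TD">Levi's Stadium, Santa Clara, CA</td>
    </tr>
    <tr class="Table__TR">
      <td class="events__col Table__TD">Miami</td>
      <td class="home__col Table__TD">Buffalo</td>
      <td class="location__col Table__TD">Highmark Stadium, Orchard Park, NY</td>
    </tr>
  </table>
</div>
"""


def drop_started_games(soup):
    """Remove, in place, every table row that has no time cell (hypothetical helper)."""
    for row in soup.select("tr.Table__TR"):
        if row.select_one(".date__col") is None:
            # decompose() deletes the tag and its children from the tree
            row.decompose()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
drop_started_games(soup)

# The remaining rows all have the same columns, so the lists stay aligned.
visit_list = [td.get_text(strip=True) for td in soup.find_all(class_="events__col")]
nfl_time_list = [td.get_text(strip=True) for td in soup.find_all(class_="date__col")]
nfl_location_list = [td.get_text(strip=True) for td in soup.find_all(class_="location__col")]

print(visit_list, nfl_time_list)  # ['Seattle'] ['4:30 PM']
```

Because the started games are removed from the soup itself, every later find_all call only sees upcoming games, and the collected lists have equal lengths when they are passed to a DataFrame.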