Is there a way to web scrape a site where everything has the same name?

Time:01-26

Hi! I'm new to BeautifulSoup and I'm trying to scrape the information from this website (the URL is in the code below):

The problem is that when I inspect the elements on the page, every cell is a "td" with class "sch1", so when I extract them I get one big flat list. How can I import this information in a way that is readable and usable? I would like to build a DataFrame from it.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://feeds.donbest.com/schedulemembers/getRotation.html?bookType=1&eventDate=20230129"
get_url = requests.get(url).content
soup = BeautifulSoup(get_url,"html.parser")

# Collect the header cells (class "schtop1") and the data cells (class "sch1").
title = soup.find_all("td", "schtop1")
rotation = soup.find_all("td", "sch1")

title_list = []
rotation_list = []

for mainT in title:
    title_list.append(mainT.text)
print(title_list)

for rot in rotation:
    rotation_list.append(rot.text)
print(rotation_list)

Output: ['NFL CONFERENCE CHAMPIONSHIPS', 'SUNDAY, JANUARY 29, 2023'] ['321', 'SAN FRANCISCO 49ERS', '', 'P: Sun Jan 29 12:00:00 PST 2023\xa0\n C: Sun Jan 29 14:00:00 PST 2023\xa0\n E: Sun Jan 29 15:00:00 PST 2023', '322', 'PHILADELPHIA EAGLES', '323', 'CINCINNATI BENGALS', '', 'P: Sun Jan 29 15:30:00 PST 2023\xa0\n C: Sun Jan 29 17:30:00 PST 2023\xa0\n E: Sun Jan 29 18:30:00 PST 2023', '324', 'KANSAS CITY CHIEFS']

I need to be able to use this information to build a pandas dataframe that looks like this:

| Date                     | Rot Visitor | Visitor             | Rot Home | Home                | PST                          | ET                           | CT                           |
| SUNDAY, JANUARY 29, 2023 | 321         | SAN FRANCISCO 49ERS | 322      | PHILADELPHIA EAGLES | Sun Jan 29 12:00:00 PST 2023 | Sun Jan 29 15:00:00 PST 2023 | Sun Jan 29 14:00:00 PST 2023 |
| SUNDAY, JANUARY 29, 2023 | 323         | CINCINNATI BENGALS  | 324      | KANSAS CITY CHIEFS  | Sun Jan 29 15:30:00 PST 2023 | Sun Jan 29 18:30:00 PST 2023 | Sun Jan 29 17:30:00 PST 2023 |

I think I can build the dataframe if I can get the data in a more useful format.

CodePudding user response:

Try:

import re
import pandas as pd
import requests
from bs4 import BeautifulSoup


url = 'https://feeds.donbest.com/schedulemembers/getRotation.html?bookType=1&eventDate=20230129'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')


all_data = []
for t in soup.select('table:not(:has(table))'):
    rows = []
    for tr in t.select('tr'):
        tds = [td.text for td in tr.select('td')]
        rows.append(tds)
    all_data.append({
        'Date': soup.select('td[colspan="7"]')[1].text.strip(),
        'Rot Visitor': rows[0][0],
        'Visitor': rows[0][1],
        'Rot Home': rows[2][0],
        'Home': rows[2][1],
        'Dates': {k.strip(): v.strip() for k, v in re.findall(r'(?sm)(\S+)\s*:(.*?)(?:[PEC]:|$)', rows[1][1])}
    })

df = pd.DataFrame(all_data)
df = pd.concat([df, df.pop('Dates').apply(pd.Series)], axis=1)
df = df.rename(columns={'P': 'PST', 'E': 'ET', 'C': 'CT'})
print(df.to_markdown())  # to_markdown() requires the optional "tabulate" package

Prints:

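|    | Date                     |   Rot Visitor | Visitor             |   Rot Home | Home                | PST                          | CT                           | ET                           |
|---:|:-------------------------|--------------:|:--------------------|-----------:|:--------------------|:-----------------------------|:-----------------------------|:-----------------------------|
|  0 | SUNDAY, JANUARY 29, 2023 |           321 | SAN FRANCISCO 49ERS |        322 | PHILADELPHIA EAGLES | Sun Jan 29 12:00:00 PST 2023 | Sun Jan 29 14:00:00 PST 2023 | Sun Jan 29 15:00:00 PST 2023 |
|  1 | SUNDAY, JANUARY 29, 2023 |           323 | CINCINNATI BENGALS  |        324 | KANSAS CITY CHIEFS  | Sun Jan 29 15:30:00 PST 2023 | Sun Jan 29 17:30:00 PST 2023 | Sun Jan 29 18:30:00 PST 2023 |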
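
The time cell is the least obvious part, so here is a minimal sketch of how a pattern of the form used for the 'Dates' entry above (written here with `\S+` for the label) splits it. It runs on one cell copied from the question's output, so no network access is needed. The helper name parse_times is made up for illustration, and the pattern assumes each label is a single token followed by a colon, one label per line, as in the sample.

```python
import re

# One time cell copied from the question's output.
# "\xa0" is a non-breaking space left over from the HTML.
cell = (
    'P: Sun Jan 29 12:00:00 PST 2023\xa0\n'
    ' C: Sun Jan 29 14:00:00 PST 2023\xa0\n'
    ' E: Sun Jan 29 15:00:00 PST 2023'
)


def parse_times(text):
    """Split 'P: ... C: ... E: ...' into a dict keyed by the one-letter label.

    Hypothetical helper, not part of any library.
    """
    # (\S+)\s*:  captures the label before the first colon on a line
    # (.*?)      lazily captures the value up to the next label or end of line
    pairs = re.findall(r'(?sm)(\S+)\s*:(.*?)(?:[PEC]:|$)', text)
    # str.strip() also removes the trailing non-breaking space.
    return {label.strip(): value.strip() for label, value in pairs}


times = parse_times(cell)
print(times)
```

Each key (P, C, E) then becomes its own column once the dictionaries are expanded, which is what the rename to PST, CT and ET relies on.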

CodePudding user response:

import pandas as pd


# read_html returns a list of DataFrames, one per <table>; [0] takes the first.
df = pd.read_html(
    'https://feeds.donbest.com/schedulemembers/getRotation.html?bookType=1&eventDate=20230129')[0]
print(df)

Output:

0                       NFL CONFERENCE CHAMPIONSHIPS  ...  NFL CONFERENCE CHAMPIONSHIPS
1                           SUNDAY, JANUARY 29, 2023  ...      SUNDAY, JANUARY 29, 2023
2  321  SAN FRANCISCO 49ERS  P: Sun Jan 29 12:00:...  ...                           NaN
3  323  CINCINNATI BENGALS  P: Sun Jan 29 15:30:0...  ...                           NaN

[4 rows x 7 columns]
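
If the flat list from the question is all you have, it can probably also be regrouped without parsing the HTML again. This is only a rough sketch: it assumes every game contributes exactly six "sch1" cells in the order seen in the question's output (rotation number, visitor, empty cell, times, rotation number, home). That assumption comes from the sample alone and may not hold on other dates. The names CELLS_PER_GAME and split_times are made up for this example, and the time cell is split with plain string methods rather than a regular expression.

```python
# Flat cell texts copied from the question's output.
rotation_list = [
    '321', 'SAN FRANCISCO 49ERS', '',
    'P: Sun Jan 29 12:00:00 PST 2023\xa0\n C: Sun Jan 29 14:00:00 PST 2023\xa0\n E: Sun Jan 29 15:00:00 PST 2023',
    '322', 'PHILADELPHIA EAGLES',
    '323', 'CINCINNATI BENGALS', '',
    'P: Sun Jan 29 15:30:00 PST 2023\xa0\n C: Sun Jan 29 17:30:00 PST 2023\xa0\n E: Sun Jan 29 18:30:00 PST 2023',
    '324', 'KANSAS CITY CHIEFS',
]
date = 'SUNDAY, JANUARY 29, 2023'  # second entry of title_list in the question

# Assumption: six cells per game, as in the sample output above.
CELLS_PER_GAME = 6


def split_times(cell):
    """Turn a 'P: ... C: ... E: ...' cell into {'PST': ..., 'CT': ..., 'ET': ...}."""
    names = {'P': 'PST', 'C': 'CT', 'E': 'ET'}
    result = {}
    for line in cell.splitlines():
        # Split on the first colon only; the time itself contains colons.
        label, _, value = line.strip().partition(':')
        result[names[label.strip()]] = value.strip()
    return result


records = []
for i in range(0, len(rotation_list), CELLS_PER_GAME):
    rot_v, visitor, _blank, times, rot_h, home = rotation_list[i:i + CELLS_PER_GAME]
    records.append({
        'Date': date,
        'Rot Visitor': rot_v,
        'Visitor': visitor,
        'Rot Home': rot_h,
        'Home': home,
        **split_times(times),
    })

for record in records:
    print(record)
```

Passing records to pd.DataFrame(records) should then give one row per game with the columns asked for in the question.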