Scraping table returning repeated values

Time:03-29

I'm trying to build a simple web scraper that scrapes a table, but the output is the same line, School 20-5 33.2, repeated 26 times, and I'm not sure why.

Here is my code:

from bs4 import BeautifulSoup
import requests

url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
teams = soup.find_all('tr')
for team in teams:
    teamname = soup.find('th', class_ = "school").text
    record = soup.find('td', class_= "overall dw").text
    rating = soup.find('td', class_ = "rating sorted dw").text

    print(teamname, record, rating)

CodePudding user response:

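Notice that you never use team, the Tag for the current row. soup.find() searches the whole document and returns the first match, so every iteration prints the same values. Inside the for loop, all of the calls to soup.find() should be calls to team.find():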

for team in teams[1:]:
    teamname = team.find('th', class_ = "school").text
    record = team.find('td', class_= "overall dw").text
    rating = team.find('td', class_ = "rating sorted dw").text
    print(teamname, record, rating)

This outputs:

St. Mary's Prep (Orchard Lake) 20-5 33.2
University of Detroit Jesuit (Detroit) 16-7 30.0
Williamston 25-0 29.3
Ferndale 21-3 28.9
Catholic Central (Grand Rapids) 25-1 28.4
King (Detroit) 18-3 27.4
De La Salle Collegiate (Warren) 18-7 27.2
Catholic Central (Novi) 16-9 26.6
Brother Rice (Bloomfield Hills) 15-7 26.5
Unity Christian (Hudsonville) 21-1 26.4
Hamtramck 21-4 26.3
Grand Blanc 20-5 25.9
East Lansing 18-5 25.0
Muskegon 20-3 24.8
Northview (Grand Rapids) 25-1 24.6
Cass Tech (Detroit) 21-4 24.3
North Farmington (Farmington Hills) 18-4 24.2
Beecher (Flint) 23-2 24.0
Okemos 19-5 23.9
Benton Harbor 23-3 23.2
Rockford 19-3 22.9
Grand Haven 17-4 21.9
Hartland 19-4 21.0
Marshall 20-3 21.0
Freeland 24-0 21.0

We use [1:] to skip the table header, slicing off the first element in the teams list.
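To see the difference without hitting the site, here is a minimal sketch on a small inline table. The HTML is a made-up stand-in that only borrows the class names from the question; the real page's markup is assumed, not verified.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the rankings table (not the real page markup).
html = """
<table>
  <tr><th class="school">School</th><th class="overall dw">Ovr.</th></tr>
  <tr><th class="school">Williamston</th><td class="overall dw">25-0</td></tr>
  <tr><th class="school">Ferndale</th><td class="overall dw">21-3</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Buggy: soup.find() ignores the loop variable and always returns
# the first match in the whole document.
buggy = [
    (soup.find("th", class_="school").text,
     soup.find("td", class_="overall dw").text)
    for _ in rows
]

# Fixed: search inside each row, skipping the header row.
fixed = [
    (row.find("th", class_="school").text,
     row.find("td", class_="overall dw").text)
    for row in rows[1:]
]

print(buggy)
print(fixed)
```

In this sketch the header row has no td cells, so row.find('td', ...) would return None for it and .text would raise AttributeError, which is another reason to slice it off.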

CodePudding user response:

Let pandas parse that table for you (it uses a parser such as lxml or BeautifulSoup under the hood).

import pandas as pd

url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
df = pd.read_html(url)[0]

Output:

print(df)
     #                                  School  Ovr.  Rating  Str.   /-
0    1          St. Mary's Prep (Orchard Lake)  20-5    33.2  23.0  NaN
1    2  University of Detroit Jesuit (Detroit)  16-7    30.0  24.1  NaN
2    3                             Williamston  25-0    29.3  10.9  NaN
3    4                                Ferndale  21-3    28.9  16.5  NaN
4    5         Catholic Central (Grand Rapids)  25-1    28.4  11.4  NaN
5    6                          King (Detroit)  18-3    27.4  15.2  NaN
6    7         De La Salle Collegiate (Warren)  18-7    27.2  19.6  2.0
7    8                 Catholic Central (Novi)  16-9    26.6  22.6 -1.0
8    9         Brother Rice (Bloomfield Hills)  15-7    26.5  21.0 -1.0
9   10           Unity Christian (Hudsonville)  21-1    26.4  10.4  NaN
10  11                               Hamtramck  21-4    26.3  14.5  2.0
11  12                             Grand Blanc  20-5    25.9  15.3 -1.0
12  13                            East Lansing  18-5    25.0  15.6  1.0
13  14                                Muskegon  20-3    24.8  11.4  1.0
14  15                Northview (Grand Rapids)  25-1    24.6   8.2  1.0
15  16                     Cass Tech (Detroit)  21-4    24.3  11.8 -4.0
16  17     North Farmington (Farmington Hills)  18-4    24.2  13.1  NaN
17  18                         Beecher (Flint)  23-2    24.0   8.6  2.0
18  19                                  Okemos  19-5    23.9  13.7 -1.0
19  20                           Benton Harbor  23-3    23.2   9.9 -1.0
20  21                                Rockford  19-3    22.9  11.6  NaN
21  22                             Grand Haven  17-4    21.9  11.3  NaN
22  23                                Hartland  19-4    21.0  10.4  1.0
23  24                                Marshall  20-3    21.0   8.6 -1.0
24  25                                Freeland  24-0    21.0   2.7  4.0
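
If you only want the three values the question prints, you can select those columns by label. This is a rough sketch on a made-up inline table (the HTML and its rows are illustrative, not the real page), and it assumes a parser that read_html can use, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical inline table standing in for the rankings page.
html = """<table>
<tr><th>#</th><th>School</th><th>Ovr.</th><th>Rating</th></tr>
<tr><td>1</td><td>Williamston</td><td>25-0</td><td>29.3</td></tr>
<tr><td>2</td><td>Ferndale</td><td>21-3</td><td>28.9</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per table found.
df = pd.read_html(StringIO(html))[0]

# Keep only the columns that match teamname, record and rating.
subset = df[["School", "Ovr.", "Rating"]]
print(subset.to_string(index=False))
```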