scrape sports reference table

Time:08-24

I tried the following script to grab the table on this webpage.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.sports-reference.com/cfb/play-index/rivals.cgi?request=1&school_id=penn-state&opp_id=purdue'

headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, 'html.parser')

soup.find('tbody')

However, the table cannot be pulled this way. Even a "pd.read_html" call does not work. Is there a reason for that?

CodePudding user response:

The table you want sits inside an HTML comment. Once the comment markers are removed, the table can be selected with BeautifulSoup and read with pandas.

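from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.sports-reference.com/cfb/play-index/rivals.cgi?request=1&school_id=penn-state&opp_id=purdue'
# Strip the comment markers so the commented-out table becomes regular markup
res = requests.get(url).text.replace('<!--', '').replace('-->', '')

soup = BeautifulSoup(res, 'lxml')

# The results table lives inside the element with id "div_results"
table = soup.select_one('#div_results')

# Wrap the markup in StringIO: recent pandas versions deprecate literal HTML strings
df = pd.read_html(StringIO(str(table)))[0]
# Drop the top level of the two-level column header
d = df.droplevel(0, axis=1)
print(d)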

Output:

     G        Date  Day           School Unnamed: 4_level_1     Opponent  ... Diff   W  L  T  Streak  Notes
0   19  2019-10-05  Sat  Penn State (12)                NaN       Purdue  ...   28  15  3  1     W 9    NaN
1   18  2016-10-29  Sat  Penn State (24)                  @       Purdue  ...   38  14  3  1     W 8    NaN
2   17  2013-11-16  Sat       Penn State                NaN       Purdue  ...   24  13  3  1     W 7    NaN
3   16  2012-11-03  Sat       Penn State                  @       Purdue  ...   25  12  3  1     W 6    NaN
4   15  2011-10-15  Sat       Penn State                NaN       Purdue  ...    5  11  3  1     W 5    NaN
5   14  2008-10-04  Sat   Penn State (6)                  @       Purdue  ...   14  10  3  1     W 4    NaN
6   13  2007-11-03  Sat       Penn State                NaN       Purdue  ...    7   9  3  1     W 3    NaN
7   12  2006-10-28  Sat       Penn State                  @       Purdue  ...   12   8  3  1     W 2    NaN
8   11  2005-10-29  Sat  Penn State (11)                NaN       Purdue  ...   18   7  3  1     W 1    NaN
9   10  2004-10-09  Sat       Penn State                NaN   Purdue (9)  ...   -7   6  3  1     L 2    NaN
10   9  2003-10-11  Sat       Penn State                  @  Purdue (18)  ...  -14   6  2  1     L 1    NaN
11   8  2000-09-30  Sat       Penn State                NaN  Purdue (22)  ...    2   6  1  1     W 6    NaN
12   7  1999-10-23  Sat   Penn State (2)                  @  Purdue (16)  ...    6   5  1  1     W 5    NaN
13   6  1998-10-17  Sat  Penn State (12)                NaN       Purdue  ...   18   4  1  1     W 4    NaN
14   5  1997-11-15  Sat   Penn State (6)                  @  Purdue (19)  ...   25   3  1  1     W 3    NaN
15   4  1996-10-12  Sat  Penn State (10)                NaN       Purdue  ...   17   2  1  1     W 2    NaN
16   3  1995-10-14  Sat  Penn State (20)                  @       Purdue  ...    3   1  1  1     W 1    NaN
17   2  1952-09-27  Sat       Penn State                NaN       Purdue  ...    0   0  1  1     T 1    NaN
18   1  1951-11-03  Sat       Penn State                  @       Purdue  ...  -28   0  1  0     L 1    NaN

[19 rows x 16 columns]
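
To see why removing the markers matters, here is a minimal offline sketch using only the standard library. The HTML snippet is made up for illustration (it is not the real page), and the helper names `TableCounter`, `count_tables`, and `strip_comment_markers` are hypothetical. It counts the `<table>` tags a parser actually sees before and after the markers are stripped.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup: the table is wrapped in an HTML comment,
# mimicking the structure described in the answer above.
SAMPLE_HTML = """
<div id="all_results">
<!--
<div id="div_results">
<table id="results">
<tr><th>G</th><th>Date</th></tr>
<tr><td>19</td><td>2019-10-05</td></tr>
</table>
</div>
-->
</div>
"""


class TableCounter(HTMLParser):
    """Count <table> start tags that the parser sees as real markup."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html):
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    return counter.tables


def strip_comment_markers(html):
    # Remove only the delimiters, keeping whatever was inside the comment.
    return re.sub(r"<!--|-->", "", html)


tables_before = count_tables(SAMPLE_HTML)
tables_after = count_tables(strip_comment_markers(SAMPLE_HTML))
print(tables_before, tables_after)
```

On this sample the parser finds no table while the comment is intact and one table once the markers are gone. Note that stripping every marker also exposes any other commented-out markup on a page, so selecting a specific element afterwards (as the answer does with `#div_results`) keeps the result predictable.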

CodePudding user response:

The <table> is stored inside an HTML comment (<!-- ... -->), so BeautifulSoup does not see it as markup. To parse it, you can collect the comments and hand their contents to pandas, as in the following example:

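from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment


url = "https://www.sports-reference.com/cfb/play-index/rivals.cgi?request=1&school_id=penn-state&opp_id=purdue"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
}

pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, "html.parser")

# Collect only the comment nodes; the hidden table markup is inside one of them
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Wrap the joined markup in StringIO: recent pandas versions deprecate literal HTML strings
df = pd.read_html(StringIO("\n".join(comments)))[0]
# Drop the top level of the two-level column header
df = df.droplevel(0, axis=1)
print(df)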

Prints:

     G        Date  Day           School Unnamed: 4_level_1     Opponent     Conf Unnamed: 7_level_1  Pts  Opp  Diff   W  L  T Streak  Notes
0   19  2019-10-05  Sat  Penn State (12)                NaN       Purdue  Big Ten                  W   35    7    28  15  3  1    W 9    NaN
1   18  2016-10-29  Sat  Penn State (24)                  @       Purdue  Big Ten                  W   62   24    38  14  3  1    W 8    NaN
2   17  2013-11-16  Sat       Penn State                NaN       Purdue  Big Ten                  W   45   21    24  13  3  1    W 7    NaN
3   16  2012-11-03  Sat       Penn State                  @       Purdue  Big Ten                  W   34    9    25  12  3  1    W 6    NaN
4   15  2011-10-15  Sat       Penn State                NaN       Purdue  Big Ten                  W   23   18     5  11  3  1    W 5    NaN
5   14  2008-10-04  Sat   Penn State (6)                  @       Purdue  Big Ten                  W   20    6    14  10  3  1    W 4    NaN
6   13  2007-11-03  Sat       Penn State                NaN       Purdue  Big Ten                  W   26   19     7   9  3  1    W 3    NaN
7   12  2006-10-28  Sat       Penn State                  @       Purdue  Big Ten                  W   12    0    12   8  3  1    W 2    NaN
8   11  2005-10-29  Sat  Penn State (11)                NaN       Purdue  Big Ten                  W   33   15    18   7  3  1    W 1    NaN
9   10  2004-10-09  Sat       Penn State                NaN   Purdue (9)  Big Ten                  L   13   20    -7   6  3  1    L 2    NaN
10   9  2003-10-11  Sat       Penn State                  @  Purdue (18)  Big Ten                  L   14   28   -14   6  2  1    L 1    NaN
11   8  2000-09-30  Sat       Penn State                NaN  Purdue (22)  Big Ten                  W   22   20     2   6  1  1    W 6    NaN
12   7  1999-10-23  Sat   Penn State (2)                  @  Purdue (16)  Big Ten                  W   31   25     6   5  1  1    W 5    NaN
13   6  1998-10-17  Sat  Penn State (12)                NaN       Purdue  Big Ten                  W   31   13    18   4  1  1    W 4    NaN
14   5  1997-11-15  Sat   Penn State (6)                  @  Purdue (19)  Big Ten                  W   42   17    25   3  1  1    W 3    NaN
15   4  1996-10-12  Sat  Penn State (10)                NaN       Purdue  Big Ten                  W   31   14    17   2  1  1    W 2    NaN
16   3  1995-10-14  Sat  Penn State (20)                  @       Purdue  Big Ten                  W   26   23     3   1  1  1    W 1    NaN
17   2  1952-09-27  Sat       Penn State                NaN       Purdue  Western                  T   20   20     0   0  1  1    T 1    NaN
18   1  1951-11-03  Sat       Penn State                  @       Purdue  Western                  L    0   28   -28   0  1  0    L 1    NaN
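
The comment-extraction approach can also be tried offline. The sketch below runs on a made-up snippet rather than the real page, and the sample markup and variable names are assumptions for illustration. It keeps only nodes that are genuinely `Comment` instances and contain table markup, then parses the comment body as its own document.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup: one unrelated comment and one comment
# that hides a table, similar to the page discussed above.
SAMPLE_HTML = """
<div id="all_results">
<!-- ad slot -->
<!--
<div id="div_results">
<table id="results">
<tr><th>G</th><th>Date</th></tr>
<tr><td>19</td><td>2019-10-05</td></tr>
</table>
</div>
-->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only nodes that really are comments and that contain table markup.
table_comments = [
    node
    for node in soup.find_all(string=lambda s: isinstance(s, Comment))
    if "<table" in node
]

# Parse the comment body as its own document to reach the hidden rows.
inner = BeautifulSoup(table_comments[0], "html.parser")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in inner.find_all("tr")
]
print(rows)
```

Filtering on `"<table"` skips comments that hold no table, so less unrelated markup reaches the table parser. The same comment text could be passed to pandas instead of being walked by hand.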