I tried the following script to grab the table on the webpage.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.sports-reference.com/cfb/play-index/rivals.cgi?request=1&school_id=penn-state&opp_id=purdue'
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, 'html.parser')
soup.find('tbody')
However, the table is not found. Even pd.read_html does not pick it up. Is there a reason for that?
CodePudding user response:
The desired table is inside an HTML comment. Once the comment markers are removed, you can read the table with pandas.
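from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.sports-reference.com/cfb/play-index/rivals.cgi?request=1&school_id=penn-state&opp_id=purdue'
# Strip the comment markers so the commented-out table becomes ordinary markup
res = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(res, 'lxml')
table = soup.select_one('#div_results')
# Newer pandas versions expect a file-like object rather than a literal HTML string
df = pd.read_html(StringIO(str(table)))[0]
# Drop the top level of the two-level column header
d = df.droplevel(0, axis=1)
print(d)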
Output:
G Date Day School Unnamed: 4_level_1 Opponent ... Diff W L T Streak Notes
0 19 2019-10-05 Sat Penn State (12) NaN Purdue ... 28 15 3 1 W 9 NaN
1 18 2016-10-29 Sat Penn State (24) @ Purdue ... 38 14 3 1 W 8 NaN
2 17 2013-11-16 Sat Penn State NaN Purdue ... 24 13 3 1 W 7 NaN
3 16 2012-11-03 Sat Penn State @ Purdue ... 25 12 3 1 W 6 NaN
4 15 2011-10-15 Sat Penn State NaN Purdue ... 5 11 3 1 W 5 NaN
5 14 2008-10-04 Sat Penn State (6) @ Purdue ... 14 10 3 1 W 4 NaN
6 13 2007-11-03 Sat Penn State NaN Purdue ... 7 9 3 1 W 3 NaN
7 12 2006-10-28 Sat Penn State @ Purdue ... 12 8 3 1 W 2 NaN
8 11 2005-10-29 Sat Penn State (11) NaN Purdue ... 18 7 3 1 W 1 NaN
9 10 2004-10-09 Sat Penn State NaN Purdue (9) ... -7 6 3 1 L 2 NaN
10 9 2003-10-11 Sat Penn State @ Purdue (18) ... -14 6 2 1 L 1 NaN
11 8 2000-09-30 Sat Penn State NaN Purdue (22) ... 2 6 1 1 W 6 NaN
12 7 1999-10-23 Sat Penn State (2) @ Purdue (16) ... 6 5 1 1 W 5 NaN
13 6 1998-10-17 Sat Penn State (12) NaN Purdue ... 18 4 1 1 W 4 NaN
14 5 1997-11-15 Sat Penn State (6) @ Purdue (19) ... 25 3 1 1 W 3 NaN
15 4 1996-10-12 Sat Penn State (10) NaN Purdue ... 17 2 1 1 W 2 NaN
16 3 1995-10-14 Sat Penn State (20) @ Purdue ... 3 1 1 1 W 1 NaN
17 2 1952-09-27 Sat Penn State NaN Purdue ... 0 0 1 1 T 1 NaN
18 1 1951-11-03 Sat Penn State @ Purdue ... -28 0 1 0 L 1 NaN
[19 rows x 16 columns]
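To see why soup.find('tbody') comes back empty in the question, here is a minimal offline sketch. The HTML string is made up to imitate the layout described above (only the cell values are borrowed from the first row of the output), and find_tbody and strip_comment_markers are hypothetical helper names, not part of any library.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the page layout: the table sits inside a comment.
HTML = """
<div id="div_results">
<!--
<table id="results">
  <tbody>
    <tr><td>2019-10-05</td><td>Penn State</td><td>Purdue</td></tr>
  </tbody>
</table>
-->
</div>
"""


def find_tbody(markup):
    """Return the first <tbody> tag the parser sees, or None."""
    return BeautifulSoup(markup, "html.parser").find("tbody")


def strip_comment_markers(markup):
    """Remove the comment delimiters so the enclosed markup becomes real tags."""
    return markup.replace("<!--", "").replace("-->", "")


# The comment body is a single opaque string to the parser, so no tag is found.
hidden = find_tbody(HTML)

# Once the markers are gone, the same markup parses into regular tags.
visible = find_tbody(strip_comment_markers(HTML))

print(hidden)
print([td.get_text() for td in visible.find_all("td")])
```

Note that blindly removing every comment marker also exposes any other commented-out markup on the page, which is why the answer narrows the result down with select_one afterwards.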
CodePudding user response:
The <table> is stored inside an HTML comment (<!-- -->), so BeautifulSoup normally doesn't see it. To parse it, you can use the following example:
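from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment
url = "https://www.sports-reference.com/cfb/play-index/rivals.cgi?request=1&school_id=penn-state&opp_id=purdue"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
}
pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, "html.parser")
# Keep only the Comment nodes (string= is the current name of the older text= argument)
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
# Join the comment bodies and let pandas parse the first table it finds in them
df = pd.read_html(StringIO("\n".join(comments)))[0]
# Drop the top level of the two-level column header
df = df.droplevel(0, axis=1)
print(df)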
Prints:
G Date Day School Unnamed: 4_level_1 Opponent Conf Unnamed: 7_level_1 Pts Opp Diff W L T Streak Notes
0 19 2019-10-05 Sat Penn State (12) NaN Purdue Big Ten W 35 7 28 15 3 1 W 9 NaN
1 18 2016-10-29 Sat Penn State (24) @ Purdue Big Ten W 62 24 38 14 3 1 W 8 NaN
2 17 2013-11-16 Sat Penn State NaN Purdue Big Ten W 45 21 24 13 3 1 W 7 NaN
3 16 2012-11-03 Sat Penn State @ Purdue Big Ten W 34 9 25 12 3 1 W 6 NaN
4 15 2011-10-15 Sat Penn State NaN Purdue Big Ten W 23 18 5 11 3 1 W 5 NaN
5 14 2008-10-04 Sat Penn State (6) @ Purdue Big Ten W 20 6 14 10 3 1 W 4 NaN
6 13 2007-11-03 Sat Penn State NaN Purdue Big Ten W 26 19 7 9 3 1 W 3 NaN
7 12 2006-10-28 Sat Penn State @ Purdue Big Ten W 12 0 12 8 3 1 W 2 NaN
8 11 2005-10-29 Sat Penn State (11) NaN Purdue Big Ten W 33 15 18 7 3 1 W 1 NaN
9 10 2004-10-09 Sat Penn State NaN Purdue (9) Big Ten L 13 20 -7 6 3 1 L 2 NaN
10 9 2003-10-11 Sat Penn State @ Purdue (18) Big Ten L 14 28 -14 6 2 1 L 1 NaN
11 8 2000-09-30 Sat Penn State NaN Purdue (22) Big Ten W 22 20 2 6 1 1 W 6 NaN
12 7 1999-10-23 Sat Penn State (2) @ Purdue (16) Big Ten W 31 25 6 5 1 1 W 5 NaN
13 6 1998-10-17 Sat Penn State (12) NaN Purdue Big Ten W 31 13 18 4 1 1 W 4 NaN
14 5 1997-11-15 Sat Penn State (6) @ Purdue (19) Big Ten W 42 17 25 3 1 1 W 3 NaN
15 4 1996-10-12 Sat Penn State (10) NaN Purdue Big Ten W 31 14 17 2 1 1 W 2 NaN
16 3 1995-10-14 Sat Penn State (20) @ Purdue Big Ten W 26 23 3 1 1 1 W 1 NaN
17 2 1952-09-27 Sat Penn State NaN Purdue Western T 20 20 0 0 1 1 T 1 NaN
18 1 1951-11-03 Sat Penn State @ Purdue Western L 0 28 -28 0 1 0 L 1 NaN
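The Comment-based approach can also be tried without a network request. The sketch below is only an illustration under assumptions: the HTML string is made up (two values are borrowed from the first row of the output above), and tables_in_comments and table_rows are hypothetical helper names. It filters for Comment nodes explicitly and re-parses only those that contain table markup.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: one comment holds a table, another is an ordinary note.
HTML = """
<div id="all_results">
<p>Visible text</p>
<!-- <table><tr><th>Date</th><th>Pts</th></tr><tr><td>2019-10-05</td><td>35</td></tr></table> -->
<!-- an unrelated note -->
</div>
"""


def tables_in_comments(markup):
    """Return every <table> tag found inside HTML comments of markup."""
    soup = BeautifulSoup(markup, "html.parser")
    # Select only the nodes that really are comments.
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    tables = []
    for comment in comments:
        if "<table" not in comment:
            continue  # skip comments that carry no table markup
        # Re-parse the comment body so its contents become regular tags.
        inner = BeautifulSoup(comment, "html.parser")
        tables.extend(inner.find_all("table"))
    return tables


def table_rows(table):
    """Flatten a <table> tag into a list of rows of cell text."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


tables = tables_in_comments(HTML)
rows = table_rows(tables[0])
print(rows)
```

Unlike stripping every comment marker from the page, this leaves the rest of the document untouched and only looks at comment bodies that actually contain a table.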