I'm new to this and struggling to understand why nothing seems to work.
All I want to do is scrape the play-by-play table data from a list of stats.ncaa.org links (sample below):
https://stats.ncaa.org/game/play_by_play/12465
https://stats.ncaa.org/game/play_by_play/12755
https://stats.ncaa.org/game/play_by_play/12640
https://stats.ncaa.org/game/play_by_play/12290
For context, I got these links by scraping the href attributes from a different list of links (which contained every NCAA team's game schedule).
I've struggled through a lot of this, but until now there has always been an answer somewhere.
The browser inspector makes it look like the play-by-play table data sits inside a tbody tag, or at least I think so.
I've tried a script as simple as this (which works for other websites):
import pandas as pd
df = pd.read_html(
'https://stats.ncaa.org/game/play_by_play/13592')[0]
print(df)
But it didn't work for this site. I read a bit about using 'lxml.parser' instead of 'html.parser' (as in the code below), but that isn't working either. I thought this was my best chance at getting the tables from multiple links at once:
import requests
from bs4 import BeautifulSoup
profiles = []
urls = [
'https://stats.ncaa.org/game/play_by_play/12564',
'https://stats.ncaa.org/game/play_by_play/13592'
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml.parser')
    for profile in soup.find_all('a'):
        profile = profile.get('tbody')
        profiles.append(profile)
# print(profiles)
for p in profiles:
    print(p)
Any thoughts as to what is unique about this site/what could be the issue would be greatly appreciated.
Answer:
That website appears to check whether a request comes from a bot or a browser, so you need to send a real browser User-Agent header with your requests.
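Separately, the BeautifulSoup loop in the question would come back empty even with a good response. As far as I can tell there are two reasons: 'lxml.parser' is not a parser name BeautifulSoup recognizes (the usual names are 'html.parser' and 'lxml'), and `profile.get('tbody')` looks up an attribute called tbody on each `<a>` tag instead of finding `<tbody>` elements. Here is a minimal sketch of the difference, using a made-up HTML snippet as a stand-in for the real page:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a play-by-play page (made-up markup, not the real site).
html = """
<a href="/teams/1">UTEP</a>
<table>
  <tbody>
    <tr><td>19:45</td><td>0-0</td></tr>
    <tr><td>19:33</td><td>0-0</td></tr>
  </tbody>
</table>
"""

# "html.parser" ships with Python; "lxml" is an alternative if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

# What the question's loop does: read an *attribute* named "tbody" from each link.
attribute_lookups = [a.get("tbody") for a in soup.find_all("a")]

# What was intended: find the <tbody> *elements* and collect the cells of each row.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tbody in soup.find_all("tbody")
    for tr in tbody.find_all("tr")
]

print(attribute_lookups)
print(rows)
```

The first list holds only None because links have no tbody attribute, which matches the None values the original loop prints.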
Each page has 8 tables. The pandas code below goes through each of the URLs you listed and prints every table, so you can review them and see which ones you need:
import requests
import pandas as pd
from io import StringIO

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
urls = [
    'https://stats.ncaa.org/game/play_by_play/12465',
    'https://stats.ncaa.org/game/play_by_play/12755',
    'https://stats.ncaa.org/game/play_by_play/12640',
    'https://stats.ncaa.org/game/play_by_play/12290',
]

# Reuse one session so the User-Agent header is sent with every request.
s = requests.Session()
s.headers.update(headers)

for url in urls:
    r = s.get(url)
    # Wrap the HTML in StringIO: newer pandas versions deprecate passing a raw HTML string.
    dfs = pd.read_html(StringIO(r.text))
    for df in dfs:
        print(df)
        print('___________')
Response:
0 1 2 3
0 NaN 1st Half 2nd Half Total
1 UTRGV 23 19 42
2 UTEP 26 34 60
___________
0 1
0 Game Date: 01/02/2009
1 Location: El Paso, Texas (Don Haskins Center)
2 Attendance: 8413
___________
0 1
0 Officials: John Higgins, Duke Edsall, Quinton, Reece
___________
0 1
0 1st Half 1 2
___________
0 1 2 \
0 Time UTRGV Score
1 19:45 NaN 0-0
2 19:45 NaN 0-0
3 19:33 NaN 0-0
4 19:33 Emmanuel Jones Defensive Rebound 0-0
.. ... ... ...
163 00:11 NaN 23-25
164 00:11 NaN 23-26
165 00:00 Emmanuel Jones missed Two Point Jumper 23-26
166 00:00 NaN 23-26
167 End of 1st Half End of 1st Half End of 1st Half
3
0 UTEP
1 Arnett Moultrie missed Two Point Jumper
2 Julyan Stone Offensive Rebound
3 Stefon Jackson missed Three Point Jumper
4 NaN
.. ...
163 Stefon Jackson made Free Throw
164 Stefon Jackson made Free Throw
165 NaN
166 Julyan Stone Defensive Rebound
167 End of 1st Half
[168 rows x 4 columns]
___________
0 1
0 2nd Half 1 2
[..]
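Judging from the output above, the play-by-play tables are the ones with four columns whose first row starts with "Time". If that holds for every page (an assumption based only on this one sample), something like the following could pull out just those tables and stack both halves into one DataFrame. The function name `pick_play_by_play` and the small sample frames are made up for illustration; in practice you would pass in the `dfs` list returned by `pd.read_html`:

```python
import pandas as pd


def pick_play_by_play(dfs):
    """Return one DataFrame holding the play-by-play rows from a list of tables."""
    halves = []
    for df in dfs:
        # Assumption: play-by-play tables have 4 columns and "Time" in the top-left cell.
        if df.shape[1] == 4 and len(df) > 1 and str(df.iloc[0, 0]) == "Time":
            body = df.iloc[1:].copy()
            # Promote the first row (Time, team, Score, team) to column names.
            body.columns = [str(c) for c in df.iloc[0]]
            halves.append(body)
    if not halves:
        return pd.DataFrame()
    return pd.concat(halves, ignore_index=True)


# Small frames shaped like the printed tables; the second-half row is invented.
summary = pd.DataFrame(
    [[None, "1st Half", "2nd Half", "Total"], ["UTRGV", 23, 19, 42], ["UTEP", 26, 34, 60]]
)
first_half = pd.DataFrame(
    [
        ["Time", "UTRGV", "Score", "UTEP"],
        ["19:45", None, "0-0", "Arnett Moultrie missed Two Point Jumper"],
        ["19:33", "Emmanuel Jones Defensive Rebound", "0-0", None],
    ]
)
second_half = pd.DataFrame(
    [["Time", "UTRGV", "Score", "UTEP"], ["19:40", None, "23-28", "hypothetical event"]]
)

plays = pick_play_by_play([summary, first_half, second_half])
print(plays)
```

The score summary also has four columns, which is why the check looks at the top-left cell and not only at the column count.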