Scraping tbody data -- trouble

Time:07-28

I'm new to this and struggling to understand why nothing seems to be working.

All I want to do is scrape the play-by-play table data from a list of stats.ncaa.org links (sample below):

https://stats.ncaa.org/game/play_by_play/12465

https://stats.ncaa.org/game/play_by_play/12755

https://stats.ncaa.org/game/play_by_play/12640

https://stats.ncaa.org/game/play_by_play/12290

For context, I got these links by scraping HREF tags from a different list of links (which contained every NCAA team's game schedule).

I've struggled through a lot of this, but until now there has always been an answer somewhere.

The browser inspector makes it look like the play-by-play table data sits inside a tbody tag, or at least I think so.

I've tried a script as simple as this (which works for other websites):

import pandas as pd

df = pd.read_html(
    'https://stats.ncaa.org/game/play_by_play/13592')[0]
print(df)

But it still didn't work for this site. I read a bit about using lxml.parser instead of html.parser (as in the code below), but that isn't working either. I thought this was my best chance at getting the tables from multiple links at once:

import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://stats.ncaa.org/game/play_by_play/12564',
    'https://stats.ncaa.org/game/play_by_play/13592'

]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml.parser')
    for profile in soup.find_all('a'):

        profile = profile.get('tbody')

        profiles.append(profile)

# print(profiles)

for p in profiles:
    print(p)

Any thoughts on what is unique about this site, or what the issue could be, would be greatly appreciated.

CodePudding user response:

That website checks whether a request comes from a bot or a browser, so you need to send a real browser User-Agent in the request headers.
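To see what the site sees, it can help to inspect the User-Agent header that requests would send, before and after overriding it. The following is a minimal sketch that makes no network call; the helper name `outgoing_user_agent` is made up for illustration, and the browser string is just one example of a browser-like value:

```python
import requests

# Example browser-like User-Agent; any current browser's value should work similarly.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
)


def outgoing_user_agent(session, url):
    """Hypothetical helper: return the User-Agent the session would send.

    prepare_request() merges the session headers into the request
    without actually contacting the server.
    """
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers["User-Agent"]


url = "https://stats.ncaa.org/game/play_by_play/12465"

# A fresh session identifies itself as python-requests/<version>.
plain = requests.Session()
default_ua = outgoing_user_agent(plain, url)

# After updating the session headers, every request carries the browser string.
browser_like = requests.Session()
browser_like.headers.update({"User-Agent": BROWSER_UA})
custom_ua = outgoing_user_agent(browser_like, url)

print(default_ua)
print(custom_ua)
```

The first printed value starts with `python-requests/`, which is easy for a server to recognise as a script; the second is the browser-like string.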

Each page has 8 tables. The code below goes through each URL you mentioned and prints every table, so you can review them and see which one you need:

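import io

import pandas as pd
import requests

# Browser-like User-Agent, sent instead of the default python-requests one.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

urls = [
    'https://stats.ncaa.org/game/play_by_play/12465',
    'https://stats.ncaa.org/game/play_by_play/12755',
    'https://stats.ncaa.org/game/play_by_play/12640',
    'https://stats.ncaa.org/game/play_by_play/12290',
]

# A session sends the same headers on every request.
s = requests.Session()
s.headers.update(headers)
for url in urls:
    r = s.get(url)
    # Wrap the HTML in StringIO: newer pandas versions deprecate passing
    # a literal HTML string to read_html.
    # dfs is a list of DataFrames, one per <table> on the page.
    dfs = pd.read_html(io.StringIO(r.text))
    for df in dfs:
        print(df)
        print('___________')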

Response:

       0         1         2      3
0    NaN  1st Half  2nd Half  Total
1  UTRGV        23        19     42
2   UTEP        26        34     60
___________
             0                                    1
0   Game Date:                           01/02/2009
1    Location:  El Paso, Texas (Don Haskins Center)
2  Attendance:                                 8413
___________
            0                                          1
0  Officials:  John Higgins, Duke Edsall, Quinton, Reece
___________
          0     1
0  1st Half  1  2
___________
                   0                                       1                2  \
0               Time                                   UTRGV            Score   
1              19:45                                     NaN              0-0   
2              19:45                                     NaN              0-0   
3              19:33                                     NaN              0-0   
4              19:33        Emmanuel Jones Defensive Rebound              0-0   
..               ...                                     ...              ...   
163            00:11                                     NaN            23-25   
164            00:11                                     NaN            23-26   
165            00:00  Emmanuel Jones missed Two Point Jumper            23-26   
166            00:00                                     NaN            23-26   
167  End of 1st Half                         End of 1st Half  End of 1st Half   

                                            3  
0                                        UTEP  
1     Arnett Moultrie missed Two Point Jumper  
2              Julyan Stone Offensive Rebound  
3    Stefon Jackson missed Three Point Jumper  
4                                         NaN  
..                                        ...  
163            Stefon Jackson made Free Throw  
164            Stefon Jackson made Free Throw  
165                                       NaN  
166            Julyan Stone Defensive Rebound  
167                           End of 1st Half  

[168 rows x 4 columns]
___________
          0     1
0  2nd Half  1  2
[..]
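In the output above, the play-by-play tables are the ones whose first row reads Time, team, Score, team. A minimal sketch of picking those out of the list and promoting that first row to column names might look like this. The function name `extract_play_by_play` and the "first row contains Time and Score" rule are assumptions based only on the printed output, and the small DataFrames stand in for real pages:

```python
import pandas as pd


def extract_play_by_play(dfs):
    """Return cleaned play-by-play tables from a list of DataFrames.

    Assumption: a play-by-play table is one whose first row contains
    both "Time" and "Score", as in the printed output.
    """
    cleaned = []
    for df in dfs:
        if df.empty:
            continue
        first_row = [str(value) for value in df.iloc[0].tolist()]
        if "Time" in first_row and "Score" in first_row:
            # Drop the header row from the data and use it as column names.
            table = df.iloc[1:].reset_index(drop=True)
            table.columns = first_row
            cleaned.append(table)
    return cleaned


# Small stand-ins shaped like two of the tables printed above.
score_box = pd.DataFrame([
    [None, "1st Half", "2nd Half", "Total"],
    ["UTRGV", 23, 19, 42],
    ["UTEP", 26, 34, 60],
])
plays = pd.DataFrame([
    ["Time", "UTRGV", "Score", "UTEP"],
    ["19:45", None, "0-0", "Arnett Moultrie missed Two Point Jumper"],
    ["19:33", "Emmanuel Jones Defensive Rebound", "0-0", None],
])

halves = extract_play_by_play([score_box, plays])
print(len(halves))
print(halves[0].columns.tolist())
```

With real pages you would pass the list returned by `pd.read_html` instead of the stand-ins. Trailing rows such as "End of 1st Half" are kept as-is here and may need filtering separately.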