Scraping and Extracting Sports Reference's NBA Box Score data using Class and ID


I have recently been working on a scraper for NBA box scores from www.basketball-reference.com and am very new to Beautiful Soup. I tried using the site's widgets, but many are broken, so they are not an option. I have therefore been trying alternative methods for extracting the two stat tables. The following code works for some games:

import requests
import pandas as pd

url = 'https://www.basketball-reference.com/boxscores/202110190MIL.html/'
# basketball-reference wraps many tables in HTML comments; stripping the
# comment markers makes them visible to pd.read_html
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
df1 = pd.read_html(html)[7]
df2 = pd.read_html(html)[12]
print(df1)
print(df2)

Two data frames like the ones this prints are my desired output.

While this works on specific games, the positions of the tables in the list are inconsistent across different games and years, so for other links the index is either out of bounds or returns the wrong table. I have tried to handle this with a myriad of special cases pointing to different indices, but that is cumbersome, slow, and ineffective.

The class and id attributes, on the other hand, seem to be structured the same across all pages from what I can tell, or at the very least are far more consistent, yet I cannot come up with a method to extract the tables by them universally. Ultimately I need the two basic box score tables extracted in full (including Team Totals) into two separate data frames from any potential game link. My input data has both team names, and I believe the three-letter abbreviations are included in the ids, so I should be able to use them for the lookup; a rough sketch of what I have in mind follows the example links below. If anyone can provide any help with this, that would be amazing. I have listed several other games with alternative structures below as examples. Thank you in advance.

https://www.basketball-reference.com/boxscores/202204010ORL.html/
https://www.basketball-reference.com/boxscores/202206160BOS.html/
https://www.basketball-reference.com/boxscores/194910290TRI.html/
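
For reference, this is roughly the lookup I am hoping to generalize. The id form box-<ABBR>-game-basic is my guess from inspecting a few pages; I have not verified it across seasons:

import requests
from bs4 import BeautifulSoup

# assumed id form: box-<ABBR>-game-basic, where <ABBR> is the three-letter
# team abbreviation from my input data (unverified for every season)
url = 'https://www.basketball-reference.com/boxscores/202110190MIL.html/'
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='box-MIL-game-basic')  # None if the id is absent
print(table is not None)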

CodePudding user response:

I don't think that pandas' read_html() method can handle these tables. You may have to do it manually, along the lines below. I tried it on two of your four urls and it works, but you may have to tweak it for some other pages or for a particular presentation. Remember that universal solutions in web scraping are rare, and even one that works today may not work next week.

This solution involves beautifulsoup and CSS selectors. I apologize for not adding a line-by-line explanation of the code, but it's getting late here; once you review it, it should become more or less self-evident:

So for a given url:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# strip the comment markers so the hidden tables are visible to the parser
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')

# both basic box score tables have ids ending in "-game-basic"
tabs = soup.select('table[id*="-game-basic"]')
for tab in tabs:
    cols, players = [], []
    # the second header row holds the actual column names
    for s in tab.select('thead tr:nth-child(2) th'):
        cols.append(s.text)
    # tbody holds the player rows, tfoot the Team Totals row
    for j in tab.select('tbody tr, tfoot tr'):
        player = [dat.text for dat in j.select('td,th')]
        players.append(player)

    # some of the rows in some tables have blank cells, and these need padding
    max_length = len(max(players, key=len))
    players_plus = [player + [""] * (max_length - len(player)) for player in players]

    print(pd.DataFrame(players_plus, columns=cols))
    print('------------------------')
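
If you would rather get the frames back than print them, one option is to wrap the same parsing in a function and key each frame by its table id. This is just a sketch, assuming the ids follow the box-<ABBR>-game-basic pattern and reusing the comment-stripping trick from your question:

import requests
import pandas as pd
from bs4 import BeautifulSoup

def game_frames(url):
    # strip comment markers so the tables are visible to the parser
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    soup = BeautifulSoup(html, 'html.parser')
    frames = {}
    for tab in soup.select('table[id*="-game-basic"]'):
        cols = [th.text for th in tab.select('thead tr:nth-child(2) th')]
        rows = [[cell.text for cell in tr.select('td,th')]
                for tr in tab.select('tbody tr, tfoot tr')]
        # pad short rows (divider rows, blank cells) to the header width
        rows = [r + [''] * (len(cols) - len(r)) for r in rows]
        frames[tab['id']] = pd.DataFrame(rows, columns=cols)
    return frames

dfs = game_frames('https://www.basketball-reference.com/boxscores/202110190MIL.html/')
print(list(dfs))  # the two table ids, one per team, e.g. box-MIL-game-basic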

CodePudding user response:

Just wanted to point out one thing first. pd.read_html() returns a list of tables/dataframes, so there's no need to call it twice on the same html string; that just parses the whole page a second time for the same result.

So rather than doing this:

df1 = pd.read_html(html)[7]
df2 = pd.read_html(html)[12]

Do:

dfs = pd.read_html(html) # <-- returns a list of dataframes
df1 = dfs[7]
df2 = dfs[12]

So to your question, there are a couple of ways to attack this:

  1. Insert some logic that checks the length of the rows and/or columns, and pull out the tables that match.

**Note: I don't like this solution in this case, as there are multiple tables that fit these criteria, so it's not the most robust in this situation.

urls = ['https://www.basketball-reference.com/boxscores/202110190MIL.html/',
        'https://www.basketball-reference.com/boxscores/202204010ORL.html/',
        'https://www.basketball-reference.com/boxscores/202206160BOS.html/',
        'https://www.basketball-reference.com/boxscores/194910290TRI.html/']

for url in urls:
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    output = []
    dfs = pd.read_html(html)

    # keep only the tables wide/long enough to be a full box score
    for each in dfs:
        if len(each.columns) >= 21 or len(each) >= 14:
            output.append(each)
  2. You can use the html attributes to get the table you want:

This is the best option: pull out the specific class or id you are after. This uses a regex to find the ids of the form box-<team id>-game-basic:

import requests
import pandas as pd
from bs4 import BeautifulSoup
import re


urls = ['https://www.basketball-reference.com/boxscores/202110190MIL.html/',
        'https://www.basketball-reference.com/boxscores/202204010ORL.html/',
        'https://www.basketball-reference.com/boxscores/202206160BOS.html/',
        'https://www.basketball-reference.com/boxscores/194910290TRI.html/']

for url in urls:
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    soup = BeautifulSoup(html, 'html.parser')
    # match only the two basic box score tables by their id pattern
    tables = soup.find_all('table', {'id': re.compile('box-.*-game-basic')})

    # str(tables) hands read_html only the matched tables, so the indices
    # are stable; header=1 uses the second header row for column names
    dfs = pd.read_html(str(tables), header=1)

    df1 = dfs[0]
    df2 = dfs[1]
    print(df1)
    print(df2)
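
Since your input data has the team abbreviations, you can also key the frames by team by pulling the abbreviation back out of each id. A sketch to drop inside the loop above, reusing the regex's box-<team id>-game-basic form:

    # after `tables` is built inside the loop above
    by_team = {}
    for table in tables:
        abbr = re.search(r'box-(.+)-game-basic', table['id']).group(1)
        by_team[abbr] = pd.read_html(str(table), header=1)[0]
    print(sorted(by_team))  # the two three-letter abbreviations for this game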