I recently worked on a scraper for NBA box scores from www.basketball-reference.com and am very new to Beautiful Soup. I have been incorporating modified code from the incredible basketball-reference-scraper. Unfortunately, it relies on the widget function offered by Sports Reference, and several thousand game widgets are entirely broken (Sports Reference seems aware of this). A good example is this link: https://www.basketball-reference.com/boxscores/196911280CIN.html So my question is: does anyone know a method for extracting the two basic stat tables from the original HTML?
CodePudding user response:
Comments don't hold good formatting, so here goes nothing:
from bs4 import BeautifulSoup
import requests
# make a request to get the page
page = requests.get("https://www.basketball-reference.com/boxscores/196911280CIN.html")
# part of the request that gets returned is the html content, which is stored in page.content
page = page.content
# create a new BeautifulSoup instance with the content and tell it to use the html parser.
soup = BeautifulSoup(page, "html.parser")
# find all instances of a specific element and print them out.
for i, e in enumerate(soup.find_all('table')):
    print(i, e)
CodePudding user response:
You are in luck: most of the tables and elements on that site have id attributes, so you can use those in your BeautifulSoup selectors.
Also, since the data is structured and the tables have a consistent layout, they should be straightforward to parse.
See this example: https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/
Look at the section on: Searching for tags by class and id
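To make the id-based approach concrete, here is a minimal sketch. It assumes that the two basic stat tables have ids ending in "-game-basic" (for example "box-CIN-game-basic"); that pattern is my assumption, so check it against the page source. The function name extract_basic_tables and the sample HTML are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Assumption: the basic box score tables have ids ending in "-game-basic".
# Verify this against the real page source before relying on it.
BASIC_TABLE_ID = re.compile(r"-game-basic$")


def extract_basic_tables(html):
    """Return {table_id: rows}, where each row is a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    tables = {}
    # Passing a compiled regex as the id filter matches any table whose id fits it.
    for table in soup.find_all("table", id=BASIC_TABLE_ID):
        rows = []
        for tr in table.find_all("tr"):
            # Header cells are <th>, data cells are <td>; collect both in order.
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            if cells:
                rows.append(cells)
        tables[table["id"]] = rows
    return tables


# Tiny made-up sample standing in for a real box score page.
SAMPLE_HTML = """
<table id="line_score"><tr><td>ignored</td></tr></table>
<table id="box-AAA-game-basic">
  <thead><tr><th>Player</th><th>PTS</th></tr></thead>
  <tbody><tr><th>Player One</th><td>21</td></tr></tbody>
</table>
<table id="box-BBB-game-basic">
  <thead><tr><th>Player</th><th>PTS</th></tr></thead>
  <tbody><tr><th>Player Two</th><td>18</td></tr></tbody>
</table>
"""

basic_tables = extract_basic_tables(SAMPLE_HTML)
for table_id, rows in basic_tables.items():
    print(table_id, rows)
```

For the real page you would pass in requests.get(url).text instead of SAMPLE_HTML. If a table you expect does not show up, it may be wrapped in an HTML comment in the page source, in which case it has to be pulled out of the comment before parsing.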