This is the site I'm trying to retrieve information from: https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml I want to get the box score data so like the Oakland A's total batting average in the game, at bats in the game, etc. However, when I retreive and print the html from the site, these box scores are missing completely from the html. Any suggestions? Thanks. Here's my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify)
Please help! Thanks! I tried selenium and had the same problem.
CodePudding user response:
The page is loaded by javascript. Try using the requests_html
package instead. See below sample.
from bs4 import BeautifulSoup
from requests_html import HTMLSession
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
s = HTMLSession()
page = s.get(url, timeout=20)
page.html.render()
soup = BeautifulSoup(page.html.html, 'html.parser')
print(soup.prettify)
CodePudding user response:
The other tables are there in the requested html, but within the comments. So you need to parse out the comments to get those additional tables:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
comments = data.find_all(string=lambda text: isinstance(text, Comment))
tables = pd.read_html(url)
for each in comments:
if 'table' in str(each):
try:
tables.append(pd.read_html(str(each))[0])
except:
continue
Output:
Oakland
print(tables[2].to_string())
Batting AB R H RBI BB SO PA BA OBP SLG OPS Pit Str WPA aLI WPA WPA- cWPA acLI RE24 PO A Details
0 Mark Canha LF 6.0 1.0 1.0 3.0 0.0 0.0 6.0 0.247 0.379 0.415 0.793 23.0 19.0 0.011 0.58 0.040 -0.029% 0.01% 1.02 1.0 1.0 0.0 2B
1 Starling Marte CF 3.0 0.0 2.0 3.0 0.0 1.0 4.0 0.325 0.414 0.476 0.889 12.0 7.0 0.116 0.90 0.132 -0.016% 0.12% 1.57 2.8 1.0 0.0 2B,HBP
2 Stephen Piscotty PH-RF 1.0 0.0 1.0 2.0 0.0 0.0 2.0 0.211 0.272 0.349 0.622 7.0 3.0 0.000 0.00 0.000 0.000% 0% 0.00 2.0 1.0 0.0 HBP
3 Matt Olson 1B 6.0 0.0 1.0 2.0 0.0 0.0 6.0 0.283 0.376 0.566 0.941 21.0 13.0 -0.057 0.45 0.008 -0.065% -0.06% 0.78 -0.6 9.0 1.0 GDP
4 Mitch Moreland DH 5.0 3.0 2.0 2.0 0.0 0.0 6.0 0.230 0.290 0.415 0.705 23.0 16.0 0.049 0.28 0.064 -0.015% 0.05% 0.50 1.5 NaN NaN 2·HR,HBP
5 Josh Harrison 2B 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.294 0.366 0.435 0.801 7.0 3.0 0.057 1.50 0.057 0.000% 0.06% 2.63 0.6 0.0 0.0 NaN
6 Tony Kemp 2B 4.0 3.0 3.0 0.0 1.0 0.0 5.0 0.252 0.370 0.381 0.751 16.0 10.0 -0.001 0.14 0.009 -0.010% 0% 0.24 1.6 2.0 2.0 NaN
7 Sean Murphy C 4.0 3.0 2.0 2.0 2.0 1.0 6.0 0.224 0.318 0.419 0.737 25.0 15.0 0.143 0.38 0.151 -0.007% 0.15% 0.67 2.7 7.0 0.0 2B
8 Matt Chapman 3B 1.0 3.0 0.0 0.0 5.0 1.0 6.0 0.214 0.310 0.365 0.676 31.0 10.0 0.051 0.28 0.051 0.000% 0.05% 0.49 2.2 1.0 3.0 NaN
9 Seth Brown RF-CF 5.0 1.0 1.0 1.0 0.0 1.0 6.0 0.204 0.278 0.451 0.730 18.0 12.0 -0.067 0.40 0.000 -0.067% -0.07% 0.70 -1.7 4.0 0.0 SF
10 Elvis Andrus SS 5.0 2.0 1.0 2.0 1.0 0.0 6.0 0.233 0.283 0.310 0.593 20.0 15.0 0.015 0.42 0.050 -0.034% 0.02% 0.73 -0.1 0.0 4.0 NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 Chris Bassitt P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 0.0 NaN
13 A.J. Puk P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
14 Deolis Guerra P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
15 Jake Diekman P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
16 Team Totals 40.0 17.0 14.0 17.0 10.0 4.0 54.0 0.350 0.500 0.575 1.075 203.0 123.0 0.317 0.41 0.562 -0.243% 0.33% 0.72 12.2 27.0 10.0 NaN