I'm trying to scrape a site that publishes live play-by-play data. In this example the game is already over, but I still want to scrape its play-by-play. When I run the code below, I get no results. The url variable holds the real link to the site.
import requests
import re
import json
from bs4 import BeautifulSoup
url = "http://pointstreak.com/baseball/gamelive/?gameid=483832"
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
game = soup.findAll('tr',class_ = "inning_1 pbp-bottom-border")
print(game)
My result is []
Can anyone point me in the right direction so I can grab the data inside the tag? On a side note, I'm new to this, so I may be making a rookie mistake.
CodePudding user response:
The comments are correct: the content is loaded from a different URL entirely. You can inspect the page source by prefixing the URL with view-source: (that is, view-source:http://pointstreak.com/baseball/gamelive/?gameid=483832). There you can see that the game is actually sourced from http://baseball.pointstreak.com/scoreboard.html?leagueid=120. Plugging that URL into your request gets you to the content, but you may need to add headers to the request (basically posing as a browser instead of sending a bare request with the default python-requests User-Agent). Something like this lets you parse the soup and get meaningful content:
import requests
from bs4 import BeautifulSoup
headers = requests.utils.default_headers()
headers.update({
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
url = 'http://baseball.pointstreak.com/scoreboard.html?leagueid=120'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
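Once you have a soup, it is worth checking whether the rows you want are in the downloaded HTML at all. Below is a minimal sketch of that check. The helper name find_inning_rows and both sample HTML strings are made up for illustration; the real page markup may differ. It uses a CSS selector instead of class_="inning_1 pbp-bottom-border", because the class_ string form only matches when the class attribute is exactly that string in that order.

```python
from bs4 import BeautifulSoup


def find_inning_rows(html, inning=1):
    """Return the <tr> rows for one inning from already-downloaded HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # The selector requires both classes but ignores their order in the attribute.
    return soup.select(f"tr.inning_{inning}.pbp-bottom-border")


# Hypothetical markup: a page shell whose rows would be filled in later by a script.
static_shell = "<html><body><div id='pbp'></div><script src='app.js'></script></body></html>"

# Hypothetical markup: the same kind of page with the rows present.
rendered = """
<table>
  <tr class="inning_1 pbp-bottom-border"><td>Ball</td></tr>
  <tr class="pbp-bottom-border inning_1"><td>Foul</td></tr>
  <tr class="inning_2 pbp-bottom-border"><td>Strike</td></tr>
</table>
"""

print(len(find_inning_rows(static_shell)))  # no matching rows in the shell
print(len(find_inning_rows(rendered)))      # both inning_1 rows match
```

If the helper returns an empty list for r.content, the rows are probably not in that response, and you need to find the URL that actually serves them.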
CodePudding user response:
You'll have to grab the play-by-play from the final box score page. That's quite easy, since they put it in <table> tags. You can let pandas do the work for you (it uses lxml or bs4 under the hood).
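import pandas as pd

url = 'http://baseball.pointstreak.com/boxscore.html?gameid=514852'
dfs = pd.read_html(url)  # one DataFrame per <table> on the page

# Find the first table whose top-left cell marks the start of the play-by-play
for idx, df in enumerate(dfs):
    if df.iloc[0, 0] == 'Top of 1st':
        startIdx = idx
        break

# Stack that table and every table after it into one DataFrame
pbp = pd.concat(dfs[startIdx:])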
Output:
print(pbp)
    0                                    1
0   Top of 1st                           Madison Mallards
1   #55 E.J. Ranel                       Foul, Ball, Ball, Swinging Strike, Ball, Foul,...
2   #48 Ben Anderson                     Ball, 48 Ben Anderson advances to 2nd (double ...
3   #50 Justice Bigbie                   Swinging Strike, Swinging Strike, 50 Justice B...
4   #45 Austin Blazevic                  45 Austin Blazevic putout (fly out to the shor...
..  ...                                  ...
5   Offensive Substitution               32 Zach Klapak subs for Kyle Simmons.
6   #32 Zach Klapak                      Ball, Swinging Strike, Called Strike, Ball, 32...
7   #6 Brandon Seltzer                   30 Kaeber Rog Scores Earned (6), 6 Brandon Sel...
8   #20 Adam Frank                       Called Strike, Ball, Ball, Called Strike, 20 A...
9   Runs: 1, Hits: 0, Errors: 0, LOB: 2  Runs: 1, Hits: 0, Errors: 0, LOB: 2

[120 rows x 2 columns]
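One caveat with the loop above: if no table on the page starts with 'Top of 1st' (for example, a game with no play-by-play recorded), startIdx is never assigned and the concat line raises a NameError. Here is a rough sketch of a more defensive version. The function name extract_play_by_play and the small sample tables are made up for illustration; they stand in for whatever pd.read_html returns.

```python
import pandas as pd


def extract_play_by_play(tables, marker="Top of 1st"):
    """Concatenate every table from the first one whose top-left cell equals marker.

    Returns None when no table starts with the marker.
    """
    for idx, df in enumerate(tables):
        # Skip empty tables so iloc[0, 0] cannot raise an IndexError.
        if not df.empty and str(df.iloc[0, 0]) == marker:
            # ignore_index gives one continuous row index instead of restarting per table.
            return pd.concat(tables[idx:], ignore_index=True)
    return None


# Made-up sample tables standing in for the list returned by pd.read_html(url).
sample_tables = [
    pd.DataFrame([["Team", "R"], ["Visitors", "1"]]),
    pd.DataFrame([["Top of 1st", "Visitors"], ["#1 Batter A", "Ball, Foul"]]),
    pd.DataFrame([["Bottom of 1st", "Home"], ["#2 Batter B", "Called Strike"]]),
]

pbp = extract_play_by_play(sample_tables)
print(pbp)
print(extract_play_by_play(sample_tables[:1]))  # None: no play-by-play table found
```

With the real page you would call extract_play_by_play(pd.read_html(url)) and check for None before using the result. Drop ignore_index=True if you prefer to keep the per-table row numbers shown in the output above.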