BeautifulSoup Isn't Producing A Result

Time:02-12

I'm trying to scrape a site that posts live play-by-play data. In this example the game is already over, but I still want to scrape the play-by-play. When I run the code below, I get no results. The url variable is the real link to the site.

import requests
from bs4 import BeautifulSoup

url = "http://pointstreak.com/baseball/gamelive/?gameid=483832"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
# Look for the first-inning play-by-play rows
game = soup.find_all('tr', class_="inning_1 pbp-bottom-border")

print(game)

My result is []

Can anyone point me in the right direction so I can grab the data inside the tag? On a side note, I'm new to this, so I may be making a rookie mistake.

CodePudding user response:

The comments are correct: the content is sourced from another URL entirely. You can view the source of the page by adding the view-source: prefix, as in view-source:http://pointstreak.com/baseball/gamelive/?gameid=483832. There you can see that the game is actually sourced from http://baseball.pointstreak.com/scoreboard.html?leagueid=120. If you plug that URL into your request you can get to the content, but you may need to add headers so the request looks like it comes from a browser rather than a bare script. Something like this should let you parse the soup and get meaningful content:

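import requests
from bs4 import BeautifulSoup

# Start from the default requests headers and swap in a browser-like User-Agent
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
url = 'http://baseball.pointstreak.com/scoreboard.html?leagueid=120'
r = requests.get(url, headers=headers)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.content, 'html.parser')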
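
To check for yourself whether the rows you want are in the HTML you downloaded, or whether the page only points at other URLs, something along these lines may help. It is a minimal sketch that uses only the standard library's html.parser. The SourceCollector class, the inspect_html helper and the sample markup are all made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class SourceCollector(HTMLParser):
    """Collect src URLs of <iframe>/<script> tags and class values of <tr> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []
        self.row_classes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("iframe", "script") and attrs.get("src"):
            self.sources.append(attrs["src"])
        elif tag == "tr" and attrs.get("class"):
            self.row_classes.append(attrs["class"])


def inspect_html(html):
    """Return (external sources, <tr> class values) found in the markup."""
    collector = SourceCollector()
    collector.feed(html)
    return collector.sources, collector.row_classes


# Hypothetical markup standing in for a page whose data is loaded from elsewhere
sample_html = """
<html><body>
  <script src="/js/live.js"></script>
  <iframe src="http://baseball.pointstreak.com/scoreboard.html?leagueid=120"></iframe>
  <table><tr><td>No play-by-play rows here</td></tr></table>
</body></html>
"""

sources, row_classes = inspect_html(sample_html)
print(sources)      # URLs the page pulls in
print(row_classes)  # empty list: no <tr> carries a class in the raw HTML
```

With a real response you would pass r.text instead of sample_html. If row_classes never contains the class you are searching for, the rows are not in that document and find_all has nothing to match.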

CodePudding user response:

You'll have to grab the play-by-play from the final box score page. That's quite easy because the data sits in <table> tags, so you can let pandas do the work for you (read_html parses the page with lxml or bs4 under the hood).

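import pandas as pd

url = 'http://baseball.pointstreak.com/boxscore.html?gameid=514852'
dfs = pd.read_html(url)  # one DataFrame per <table> on the page

# Find the first table whose top-left cell marks the start of the play-by-play
start_idx = None
for idx, df in enumerate(dfs):
    if not df.empty and df.iloc[0, 0] == 'Top of 1st':
        start_idx = idx
        break

if start_idx is None:
    raise ValueError("No table starting with 'Top of 1st' was found")

# Stack every table from that point on into one play-by-play frame
pbp = pd.concat(dfs[start_idx:])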

Output:

print(pbp)
                                      0                                                  1
0                            Top of 1st                                   Madison Mallards
1                        #55 E.J. Ranel  Foul, Ball, Ball, Swinging Strike, Ball, Foul,...
2                      #48 Ben Anderson  Ball, 48 Ben Anderson advances to 2nd (double ...
3                    #50 Justice Bigbie  Swinging Strike, Swinging Strike, 50 Justice B...
4                   #45 Austin Blazevic  45 Austin Blazevic putout (fly out to the shor...
..                                  ...                                                ...
5                Offensive Substitution              32 Zach Klapak subs for Kyle Simmons.
6                       #32 Zach Klapak  Ball, Swinging Strike, Called Strike, Ball, 32...
7                    #6 Brandon Seltzer  30 Kaeber Rog Scores Earned (6), 6 Brandon Sel...
8                        #20 Adam Frank  Called Strike, Ball, Ball, Called Strike, 20 A...
9   Runs: 1, Hits: 0, Errors: 0, LOB: 2                Runs: 1, Hits: 0, Errors: 0, LOB: 2

[120 rows x 2 columns]
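
In the output above, each table opens with a row naming the half-inning (for example "Top of 1st"), so the concatenated frame loses track of which inning a play belongs to. One possible follow-up, sketched here under that assumption, is to tag every play with its half-inning before concatenating. The label_half_innings helper, the column names and the sample frames (including the "Home Team" placeholder) are hypothetical stand-ins for the tables read_html returns.

```python
import pandas as pd


def label_half_innings(tables):
    """Tag every play with the half-inning named in its table's first cell."""
    labeled = []
    for table in tables:
        if table.empty:
            continue
        half = table.iloc[0, 0]        # e.g. 'Top of 1st'
        plays = table.iloc[1:].copy()  # drop the row that names the half-inning
        plays.columns = ["batter", "play"]
        plays.insert(0, "half_inning", half)
        labeled.append(plays)
    return pd.concat(labeled, ignore_index=True)


# Hypothetical two-column tables shaped like the ones in the output above
sample = [
    pd.DataFrame([["Top of 1st", "Madison Mallards"],
                  ["#55 E.J. Ranel", "Foul, Ball"]]),
    pd.DataFrame([["Bottom of 1st", "Home Team"],
                  ["#20 Adam Frank", "Called Strike, Ball"]]),
]

pbp_labeled = label_half_innings(sample)
print(pbp_labeled)
```

On the real page you would pass the slice of tables that starts at the first play-by-play table. The sketch assumes every table has exactly two columns, as in the printed output.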