I am trying to pull HTML data from baseball-reference.com. I assumed that if I went to the website and viewed the page source, the HTML tags I need would be in the HTML itself. However, after further investigation, the set of tags I care about is inside a comment block.
Example: https://www.baseball-reference.com/leagues/AL/2021-standard-batting.shtml. View the page source and search for this tag:
<div id="div_players_standard_batting">
The data I am looking for is below this line, and if you look above it you will see that a comment block opens with <!-- and doesn't close until almost the end of the HTML file.
I can pull the HTML comments with the following code, but it comes with a few issues:
- The result is a list, and I only care about the element that contains the data
- It comes with newline characters
- I am struggling with how to take the "players standard batting" comment string and re-parse it as HTML so I can use BeautifulSoup to grab the data I want
Code:
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
import os.path
import requests

# Fetch the page and parse the full document
r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, "html.parser")  # try lxml

# Pull every comment node out of the document
Data = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment))]
Data
Current Environment Settings:
dependencies:
- python=3.9.7
- beautifulsoup4=4.11.1
- jupyterlab=3.3.2
- pandas=1.4.2
- pyodbc=4.0.32
The end goal: a pandas DataFrame that holds each player's data from this web page.
CodePudding user response:
You are on the right track, you just have to put the individual parts together. In the ResultSet there should be only one element containing id="div_players_standard_batting", so filter for it and pass that element to pandas.read_html() to get a DataFrame:
pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0])[0]
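For readability, the same idea can be broken into steps. This is only a sketch of the one-liner above, reusing soup, Comment and pd from the question; the variable names and the optional clean-up of repeated in-table header rows are my additions:
# Collect every comment node, then keep the one that wraps the batting table
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
batting_comment = [c for c in comments if 'id="div_players_standard_batting"' in c][0]

# read_html() returns a list of DataFrames; the table we want is the first one
df = pd.read_html(batting_comment)[0]

# baseball-reference repeats the header row inside the table body, so drop
# any rows where the Rk column holds the literal string "Rk" (if present)
df = df[df["Rk"] != "Rk"].reset_index(drop=True)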
Or, as an alternative, create a new BeautifulSoup object from that comment and iterate over its rows:
soup = BeautifulSoup([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0], "html.parser")
for row in soup.select('table tr'):
    ...
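If you go the row-iteration route instead, the loop could be fleshed out roughly as follows; this is a sketch only, where the rows/cells names and the assumption that the first row holds the column names are mine:
# Collect the text of every header/data cell, row by row
rows = []
for row in soup.select('table tr'):
    cells = [cell.get_text(strip=True) for cell in row.select('th, td')]
    if cells:
        rows.append(cells)

# Treat the first row as the column names; repeated in-table header rows
# (if any) would still need to be filtered out afterwards
df = pd.DataFrame(rows[1:], columns=rows[0])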
Output of the read_html() version:
| | Rk | Name | Age | Tm | Lg | G | PA | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | BA | OBP | SLG | OPS | OPS+ | TB | GDP | HBP | SH | SF | IBB | Pos Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fernando Abad* | 35 | BAL | AL | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | Cory Abbott | 25 | CHC | NL | 8 | 3 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.333 | 0.333 | 0.333 | 0.667 | 81 | 1 | 0 | 0 | 0 | 0 | 0 | /1H |
| 2 | 3 | Albert Abreu | 25 | NYY | AL | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | Bryan Abreu | 24 | HOU | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 5 | José Abreu | 34 | CHW | AL | 152 | 659 | 566 | 86 | 148 | 30 | 2 | 30 | 117 | 1 | 0 | 61 | 143 | 0.261 | 0.351 | 0.481 | 0.831 | 125 | 272 | 28 | 22 | 0 | 10 | 3 | *3D/5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1787 | 1720 | Bruce Zimmermann* | 26 | BAL | AL | 2 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | -100 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1788 | 1721 | Jordan Zimmermann | 35 | MIL | NL | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | -100 | 0 | 0 | 0 | 0 | 0 | 0 | /1 |
| 1789 | 1722 | Tyler Zuber | 26 | KCR | AL | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | -100 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1790 | 1723 | Mike Zunino | 30 | TBR | AL | 109 | 375 | 333 | 64 | 72 | 11 | 2 | 33 | 62 | 0 | 0 | 34 | 132 | 0.216 | 0.301 | 0.559 | 0.86 | 137 | 186 | 7 | 7 | 0 | 1 | 0 | 2/H |
| 1791 | nan | LgAvg per 600 PA | nan | nan | nan | 205 | 600 | 535 | 73 | 130 | 26 | 2 | 20 | 69 | 7 | 2 | 52 | 139 | 0.243 | 0.316 | 0.41 | 0.726 | nan | 219 | 11 | 7 | 2 | 4 | 2 | nan |
CodePudding user response:
First pull the raw HTML, then strip the comment markers (<!-- and -->) with a string replace or a regex, and parse the result with beautifulsoup4. I think this will do the trick.
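A minimal sketch of that idea, assuming the same page as in the question; the variable names are mine, and re.sub() is used for the marker removal (a plain str.replace() on "<!--" and "-->" would work just as well):
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML and strip the comment markers so the hidden
# tables become ordinary, parseable markup
html = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml").text
html = re.sub(r"<!--|-->", "", html)

soup = BeautifulSoup(html, "html.parser")

# The wrapper div id comes from the question; hand its table to pandas
div = soup.find("div", id="div_players_standard_batting")
df = pd.read_html(str(div))[0]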