How to scrape data from within a comment block and create a dataframe?

Time:05-29

I am trying to pull HTML data from baseball-reference.com. I expected the HTML tags I need to appear as ordinary markup in the page source. On closer inspection, however, the tags I care about sit inside comment blocks.

Example: https://www.baseball-reference.com/leagues/AL/2021-standard-batting.shtml In the page source ("View Page Source"), search for this tag:

<div  id="div_players_standard_batting">

The code I am looking for is below this line. If you look just above it, you will see that a comment block starts with <!-- and does not end until almost the end of the HTML file.

I can pull the HTML comments with the following code, but it comes with a few issues.

  1. The result is a list, and I only care about the one element that holds the data.
  2. It contains newline characters.
  3. I am struggling with how to take the players standard batting string and reparse it as HTML so that I can use BeautifulSoup to grab the data I want.

Code:

from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
import os.path
import requests

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, "html.parser") # try lxml

Data=[x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment))]
Data

Current Environment Settings:

dependencies:
  - python=3.9.7
  - beautifulsoup4=4.11.1
  - jupyterlab=3.3.2
  - pandas=1.4.2
  - pyodbc=4.0.32

The end goal: Be able to have a pandas dataframe that has each player's data from this web page.

CodePudding user response:

You are on the right track; you just have to put the individual parts together.

The ResultSet should contain only one comment that includes id="div_players_standard_batting", so filter for that comment and pass it to pandas.read_html() to turn it into a DataFrame:

pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0])[0]

Alternatively, create a new BeautifulSoup object from that comment and iterate over its rows:

soup = BeautifulSoup([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0], "html.parser")
for row in soup.select('table tr'):
    ...

Output:

Rk Name Age Tm Lg G PA AB R H 2B 3B HR RBI SB CS BB SO BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1 Fernando Abad* 35 BAL AL 2 0 0 0 0 0 0 0 0 0 0 0 0 nan nan nan nan nan 0 0 0 0 0 0 1
1 2 Cory Abbott 25 CHC NL 8 3 3 0 1 0 0 0 0 0 0 0 1 0.333 0.333 0.333 0.667 81 1 0 0 0 0 0 /1H
2 3 Albert Abreu 25 NYY AL 3 0 0 0 0 0 0 0 0 0 0 0 0 nan nan nan nan nan 0 0 0 0 0 0 1
3 4 Bryan Abreu 24 HOU AL 1 0 0 0 0 0 0 0 0 0 0 0 0 nan nan nan nan nan 0 0 0 0 0 0 1
4 5 José Abreu 34 CHW AL 152 659 566 86 148 30 2 30 117 1 0 61 143 0.261 0.351 0.481 0.831 125 272 28 22 0 10 3 *3D/5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1787 1720 Bruce Zimmermann* 26 BAL AL 2 4 4 0 0 0 0 0 0 0 0 0 3 0 0 0 0 -100 0 0 0 0 0 0 1
1788 1721 Jordan Zimmermann 35 MIL NL 2 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 -100 0 0 0 0 0 0 /1
1789 1722 Tyler Zuber 26 KCR AL 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 -100 0 0 0 0 0 0 1
1790 1723 Mike Zunino 30 TBR AL 109 375 333 64 72 11 2 33 62 0 0 34 132 0.216 0.301 0.559 0.86 137 186 7 7 0 1 0 2/H
1791 nan LgAvg per 600 PA nan nan nan 205 600 535 73 130 26 2 20 69 7 2 52 139 0.243 0.316 0.41 0.726 nan 219 11 7 2 4 2 nan
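
For anyone who wants to try the idea without hitting the site, here is a minimal offline sketch of the same steps: find the comments, keep the one that contains the marker, reparse it, and build a DataFrame from the rows. The sample HTML, the player names, and the helper name commented_table_to_df are made up for illustration; only the BeautifulSoup and pandas calls are real. It builds the frame by hand from the rows, so every value comes back as a string.

```python
import pandas as pd
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page: the table sits inside a comment.
SAMPLE_HTML = """
<html><body>
<div id="all_players_standard_batting">
<!--
<div id="div_players_standard_batting">
<table id="players_standard_batting">
<thead><tr><th>Rk</th><th>Name</th><th>HR</th></tr></thead>
<tbody>
<tr><td>1</td><td>Player A</td><td>30</td></tr>
<tr><td>2</td><td>Player B</td><td>33</td></tr>
</tbody>
</table>
</div>
-->
</div>
<!-- some unrelated comment -->
</body></html>
"""


def commented_table_to_df(html, marker):
    """Build a DataFrame from the table inside the first comment containing `marker`."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    # Keep only the comment that holds the markup we care about.
    wanted = next(c for c in comments if marker in c)
    # Reparse the comment text as ordinary HTML.
    inner = BeautifulSoup(str(wanted), "html.parser")
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in inner.select("table tr")
    ]
    # First row is the header, the rest are data rows.
    header, *body = rows
    return pd.DataFrame(body, columns=header)


df = commented_table_to_df(SAMPLE_HTML, 'id="div_players_standard_batting"')
print(df)
```

If you go the pd.read_html() route instead, note that newer pandas releases (2.1 and later, as far as I know) deprecate passing a raw HTML string, so wrapping the comment first, as in pd.read_html(io.StringIO(str(comment))), is probably the safer form there.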

CodePudding user response:

First pull the raw HTML, then strip the comment delimiters (<!-- and -->) with a regex substitution so that the commented-out markup becomes regular HTML. Then parse the result with beautifulsoup4. I think this will do the trick.
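
A rough sketch of what that could look like, assuming the goal is to drop only the <!-- and --> markers and keep what they wrap. The sample HTML and the uncomment helper are hypothetical. I use re.sub here because Python's str.replace does not take a regex.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw page text (r.text in the question's code).
SAMPLE_HTML = """
<div id="all_players_standard_batting">
<!--
<div id="div_players_standard_batting">
<table><tr><th>Name</th></tr><tr><td>Player A</td></tr></table>
</div>
-->
</div>
"""


def uncomment(html):
    """Drop the <!-- and --> delimiters but keep whatever they wrapped."""
    return re.sub(r"<!--|-->", "", html)


before = BeautifulSoup(SAMPLE_HTML, "html.parser")
after = BeautifulSoup(uncomment(SAMPLE_HTML), "html.parser")

# Inside the comment the div is invisible to find(); after stripping it is a normal tag.
hidden = before.find(id="div_players_standard_batting")
visible = after.find(id="div_players_standard_batting")
print(hidden is None, visible is not None)  # prints: True True
```

One caveat: this uncomments everything on the page, so any genuine comments end up in the parsed tree as plain text. That is usually harmless if you then select the table by its id.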
