How to use the Beautiful Soup find function to extract HTML elements

Time:10-30

I am trying to use Beautiful Soup to pull the table corresponding to the HTML code below

<table class="sortable stats_table now_sortable" id="team_pitching" data-cols-to-freeze=",2">
    <caption>Team Pitching</caption>

from https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2. Here is a screenshot of the site layout and HTML code I am trying to extract from.

I was using the code

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2'
res = requests.get(url)
soup1 = BS(res.content, 'html.parser')
table1  = soup1.find('table',{'id':'team_pitching'})
table1

I can't seem to figure out how to get this working. The table above it on the page (team batting) can be extracted with the line

table1  = soup1.find('table',{'id':'team_batting'})

and I figured similar code should work for the one below. Additionally, is there a way to extract this using the table class "sortable stats_table now_sortable" rather than id?

CodePudding user response:

The problem is that when you open the page normally it shows all the tables, but if you load the page with Developer Tools only the first table is shown. So when you make your request, the remaining tables are not included in the HTML you get back. The table you're looking for is not shown until the "Show team pitching" button is pressed. To handle this you could use Selenium and get the full HTML response.

CodePudding user response:

That is because the table you are looking for, i.e. the <table> with id="team_pitching", is present as an HTML comment inside the soup, so find() on tags never sees it. You can check this for yourself.
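
One way to run that check is sketched below. To keep it self-contained it uses a small hypothetical HTML string in place of the downloaded page (the markup is an assumption modelled on the snippet in the question, not a copy of the real site) and the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page: the table sits inside an HTML comment
html = """
<div id="all_team_pitching">
  <!--
  <table class="sortable stats_table" id="team_pitching">
    <caption>Team Pitching</caption>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# A normal tag search misses the table, because comment contents are not parsed as tags
direct_hit = soup.find('table', {'id': 'team_pitching'})

# Collect every comment node and keep those whose text mentions the table id
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
hits = [c for c in comments if 'id="team_pitching"' in c]

print(direct_hit)  # None
print(len(hits))   # 1
```

If the same check on the real response also finds the id only inside a comment, that confirms the diagnosis.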

You need to

  • Extract that comment from the soup
  • Convert it into a soup object
  • Extract the table data from the soup object.

Here is the complete code that performs the steps above.

from bs4 import BeautifulSoup, Comment
import requests

url = 'https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

main_div = soup.find('div', {'id': 'all_team_pitching'})

# Find the HTML comment inside the <div> selected above; it holds the table markup
comment = main_div.find(string=lambda x: isinstance(x, Comment))

# Parse the comment's text as HTML to get a new soup object
s = BeautifulSoup(comment, 'lxml')
trs = s.find('table', {'id': 'team_pitching'}).find_all('tr')

# Printing the first four entries of the table, skipping the row at trs[0]
for tr in trs[1:5]:
    print(list(tr.stripped_strings))

The first four entries from the table:

['1', 'Tyler Ahearn', '21', '1', '0', '1.000', '1.93', '6', '0', '0', '1', '9.1', '8', '5', '2', '0', '4', '14', '0', '0', '0', '42', '1.286', '7.7', '0.0', '3.9', '13.5', '3.50']
['2', 'Jack Anderson', '20', '2', '0', '1.000', '0.79', '4', '1', '0', '0', '11.1', '6', '4', '1', '0', '3', '11', '1', '0', '0', '45', '0.794', '4.8', '0.0', '2.4', '8.7', '3.67']
['3', 'Shane Drohan', '*', '21', '0', '1', '.000', '4.08', '4', '4', '0', '0', '17.2', '15', '12', '8', '0', '11', '27', '1', '0', '2', '82', '1.472', '7.6', '0.0', '5.6', '13.8', '2.45']
['4', 'Conor Grady', '21', '2', '0', '1.000', '3.00', '4', '4', '0', '0', '15.0', '10', '5', '5', '3', '8', '15', '1', '0', '2', '68', '1.200', '6.0', '1.8', '4.8', '9.0', '1.88']
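
As for selecting by class rather than id, a minimal sketch follows. It again uses a hypothetical HTML string, and it assumes the comment has already been parsed into its own soup object. Note that the class list is shared by several tables, so a class-based search returns more than one match, and the classes in the raw HTML may differ from what the browser shows (for instance if a script adds one later; that is an assumption here, not something verified against the site).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables that share the same classes but have different ids
html = """
<table class="sortable stats_table" id="team_batting"><caption>Team Batting</caption></table>
<table class="sortable stats_table" id="team_pitching"><caption>Team Pitching</caption></table>
"""
s = BeautifulSoup(html, 'html.parser')

# class_ with a single name matches any table that carries that class
by_class = s.find_all('table', class_='stats_table')

# A CSS selector can require several classes at once, in any order
by_css = s.select('table.sortable.stats_table')

captions = [t.caption.get_text() for t in by_css]
print(len(by_class), captions)  # 2 ['Team Batting', 'Team Pitching']
```

Because both tables match, the id is still the more precise handle. The class search is mainly useful when you want every stats table at once.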