BeautifulSoup scraping incorrect table

I was scraping this site with the following code:

import requests
from bs4 import BeautifulSoup

url = "https://www.pro-football-reference.com/teams/buf/2021_injuries.htm"
r = requests.get(url)
stats_page = BeautifulSoup(r.content, features="lxml")

table = stats_page.findAll('table')[0] #get FIRST table on page
for player in table.findAll("tr"):
    print([i.getText() for i in player.findAll("td")])

The output is:

[]
['', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR']
['', 'Q', '', '', '', '', '', '']
['', '', '', '', 'Q', '', '', '']
['', '', '', 'O', '', '', '', 'IR']
['', '', 'Q', '', '', '', '', '']
['', '', '', 'Q', '', '', '', '']
['', '', '', '', 'Q', '', '', '']
['O', 'Q', '', '', '', '', '', '']
['', '', '', '', 'Q', '', '', '']
['', 'Q', '', 'Q', '', '', '', '']
['', '', '', 'O', '', '', '', '']
['Q', '', '', '', '', '', '', '']
['', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR']
['', '', 'Q', '', '', '', '', '']
['', 'IR', 'IR', 'IR', 'IR', '', '', '']

This is clearly the output I would expect from the 2nd table on the page, "Team Injuries", rather than the 1st table on the page, "Week 10 injury report". Any idea why BeautifulSoup is seemingly ignoring the first table on the page?

Answer:

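The table you want is inside an HTML comment. BeautifulSoup stores a comment as a single string node and does not parse its contents as HTML, so findAll('table') never sees that table and the first match is the next table on the page.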
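To see the effect in isolation, here is a minimal sketch on made-up HTML (the ids `report` and `team_injuries` are invented for illustration and are not taken from the real page). It uses the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup, Comment

# Made-up HTML: one table hidden inside a comment, one visible table.
html = """
<div id="all_report"><!--
<table id="report"><tr><td>DE</td><td>Injured Reserve</td></tr></table>
--></div>
<table id="team_injuries"><tr><td>Q</td><td>IR</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the visible table is part of the parsed tree.
visible_ids = [t.get("id") for t in soup.find_all("table")]

# The commented-out markup is a single Comment string node.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

# Parsing that string separately exposes the hidden table.
hidden = BeautifulSoup(comments[0], "html.parser")
hidden_ids = [t.get("id") for t in hidden.find_all("table")]

print(visible_ids)  # ['team_injuries']
print(hidden_ids)   # ['report']
```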

You will need to first locate this comment containing the table and then parse the HTML inside that separately. For example:

import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.pro-football-reference.com/teams/buf/2021_injuries.htm"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

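# Comments are stored as Comment strings, so search for them explicitly
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        # Parse the comment's text as HTML in its own right
        soup_table = BeautifulSoup(comment, "lxml")
        table = soup_table.find_all('table')[0]  # first table inside this comment

        for player in table.find_all("tr"):
            print([i.get_text() for i in player.find_all("td")])

        break  # stop after the first comment that contains a table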

This prints:

[]
['DE', '', 'Injured Reserve', '']
['OG', '', 'Injured Reserve', '']
['WR', '', 'Injured Reserve', '']
['DE', 'DNP', '', 'Rest']
['WR', 'DNP', '', 'Rest']
['T', 'Limited', '', 'Back']
['ILB', 'DNP', '', 'Hamstring']
['CB', 'Limited', '', 'Hamstring']
['CB', 'Limited', '', 'Concussion']
['TE', 'Limited', '', 'Hand']
['RB', 'DNP', '', 'Concussion']
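If you need this in more than one place, the same idea can be wrapped in a small helper. The sketch below is only one way to do it: the function name `rows_from_commented_table`, the `table_id` parameter, and the sample HTML are all hypothetical. It also collects `th` cells along with `td` cells, on the assumption (not checked against the live page) that row labels such as the player name sit in `th` elements, which would explain why they are missing from the output above.

```python
from bs4 import BeautifulSoup, Comment


def rows_from_commented_table(html, table_id=None):
    """Return the cell text of each row of the first table found in an HTML comment.

    table_id is optional; when given, only a table with that id is accepted.
    """
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if "<table" not in comment:
            continue  # ordinary comment, nothing to parse
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id) if table_id else inner.find("table")
        if table is None:
            continue
        # Collect th and td so row-header cells are kept.
        return [
            [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")
        ]
    return []


# Made-up sample; the id and the player names are placeholders.
sample = """
<div><!-- just a note --></div>
<div><!--
<table id="hypothetical_report">
  <tr><th>Player</th><th>Pos</th><th>Status</th></tr>
  <tr><th>Player A</th><td>DE</td><td>Injured Reserve</td></tr>
  <tr><th>Player B</th><td>T</td><td>Limited</td></tr>
</table>
--></div>
"""

rows = rows_from_commented_table(sample, table_id="hypothetical_report")
for row in rows:
    print(row)
```

Matching on an id rather than taking the first table is less likely to break if the page later adds another commented-out table ahead of the one you want.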