Cannot scrape some table using Pandas

Time:08-06

I'm very new to Python, and I'm trying to get some tables from this page:

https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html

Using pandas and pd.read_html I'm able to get most of them, but not the "Line Score" and "Four Factors" tables. If I print all the tables (there are 19), these two are missing. Inspecting the page in Chrome, they appear to be tables, and Excel also picks them up when importing from the web. What am I missing here? Any help appreciated, thanks!

CodePudding user response:

If you look at the page source (not the inspector view, which shows the page after scripts have run), you'll see those tables are inside HTML comments. You can either 1) remove the <!-- and --> markers from the HTML string and then let pandas parse it, or 2) use bs4 to pull out the comments and parse the tables from those.

I'll show you both options:

Option 1: Remove the comment tags from the page source

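import requests
import pandas as pd
from io import StringIO

url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
# Strip the comment markers so the hidden tables become regular markup
html = requests.get(url).text.replace("<!--", "").replace("-->", "")

# Wrap the string in StringIO: newer pandas versions deprecate passing literal HTML
dfs = pd.read_html(StringIO(html), header=1)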

Output:

You can see you now have 21 tables, and the two in question are at indexes 3 and 4 (the 4th and 5th tables).

print(len(dfs))
for each in dfs[3:5]:
    print('\n\n', each, '\n')

21


        Unnamed: 0   1   2   3   4   T
0  Minnesota Lynx  18  14  22  23  77
1   Seattle Storm  30  26  22  11  89 



   Unnamed: 0  Pace   eFG%  TOV%  ORB%  FT/FGA   ORtg
0        MIN  97.0  0.507  16.1  14.3   0.101   95.2
1        SEA  97.0  0.579  11.8   9.7   0.114  110.1 
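Option 1 removes every comment marker on the page, which may also expose commented-out markup that has nothing to do with tables. A narrower variant, sketched below, unwraps only the comments that contain a table. The helper name `uncomment_tables` and the small `sample` string are made up for illustration and are not part of pandas or requests; the sketch also assumes comments are not nested.

```python
import re

# Matches one HTML comment and captures its body (non-greedy, spans newlines).
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def uncomment_tables(html):
    """Unwrap only the comments whose body contains a <table> tag."""
    def unwrap(match):
        body = match.group(1)
        # Leave ordinary comments untouched; expose table-bearing ones.
        return body if "<table" in body.lower() else match.group(0)

    return COMMENT_RE.sub(unwrap, html)


# Hypothetical stand-in for a downloaded page.
sample = (
    "<div><!-- just a note --></div>"
    "<div><!-- <table><tr><td>1</td></tr></table> --></div>"
)
print(uncomment_tables(sample))
```

The returned string can then be parsed the same way as in Option 1, for example with pd.read_html(StringIO(uncomment_tables(html)), header=1).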

Option 2: Pull out comments with bs4

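import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
from io import StringIO


url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')

# Tables in the regular markup (reuse the downloaded HTML instead of fetching the page twice)
dfs = pd.read_html(StringIO(result), header=1)

# Collect every comment node in the document
comments = data.find_all(string=lambda text: isinstance(text, Comment))

other_tables = []
for each in comments:
    if '<table' in str(each):
        try:
            other_tables.append(pd.read_html(StringIO(str(each)), header=1)[0])
        except Exception:
            # Skip comments that pandas cannot parse into a table
            continue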

Output:

for each in other_tables:
    print(each, '\n')


       Unnamed: 0   1   2   3   4   T
0  Minnesota Lynx  18  14  22  23  77
1   Seattle Storm  30  26  22  11  89 

  Unnamed: 0  Pace   eFG%  TOV%  ORB%  FT/FGA   ORtg
0        MIN  97.0  0.507  16.1  14.3   0.101   95.2
1        SEA  97.0  0.579  11.8   9.7   0.114  110.1 
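If you run into the same problem on another page, a quick check is to count how many table tags sit in the regular markup versus inside comments. The sketch below uses only the standard library; the `TableCounter` class, the `count_tables` function, and the `sample` string are hypothetical names for this example, and the count inside comments is a plain text match rather than a full parse.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> tags in live markup and inside HTML comments."""

    def __init__(self):
        super().__init__()
        self.visible = 0
        self.commented = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.visible += 1

    def handle_comment(self, data):
        # Comment bodies arrive as raw text, so count the tags textually.
        self.commented += data.lower().count("<table")


def count_tables(html):
    """Return (tables in regular markup, tables hidden in comments)."""
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    return counter.visible, counter.commented


# Hypothetical stand-in for a downloaded page.
sample = (
    "<table><tr><td>a</td></tr></table>"
    "<!-- <table><tr><td>b</td></tr></table> -->"
)
print(count_tables(sample))
```

A non-zero second number suggests that pd.read_html will miss some tables unless one of the two options above is applied first.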