requests.get can't find table

Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

url_joueurs = ('https://www.basketball-reference.com/leagues/NBA_2022_per_game.html')
result = requests.get(url_joueurs).text
data = BeautifulSoup(result, 'html.parser')

# Collect every HTML comment node on the page
comments = data.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(str(each), attrs={'id': 'totals_stats'})[0])
            break
        except:
            continue
Stats_joueurs = tables
print(Stats_joueurs)

The problem is that it returns an empty list (the DataFrame should come out contained in a list).

Do you have any idea where the problem is?

Thank you.

CodePudding user response：

This problem is solvable with pandas (in three lines of code):

import pandas as pd

df = pd.read_html('https://www.basketball-reference.com/leagues/NBA_2022_per_game.html')[0]
print(df)

Result:

      Rk  Player             Pos  Age   Tm    G   GS    MP   FG   FGA  ...   FT%  ORB  DRB   TRB  AST  STL  BLK  TOV   PF   PTS
0      1  Precious Achiuwa     C   22  TOR   73   28  23.6  3.6   8.3  ...  .595  2.0  4.5   6.5  1.1  0.5  0.6  1.2  2.1   9.1
1      2  Steven Adams         C   28  MEM   76   75  26.3  2.8   5.1  ...  .543  4.6  5.4  10.0  3.4  0.9  0.8  1.5  2.0   6.9
2      3  Bam Adebayo          C   24  MIA   56   56  32.6  7.3  13.0  ...  .753  2.4  7.6  10.1  3.4  1.4  0.8  2.6  3.1  19.1
3      4  Santi Aldama        PF   21  MEM   32    0  11.3  1.7   4.1  ...  .625  1.0  1.7   2.7  0.7  0.2  0.3  0.5  1.1   4.1
4      5  LaMarcus Aldridge    C   36  BRK   47   12  22.3  5.4   9.7  ...  .873  1.6  3.9   5.5  0.9  0.3  1.0  0.9  1.7  12.9
...  ...  ...                ...  ...  ...  ...  ...   ...  ...   ...  ...   ...  ...  ...   ...  ...  ...  ...  ...  ...   ...
837  601  Thaddeus Young      PF   33  TOR   26    0  18.3  2.6   5.5  ...  .481  1.5  2.9   4.4  1.7  1.2  0.4  0.8  1.7   6.3
838  602  Trae Young          PG   23  ATL   76   76  34.9  9.4  20.3  ...  .904  0.7  3.1   3.7  9.7  0.9  0.1  4.0  1.7  28.4
839  603  Omer Yurtseven       C   23  MIA   56   12  12.6  2.3   4.4  ...  .623  1.5  3.7   5.3  0.9  0.3  0.4  0.7  1.5   5.3
840  604  Cody Zeller          C   29  POR   27    0  13.1  1.9   3.3  ...  .776  1.9  2.8   4.6  0.8  0.3  0.2  0.7  2.1   5.2
841  605  Ivica Zubac          C   24  LAC   76   76  24.4  4.1   6.5  ...  .727  2.9  5.6   8.5  1.6  0.5  1.0  1.5  2.7  10.3

842 rows × 30 columns

Relevant pandas docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

CodePudding user response：

While Barry provides the code to get the data, there's no explanation of what is wrong with your code. There are two problems:

While those reference.com sites do keep some of their tables inside HTML comments, that is not the case on this particular page. The <table> tag you are after is in the static HTML, whereas your code only looks for <table> tags inside the HTML comments.
Even then, you are asking pd.read_html to look for a <table> tag with the attribute id="totals_stats". There is no such table in this HTML; the table you want has id="per_game_stats".
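
If you want to check for yourself where a table lives before deciding how to scrape it, something like the following may help. It is only a rough sketch using the standard library's html.parser. The TableIdFinder class, the sample_html snippet, and the "hidden_stats" id are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class TableIdFinder(HTMLParser):
    """Collect <table> ids, noting whether each sits in live markup or in a comment."""

    def __init__(self, inside_comment=False):
        super().__init__()
        self.inside_comment = inside_comment
        self.found = []  # list of (table_id, location) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            location = "comment" if self.inside_comment else "static"
            self.found.append((dict(attrs).get("id"), location))

    def handle_comment(self, data):
        # Re-parse the comment body so tables hidden in comments are reported too.
        nested = TableIdFinder(inside_comment=True)
        nested.feed(data)
        self.found.extend(nested.found)


# Hypothetical miniature page: one live table and one commented-out table.
sample_html = """
<div><table id="per_game_stats"><tr><td>1</td></tr></table></div>
<!-- <table id="hidden_stats"><tr><td>2</td></tr></table> -->
"""

finder = TableIdFinder()
finder.feed(sample_html)
print(finder.found)
```

On the real page you would feed it requests.get(url_joueurs).text instead of sample_html. A table reported as "static" can be read directly by pd.read_html, while one reported as "comment" needs the comment-extraction approach from the question.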

As stated, just let pandas parse the table tags for you. Then use one simple line to drop the repeated header rows:

import pandas as pd

url_joueurs = ('https://www.basketball-reference.com/leagues/NBA_2022_per_game.html')
df = pd.read_html(url_joueurs)[0]
df = df[df['Rk'].ne('Rk')]
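
To see what that last filter does without hitting the site, here is a small offline sketch. The three-column frame is a hand-made stand-in for the scraped table (the values are borrowed from the output shown earlier), and the reset_index and pd.to_numeric steps are optional extras I am assuming you may want, not something the page requires.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: the header row repeats mid-table,
# and every value arrives as a string.
df = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Player": ["Precious Achiuwa", "Steven Adams", "Player", "Bam Adebayo"],
        "PTS": ["9.1", "6.9", "PTS", "19.1"],
    }
)

# Keep only the rows whose 'Rk' cell is not the literal header text.
clean = df[df["Rk"].ne("Rk")].reset_index(drop=True)

# Assumption: the stat columns should be numeric once the header rows are gone.
clean["Rk"] = pd.to_numeric(clean["Rk"])
clean["PTS"] = pd.to_numeric(clean["PTS"])

print(clean)
```

The repeated header row is the reason the columns come back as strings in the first place, so the numeric conversion only works after the filter has removed it.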