Here is my code:
import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url_joueurs = ('https://www.basketball-reference.com/leagues/NBA_2022_per_game.html')
result = requests.get(url_joueurs).text
data = BeautifulSoup(result, 'html.parser')
comments = data.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(str(each), attrs = {'id': 'totals_stats'})[0])
            break
        except:
            continue
Stats_joueurs = tables
print(Stats_joueurs)
The problem is that it returns an empty list (the expected output is a pandas DataFrame contained in a list).
Do you have any idea where the problem is?
Thank you.
CodePudding user response:
This problem is solvable with pandas (in three lines of code):
import pandas as pd
df = pd.read_html('https://www.basketball-reference.com/leagues/NBA_2022_per_game.html')[0]
print(df)
Result:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Precious Achiuwa C 22 TOR 73 28 23.6 3.6 8.3 ... .595 2.0 4.5 6.5 1.1 0.5 0.6 1.2 2.1 9.1
1 2 Steven Adams C 28 MEM 76 75 26.3 2.8 5.1 ... .543 4.6 5.4 10.0 3.4 0.9 0.8 1.5 2.0 6.9
2 3 Bam Adebayo C 24 MIA 56 56 32.6 7.3 13.0 ... .753 2.4 7.6 10.1 3.4 1.4 0.8 2.6 3.1 19.1
3 4 Santi Aldama PF 21 MEM 32 0 11.3 1.7 4.1 ... .625 1.0 1.7 2.7 0.7 0.2 0.3 0.5 1.1 4.1
4 5 LaMarcus Aldridge C 36 BRK 47 12 22.3 5.4 9.7 ... .873 1.6 3.9 5.5 0.9 0.3 1.0 0.9 1.7 12.9
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
837 601 Thaddeus Young PF 33 TOR 26 0 18.3 2.6 5.5 ... .481 1.5 2.9 4.4 1.7 1.2 0.4 0.8 1.7 6.3
838 602 Trae Young PG 23 ATL 76 76 34.9 9.4 20.3 ... .904 0.7 3.1 3.7 9.7 0.9 0.1 4.0 1.7 28.4
839 603 Omer Yurtseven C 23 MIA 56 12 12.6 2.3 4.4 ... .623 1.5 3.7 5.3 0.9 0.3 0.4 0.7 1.5 5.3
840 604 Cody Zeller C 29 POR 27 0 13.1 1.9 3.3 ... .776 1.9 2.8 4.6 0.8 0.3 0.2 0.7 2.1 5.2
841 605 Ivica Zubac C 24 LAC 76 76 24.4 4.1 6.5 ... .727 2.9 5.6 8.5 1.6 0.5 1.0 1.5 2.7 10.3
842 rows × 30 columns
Relevant pandas docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
CodePudding user response:
While Barry provides you with the code to get the data, there's no explanation of what the problem with your code is. There are 2 problems:
- While those reference.com sites DO have some of their tables within the HTML comments, this particular page does not. The <table> tag you are after is in the static HTML, while you are looking for <table> tags within the comments of the HTML.
- Even then, you are looking for the <table> tag with the attribute id="totals_stats". There is no table with that attribute in this HTML. The table in this HTML has the attribute id="per_game_stats".
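To check for yourself where a table lives, you can list the table ids found in the static markup and those found only inside comments. This is a minimal sketch run on a made-up miniature page rather than the real site; the id "hidden_stats" and the names static_ids and comment_ids are hypothetical, chosen only for illustration:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: one table in the static markup,
# one table that exists only inside an HTML comment.
html = """
<div>
  <table id="per_game_stats"><tr><th>Rk</th></tr><tr><td>1</td></tr></table>
  <!-- <table id="hidden_stats"><tr><th>Rk</th></tr></table> -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Tables present in the static HTML (commented-out markup is not parsed as tags).
static_ids = [t.get("id") for t in soup.find_all("table")]

# Tables that only appear once the comment text is parsed on its own.
comment_ids = []
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    inner = BeautifulSoup(comment, "html.parser")
    comment_ids.extend(t.get("id") for t in inner.find_all("table"))

print("static:", static_ids)
print("in comments:", comment_ids)
```

If the id you want shows up in the first list, pd.read_html can read it straight from the page; only ids in the second list need the comment-parsing approach.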
As stated, just let pandas parse the table tags for you. Then add one simple line to remove the repeated header rows:
import pandas as pd
url_joueurs = ('https://www.basketball-reference.com/leagues/NBA_2022_per_game.html')
df = pd.read_html(url_joueurs)[0]
df = df[df['Rk'].ne('Rk')]
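To see what that last line does without hitting the site, here is a small sketch on a hand-built DataFrame. The frame is a hypothetical stand-in for the scraped table, using three rows from the output above plus one repeated header row. The numeric conversion at the end is an optional extra step, assuming the columns came in as strings because of the header rows:

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: the header row reappears as data.
df = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Player": ["Precious Achiuwa", "Steven Adams", "Player", "Bam Adebayo"],
        "PTS": ["9.1", "6.9", "PTS", "19.1"],
    }
)

# Keep only rows whose 'Rk' value is not the literal header text, then renumber.
df = df[df["Rk"].ne("Rk")].reset_index(drop=True)

# Optional: convert the string columns to numbers now that the header rows are gone.
df["Rk"] = pd.to_numeric(df["Rk"])
df["PTS"] = pd.to_numeric(df["PTS"])

print(df)
print(df.dtypes)
```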