I am scraping the FBref website to get specific player info so that I can use it for further analysis.
I have selected the table I want to scrape. The information I want is in <tr> tags without any class attribute.
The issue is that this table also has many header rows in <tr> tags that do have a class name.
import requests
from bs4 import BeautifulSoup
from time import sleep
url = "https://fbref.com/en/comps/9/2021-2022/stats/2021-2022-Premier-League-Stats"
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(response, "html.parser")
I have selected the table I want to scrape. I want to select only the <tr> tags that don't have a class attribute, because that's where the information I want is located.
players_table = soup.select("table#stats_standard tbody tr", class_ =None)
I have then looped through the players_table so that I can get each player's info like name, country, position, etc.
for player in players_table:
    player_name = player.find("td", attrs={"data-stat" : "player"}).a.text
    print(player_name)
    sleep(2)
The problem is that when my loop reaches one of the header <tr> tags, it tries to find its <a> tag and then the text inside that <a> tag. This specific <tr> tag doesn't have any <a> tags, so my code breaks with the error message 'NoneType' object has no attribute 'a' when I run it.
My code prints the names of the players until it reaches this <tr> tag with no <a>, then it fails.
I have even tried to decompose or clear this <tr> tag, but it still doesn't work.
player.find(".thead").decompose()
So my question is: how can I select only the <tr> tags that don't have any class, so that when my loop reaches a header <tr> tag, it just skips it? I have tried doing that by passing class_ = None when selecting the table rows:
players_table = soup.select("table#stats_standard tbody tr", class_ =None)
But this didn't solve anything. I need your help on this, please.
CodePudding user response:
If you only want to exclude the subheaders, adjust your selector so that it selects only the <tr> elements without the class thead:
soup.select('table#stats_standard tbody tr:not(.thead)')
or, more specific to the title of your question, select only the rows that do not have a class attribute at all:
soup.select('table#stats_standard tbody tr:not([class])')
Example
import requests
from bs4 import BeautifulSoup
from time import sleep
url = "https://fbref.com/en/comps/9/2021-2022/stats/2021-2022-Premier-League-Stats"
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
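soup = BeautifulSoup(response, "html.parser")  # name the parser explicitly to avoid a bs4 warning
# iterate only over rows without a class attribute, which skips the repeated header rows
for player in soup.select('table#stats_standard tbody tr:not([class])'):
    player_name = player.find("td", attrs={"data-stat" : "player"}).a.text
    print(player_name)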
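To try the selector without hitting the site, here is a minimal sketch on a made-up HTML snippet. SAMPLE_HTML and player_names are hypothetical names used only for illustration, and the markup is a simplified assumption about the table layout, not a copy of the real page. The extra None check is optional; it only keeps the loop from failing if a row lacks the expected cell or link.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the layout described above:
# data rows have no class, the repeated header row has class "thead".
SAMPLE_HTML = """
<table id="stats_standard">
  <tbody>
    <tr><td data-stat="player"><a href="/p/1">Max Aarons</a></td></tr>
    <tr class="thead"><th data-stat="player">Player</th></tr>
    <tr><td data-stat="player"><a href="/p/2">Che Adams</a></td></tr>
  </tbody>
</table>
"""


def player_names(html):
    """Return player names from rows that carry no class attribute."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("table#stats_standard tbody tr:not([class])")
    names = []
    for row in rows:
        cell = row.find("td", attrs={"data-stat": "player"})
        # Skip rows that lack the expected cell or link instead of raising.
        if cell is None or cell.a is None:
            continue
        names.append(cell.a.get_text(strip=True))
    return names


print(player_names(SAMPLE_HTML))
```

Swapping the selector for tr:not(.thead) gives the same result on this snippet, because the only row with a class is the thead one.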
CodePudding user response:
Why not just let pandas parse that? Then you can do whatever you want with the table.
from io import StringIO
import requests
import pandas as pd
url = "https://fbref.com/en/comps/9/2021-2022/stats/2021-2022-Premier-League-Stats"
# strip the HTML comment markers so that tables inside comments are parsed too
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
# wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(response), header=1)[-1]
# drop the repeated header rows, where the 'Rk' column holds the text 'Rk'
df = df[df['Rk'].ne('Rk')]
Output:
print(df)
      Rk                Player   Nation    Pos  ...  xG+xA  npxG.1  npxG+xA.1  Matches
0      1            Max Aarons  eng ENG     DF  ...   0.07    0.02       0.07  Matches
1      2             Che Adams  sct SCO     FW  ...   0.43    0.31       0.43  Matches
2      3       Rayan Aït Nouri   fr FRA     DF  ...   0.10    0.04       0.10  Matches
3      4       Kristoffer Ajer   no NOR     DF  ...   0.10    0.04       0.10  Matches
4      5            Nathan Aké   nl NED     DF  ...   0.16    0.11       0.16  Matches
..   ...                   ...      ...    ...  ...    ...     ...        ...      ...
562  542         Wilfried Zaha   ci CIV     FW  ...   0.46    0.13       0.29  Matches
563  543  Christoph Zimmermann   de GER     DF  ...   0.04    0.04       0.04  Matches
564  544   Oleksandr Zinchenko   ua UKR     DF  ...   0.21    0.04       0.21  Matches
565  545          Hakim Ziyech   ma MAR  FW,MF  ...   0.47    0.23       0.47  Matches
566  546            Kurt Zouma   fr FRA     DF  ...   0.04    0.04       0.04  Matches
[546 rows x 33 columns]
or
for player in df['Player']:
    print(player)
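The header filter can be checked offline as well. The following is a small sketch on a hand-built DataFrame: raw and drop_repeated_headers are hypothetical names, and the three-column frame is an assumed miniature of the parsed table, not real output from the site. Resetting the index and converting Rk to numbers are optional extras that the code above does not do.

```python
import pandas as pd

# Hypothetical miniature of the parsed table, including one repeated header row.
raw = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Player": ["Max Aarons", "Che Adams", "Player", "Rayan Aït Nouri"],
        "Pos": ["DF", "FW", "Pos", "DF"],
    }
)


def drop_repeated_headers(df, key="Rk"):
    """Remove rows whose key column repeats the column name, then renumber."""
    cleaned = df[df[key].ne(key)].reset_index(drop=True)
    # The key column was read as text because of the header rows; make it numeric.
    cleaned[key] = pd.to_numeric(cleaned[key])
    return cleaned


players = drop_repeated_headers(raw)
print(players["Player"].tolist())
```

Without reset_index the original row labels are kept, which is why the printed frame above ends at index 566 while containing 546 rows.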