I want to scrape every football match of my favorite club, Arsenal.
The website I intend to scrape: https://fbref.com.
I want to scrape every player who has scored in each match.
Here is my issue: if Arsenal haven't scored but, for instance, only had a player sent off with a red card, then I don't want to scrape that player's name.
Example of a div I want to include:
<div><div ></div> <a href="/en/players/d5dd5f1f/Pierre-Emerick-Aubameyang">Pierre-Emerick Aubameyang</a> · 17’</div>
Example of a div I want to exclude:
<div><div ></div> <a href="/en/players/e61b8aee/Granit-Xhaka">Granit Xhaka</a> · 35’</div> </div>
My question: how can I exclude a div/player that is related to a red card?
Example of a match link: https://fbref.com/en/matches/d4650aa2/Manchester-City-Arsenal-August-28-2021-Premier-League
My code:
import re

import requests
from bs4 import BeautifulSoup

match_data_list = []
# team_list is assumed to hold (link, opponent, venue) tuples
for index, team in enumerate(team_list):
    link, opponent, venue = team
    # During development, only process the first five matches (index 0 to 4)
    if index <= 4:
        if venue == "Home":
            data_team_stat = requests.get(link)
            soup_team = BeautifulSoup(data_team_stat.text, "html.parser")
            match_data = soup_team.select('div.scorebox div#a')
            for div in match_data:
                print(div)
                for player in div:
                    # Keep only letters, whitespace and hyphens from the event text
                    player_name = re.sub(r"[^A-Za-z*éØ\s-]", "", player.text)
                    player_name = player_name.strip()
                    if player_name != "":
                        data = (player_name, opponent)
                        match_data_list.append(data)
        else:
            data_team_stat = requests.get(link)
            soup_team = BeautifulSoup(data_team_stat.text, "html.parser")
            match_data = soup_team.select('div.scorebox div#b')
            for div in match_data:
                print(div)
                for player in div:
                    # Keep only letters, whitespace and hyphens from the event text
                    player_name = re.sub(r"[^A-Za-z*éØ\s-]", "", player.text)
                    player_name = player_name.strip()
                    if player_name != "":
                        data = (player_name, opponent)
                        match_data_list.append(data)
    else:
        break
CodePudding user response:
If you only want to scrape the goal events, make your selector more specific:
soup.select('.scorebox #b div:has(.goal)')
This uses CSS selectors to pick only the <div> elements inside the element with id=b in the scorebox that contain an element with the class goal.
Or, to answer your question about excluding elements, combine the pseudo-classes as :not(:has()):
soup.select('.scorebox #b div:not(:has(.red_card))')
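As a quick offline check of both selectors, here is a minimal sketch that runs them against a static snippet instead of the live page. The markup is an assumption: it reuses the two divs from the question and gives the icon divs the classes goal and red_card that the selectors above rely on; the real page may carry additional classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class names "goal" and "red_card" are assumed
# from the selectors above, not copied from the live page.
html = """
<div class="scorebox">
  <div id="b">
    <div><div class="goal"></div> <a href="/en/players/d5dd5f1f/Pierre-Emerick-Aubameyang">Pierre-Emerick Aubameyang</a> · 17’</div>
    <div><div class="red_card"></div> <a href="/en/players/e61b8aee/Granit-Xhaka">Granit Xhaka</a> · 35’</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Include: event divs that contain a .goal icon
scorers = [d.a.text for d in soup.select(".scorebox #b div:has(.goal)")]

# Exclude: every div without a .red_card icon; the icon divs themselves
# also match, so skip anything that has no <a> inside
not_sent_off = [
    d.a.text
    for d in soup.select(".scorebox #b div:not(:has(.red_card))")
    if d.a
]

print(scorers)
print(not_sent_off)
```

Both lists contain only the goal scorer here. Note that the exclude variant keeps every event that is not a red card, so the :has(.goal) form is the stricter choice if only goals are wanted.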
Example
from bs4 import BeautifulSoup
import requests

url = 'https://fbref.com/en/matches/d4650aa2/Manchester-City-Arsenal-August-28-2021-Premier-League'
page = requests.get(url).content
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(page, 'html.parser')

# Scorers for side a and side b of the scorebox
goals_a = [p.a.text for p in soup.select('.scorebox #a div:has(.goal)')]
goals_b = [p.a.text for p in soup.select('.scorebox #b div:has(.goal)')]
Or with an explicit exclude (the if p.a check skips divs that contain no player link):
goals_a = [p.a.text for p in soup.select('.scorebox #a div:not(:has(.red_card))') if p.a]
goals_b = [p.a.text for p in soup.select('.scorebox #b div:not(:has(.red_card))') if p.a]
Output (goals_a, then goals_b):
['İlkay Gündoğan', 'Ferrán Torres', 'Gabriel Jesus', 'Rodri', 'Ferrán Torres']
[]
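If it helps, the duplicated Home/Away branches from the question could be folded into one helper. This is only a rough sketch: it assumes, as the question's code does, that the home side sits in #a and the away side in #b, and the names goal_selector and scorers are hypothetical, not part of BeautifulSoup or fbref.

```python
from bs4 import BeautifulSoup


def goal_selector(venue):
    """Build the CSS selector for goal events of the team of interest."""
    # Assumption from the question's code: "Home" maps to #a, anything else to #b
    side = "a" if venue == "Home" else "b"
    return f".scorebox #{side} div:has(.goal)"


def scorers(html, venue):
    """Return the names of the players who scored for the given side."""
    soup = BeautifulSoup(html, "html.parser")
    # Skip any matched div that has no player link inside
    return [div.a.text for div in soup.select(goal_selector(venue)) if div.a]
```

Inside the question's loop this could be called as scorers(requests.get(link).text, venue), pairing each returned name with opponent before appending to match_data_list.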