I want to include or exclude divs based on a specific tag

I want to scrape every football match played by my favorite club, Arsenal.

The website I intend to scrape: https://fbref.com.

For each match, I want to scrape every player who scored.

Here is my issue: if Arsenal haven't scored but, for instance, a player only received a red card, then I don't want to scrape that player's name.

Example of a div I want to include:

<div><div class="event_icon goal"></div> <a href="/en/players/d5dd5f1f/Pierre-Emerick-Aubameyang">Pierre-Emerick Aubameyang</a> · 17’</div>

Example of a div I want to exclude:

<div><div class="event_icon red_card"></div> <a href="/en/players/e61b8aee/Granit-Xhaka">Granit Xhaka</a> · 35’</div>

My question: how can I exclude a div/player that is related to a red card?

Example of a match link: https://fbref.com/en/matches/d4650aa2/Manchester-City-Arsenal-August-28-2021-Premier-League

My code

import re

import requests
from bs4 import BeautifulSoup

match_data_list = []

# team_list is not defined in this snippet; it is assumed to hold
# (link, opponent, venue) tuples.
for index, team in enumerate(team_list):
    link, opponent, venue = team

    # During development, only process the first few matches.
    if index > 4:
        break

    # Home matches use div#a, all other matches use div#b.
    selector = 'div.scorebox div#a' if venue == "Home" else 'div.scorebox div#b'

    data_team_stat = requests.get(link)
    soup_team = BeautifulSoup(data_team_stat.text, 'html.parser')
    match_data = soup_team.select(selector)
    for div in match_data:
        print(div)
        for player in div:
            # Keep only letters, '*', whitespace and hyphens (this drops the minute).
            player_name = re.sub(r"[^A-Za-z*éØ\s-]", "", player.text)
            player_name = player_name.strip()
            if player_name != "":
                data = (player_name, opponent)
                match_data_list.append(data)

CodePudding user response：

If you only want to scrape the goal events, make your selection more specific:

soup.select('.scorebox #b div:has(.goal)')

This CSS selector matches only the <div> elements inside the element with id="b" in the scorebox that contain an element with the class goal.
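To try the selector without requesting the site, here is a minimal offline sketch. The HTML string is a hand-written stand-in for the scorebox, built from the two divs in the question; the class names event_icon goal and event_icon red_card are an assumption about the real markup.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the scorebox; structure and class names are assumed.
SAMPLE_HTML = """
<div class="scorebox">
  <div class="event" id="b">
    <div><div class="event_icon goal"></div> <a href="/en/players/d5dd5f1f/Pierre-Emerick-Aubameyang">Pierre-Emerick Aubameyang</a> · 17’</div>
    <div><div class="event_icon red_card"></div> <a href="/en/players/e61b8aee/Granit-Xhaka">Granit Xhaka</a> · 35’</div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only the event <div>s that contain an element with class "goal".
goal_divs = soup.select(".scorebox #b div:has(.goal)")
scorers = [div.a.get_text(strip=True) for div in goal_divs]
print(scorers)  # ['Pierre-Emerick Aubameyang']
```

The red card div is skipped because it holds no element with the class goal, and the empty icon divs are skipped because they have no descendants at all.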

Or, to answer your question about excluding, combine the pseudo-classes as :not(:has()):

soup.select('.scorebox #b div:not(:has(.red_card))')
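The order of the two pseudo-classes matters. The sketch below compares both orders on a hand-written snippet (the markup and class names are assumptions, and names is a hypothetical helper): :not(:has(.red_card)) drops the div that contains a red card, while :has(:not(.red_card)) keeps any div with at least one descendant that is not a red card, which includes the player link.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the away side of the scorebox; markup is assumed.
SAMPLE_HTML = """
<div class="scorebox">
  <div class="event" id="b">
    <div><div class="event_icon goal"></div> <a href="/en/players/d5dd5f1f/Pierre-Emerick-Aubameyang">Pierre-Emerick Aubameyang</a> · 17’</div>
    <div><div class="event_icon red_card"></div> <a href="/en/players/e61b8aee/Granit-Xhaka">Granit Xhaka</a> · 35’</div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def names(selector):
    """Hypothetical helper: player names of all matched <div>s that hold a link."""
    return [div.a.get_text(strip=True) for div in soup.select(selector) if div.a]


# Excludes every <div> that has a red card somewhere inside it.
excluded = names("#b div:not(:has(.red_card))")
# Only requires one descendant that is not a red card, so nothing is excluded.
not_excluded = names("#b div:has(:not(.red_card))")

print(excluded)      # ['Pierre-Emerick Aubameyang']
print(not_excluded)  # ['Pierre-Emerick Aubameyang', 'Granit Xhaka']
```

The if div.a check is needed with :not(:has(...)) because the empty icon divs also match that selector but contain no link.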

Example

from bs4 import BeautifulSoup
import requests

url = 'https://fbref.com/en/matches/d4650aa2/Manchester-City-Arsenal-August-28-2021-Premier-League'
page = requests.get(url).content
soup = BeautifulSoup(page, 'html.parser')

goals_a = [p.a.text for p in soup.select('.scorebox #a div:has(.goal)')]
goals_b = [p.a.text for p in soup.select('.scorebox #b div:has(.goal)')]

Or with an explicit exclusion (the if p.a check skips matched <div> elements that hold no player link):

goals_a = [p.a.text for p in soup.select('.scorebox #a div:not(:has(.red_card))') if p.a]
goals_b = [p.a.text for p in soup.select('.scorebox #b div:not(:has(.red_card))') if p.a]

Output

['İlkay Gündoğan', 'Ferrán Torres', 'Gabriel Jesus', 'Rodri', 'Ferrán Torres']
[]
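To plug this into the loop from the question, the side can be picked from the venue. This is only a sketch: goal_scorers is a hypothetical helper, the mapping of "Home" to #a and everything else to #b follows the question's code, and the HTML is a hand-written stand-in rather than the real page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a scorebox with both sides; markup is assumed.
SAMPLE_HTML = """
<div class="scorebox">
  <div class="event" id="a">
    <div><a href="#">İlkay Gündoğan</a> · 7’ <div class="event_icon goal"></div></div>
  </div>
  <div class="event" id="b">
    <div><div class="event_icon goal"></div> <a href="#">Pierre-Emerick Aubameyang</a> · 17’</div>
    <div><div class="event_icon red_card"></div> <a href="#">Granit Xhaka</a> · 35’</div>
  </div>
</div>
"""


def goal_scorers(html, venue):
    """Return the goal scorers for the side that matches the venue."""
    # As in the question: home matches use #a, all other matches use #b.
    side = "a" if venue == "Home" else "b"
    soup = BeautifulSoup(html, "html.parser")
    selector = f".scorebox #{side} div:has(.goal)"
    return [div.a.get_text(strip=True) for div in soup.select(selector) if div.a]


print(goal_scorers(SAMPLE_HTML, "Home"))  # ['İlkay Gündoğan']
print(goal_scorers(SAMPLE_HTML, "Away"))  # ['Pierre-Emerick Aubameyang']
```

Inside the loop, the html argument would be the text of the response for each match link.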