How to scrape this football page?

https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A

I want to scrape the Team Stats, such as Possession and Shots on Target, as well as the stats below them, like Fouls and Corners.

What I have now is overly complicated code that basically strips and splits this string multiple times to grab the values I want.

import time
import pandas as pd
import requests
from bs4 import BeautifulSoup
# getting a general info dataframe with all matches
championship_url = 'https://fbref.com/en/comps/24/1495/schedule/2016-Serie-A-Scores-and-Fixtures'
data = requests.get(championship_url)
time.sleep(3)
matches = pd.read_html(data.text, match="Resultados e Calendários")[0]

#putting stats info in each match entry (this is an example match to test)
match_url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
data = requests.get(match_url)
time.sleep(3)
soup = BeautifulSoup(data.text, features='lxml')

# ID the match to merge later on
home_team = soup.find("h1").text.split()[0]
round_week = float(soup.find("div", {'id': 'content'}).text.split()[18].strip(')'))

# collecting stats
stats = soup.find("div", {"id": "team_stats"}).text.split()[5:] #first part of stats with the progress bars
stats_extra = soup.find("div", {"id": "team_stats_extra"}).text.split()[2:] #second part

all_stats = {'posse_casa':[], 'posse_fora':[], 'chutestotais_casa':[], 'chutestotais_fora':[],
             'acertopasses_casa':[], 'acertopasses_fora':[], 'chutesgol_casa':[], 'chutesgol_fora':[],
             'faltas_casa':[], 'faltas_fora':[], 'escanteios_casa':[], 'escanteios_fora':[],
             'cruzamentos_casa':[], 'cruzamentos_fora':[], 'contatos_casa':[], 'contatos_fora':[],
             'botedef_casa':[], 'botedef_fora':[], 'aereo_casa':[], 'aereo_fora':[],
             'defesas_casa':[], 'defesas_fora':[], 'impedimento_casa':[], 'impedimento_fora':[],
             'tirometa_casa':[], 'tirometa_fora':[], 'lateral_casa':[], 'lateral_fora':[],
             'bolalonga_casa':[], 'bolalonga_fora':[], 'Em casa':[home_team], 'Sem':[round_week]}

# not copying everything, but each stat is extracted roughly like this
#stats = '\nEstatísticas do time\n\n\nCoritiba \n\n\n\t\n\n\n\n\n\n\n\n\n\n Cuiabá\n\nPosse\n\n\n\n42%\n\n\n\n\n\n58%\n\n\n\n\nChutes ao gol\n\n\n\n2 of 4\xa0—\xa050%\n\n\n\n\n\n0%\xa0—\xa00 of 8\n\n\n\n\nDefesas\n\n\n\n0 of 0\xa0—\xa0%\n\n\n\n\n\n50%\xa0—\xa01 of 2\n\n\n\n\nCartões\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
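# raw text of the div (a string shaped like the example above, not the split list), header removed
stats_text = soup.find("div", {"id": "team_stats"}).text.replace('\n','').replace('\t','')[20:]
# first grabbing 42% possession (kept as a one-element list, like the other columns)
all_stats['posse_casa'] = [stats_text.split('Posse')[1][:5].split('%')[0]]
# grabbing 58% possession
all_stats['posse_fora'] = [stats_text.split('Posse')[1][:5].split('%')[1]]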

all_stats_df = pd.DataFrame.from_dict(all_stats)
championship_data = matches.merge(all_stats_df, on=['Em casa','Sem'])

There are a lot of stats in that dict because FBref has all of them for previous championship years; only the current year's championship is limited to 12 of them. I intend to run the code on 5-6 different years, so I made a version with all stats, and for current-year games I intend to leave a stat empty when it is not on the page to scrape.

Answer:

You can get Fouls, Corners and Offsides, plus 7 tables' worth of data, from that page with the following code:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

coritiba_fouls = soup.find('div', string='Fouls').previous_sibling.text.strip()
cuiaba_fouls = soup.find('div', string='Fouls').next_sibling.text.strip()

coritiba_corners = soup.find('div', string='Corners').previous_sibling.text.strip()
cuiaba_corners = soup.find('div', string='Corners').next_sibling.text.strip()

coritiba_offsides = soup.find('div', string='Offsides').previous_sibling.text.strip()
cuiaba_offsides = soup.find('div', string='Offsides').next_sibling.text.strip()

print('Coritiba Fouls: ' + coritiba_fouls, 'Cuiaba Fouls: ' + cuiaba_fouls)
print('Coritiba Corners: ' + coritiba_corners, 'Cuiaba Corners: ' + cuiaba_corners)
print('Coritiba Offsides: ' + coritiba_offsides, 'Cuiaba Offsides: ' + cuiaba_offsides)
dfs = pd.read_html(r.text)
print('Number of tables: ' + str(len(dfs)))
for df in dfs:
    print(df)
    print('___________')

This will print in the terminal:

Coritiba Fouls: 16 Cuiaba Fouls: 12
Coritiba Corners: 4 Cuiaba Corners: 4
Coritiba Offsides: 0 Cuiaba Offsides: 1
Number of tables: 7
   Coritiba (4-2-3-1)     Coritiba (4-2-3-1).1
0                  23             Alex Muralha
1                   2        Matheus Alexandre
2                   3                 Henrique
3                   4           Luciano Castán
4                   6    Egídio Pereira Júnior
5                   9              Léo Gamalho
6                  11               Alef Manga
7                  25          Bernanrdo Lemes
8                  78                    Régis
9                  97                 Valdemir
10                 98              Igor Paixão
11              Bench                    Bench
12                 21           Rafael William
13                  5  Guillermo de los Santos
14                 15           Matías Galarza
15                 16                 Natanael
16                 18           Guilherme Biro
17                 19          Thonny Anderson
18                 28      Pablo Javier García
19                 32              Bruno Gomes
20                 44             Márcio Silva
21                 52          Adrián Martínez
22                 75             Luiz Gabriel
23                 88                     Hugo
___________
   Cuiabá (4-1-4-1)   Cuiabá (4-1-4-1).1
0                 1               Walter
1                 2           João Lucas
2                 3              Joaquim
3                 4       Marllon Borges
4                 5               Camilo
5                 6          Igor Cariús
6                 7              Alesson
7                 8      João Pedro Pepê
8                 9             Valdívia
9                10  Rodriguinho Marinho
10               11          Rafael Gava
11            Bench                Bench
12               12          João Carlos
13               13        Daniel Guedes
14               14               Paulão
15               15         Marcão Silva
16               16       Cristian Rivas
17               17       Gabriel Pirani
18               18              Jenison
19               19                André
20               20        Kelvin Osorio
21               21        Jonathan Cafu
22               22           André Luis
23               23       Felipe Marques
___________
          Coritiba           Cuiabá
        Possession       Possession
0              42%              58%
1  Shots on Target  Shots on Target
2     2 of 4 — 50%      0% — 0 of 8
3            Saves            Saves
4       0 of 0 — %     50% — 1 of 2
5            Cards            Cards
6              NaN              NaN
_____________
[....]
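
If the same script has to run over several seasons where some stats are missing from the page, the sibling lookup used in the answer's code can be wrapped in a small helper that returns None instead of raising. This is only a minimal sketch, assuming the value / label / value layout that the lookup relies on. `SAMPLE_HTML`, `stat_pair`, `made_and_attempted` and the English label 'Crosses' are made up for illustration; they are not part of FBref or BeautifulSoup, and the real markup may differ.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment imitating the value / label / value sibling layout;
# it stands in for the downloaded page so the sketch runs offline.
SAMPLE_HTML = (
    '<div id="team_stats_extra"><div>'
    '<div>16</div><div>Fouls</div><div>12</div>'
    '<div>4</div><div>Corners</div><div>4</div>'
    '</div></div>'
)


def stat_pair(soup, label):
    """Return (home, away) text for a label, or (None, None) if it is absent."""
    node = soup.find('div', string=label)
    if node is None or node.previous_sibling is None or node.next_sibling is None:
        return None, None
    return node.previous_sibling.text.strip(), node.next_sibling.text.strip()


def made_and_attempted(cell):
    """Pull the two counts out of a table cell that contains 'X of Y'."""
    found = re.search(r'(\d+) of (\d+)', str(cell))
    return (int(found.group(1)), int(found.group(2))) if found else (None, None)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Map page labels to the column prefixes from the question's dict.
# 'Crosses' is an assumed label and is missing from the sample on purpose.
row = {}
for label, key in [('Fouls', 'faltas'), ('Corners', 'escanteios'), ('Crosses', 'cruzamentos')]:
    row[key + '_casa'], row[key + '_fora'] = stat_pair(soup, label)

print(row)
print(made_and_attempted('2 of 4 \u2014 50%'))
```

Stats that are not on the page end up as None, so a one-row DataFrame built from `row` keeps the same columns for every season, and the "X of Y" cells from the team stats table can be turned into numbers without slicing by position.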