I'm currently trying to scrape a website.
Here is part of my code:
import pandas as pd
from bs4 import BeautifulSoup
import requests
#Get the source code
r = requests.get('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard')
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())
for a in soup:
    print (soup.find("td", {"data-stat" : "avg_age"}).text)
Basically, I have the whole source code inside "soup". However, when I look up elements such as "td", {"data-stat" : "avg_age"}, I only get the result of the first row ({"data-row":"0"}) repeated as output:
29.1
29.1
29.1
29.1
29.1
So here are my questions:
-> Why is my code stuck on the first row when there is no preselection in my "soup" variable?
-> Is there a way to write a loop that checks the wanted elements for a different row each time? From "data-row":"0" to "data-row":"19", for instance.
Thanks for your support and have a great day!
CodePudding user response:
It's stuck in the first row for a couple of reasons:
- You are using .find(), which only returns the first element it finds in the soup object.
- You never iterate through anything: the loop variable is never used, so soup.find("td", {"data-stat" : "avg_age"}).text will always return the same thing. Look at your loop.
Essentially, this is the same logic as what you have there:
for x in [1, 2, 3, 4]:
    print(1)
As it iterates through that list, it just prints 1 each time, so you get 1 four times in your console.
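To make the difference between .find() and .find_all() concrete, here is a minimal sketch. The HTML string is a hypothetical miniature of the table (it is not fetched from the site); only the data-stat attribute names and the three ages are taken from the question and its output.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table standing in for the real page.
html = """
<table>
  <tr><th data-stat="team">Ajaccio</th><td data-stat="avg_age">29.1</td></tr>
  <tr><th data-stat="team">Angers</th><td data-stat="avg_age">26.8</td></tr>
  <tr><th data-stat="team">Auxerre</th><td data-stat="avg_age">29.4</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# .find() stops at the first match, no matter how many times it is called.
first = soup.find("td", {"data-stat": "avg_age"}).text

# .find_all() returns every match, so there is something to iterate over.
all_ages = [td.text for td in soup.find_all("td", {"data-stat": "avg_age"})]

print(first)
print(all_ages)
```

Calling .find() on the whole soup inside a loop repeats the first value; iterating over the result of .find_all() walks through each cell once.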
You need to get all the rows in soup with soup.find_all('tr'). Then, as you iterate, check whether the row contains a <td> tag with the attribute data-stat="avg_age"; only then call .find() on that row and get the text.
import pandas as pd
from bs4 import BeautifulSoup
import requests
#Get the source code
r = requests.get('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard')
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())
rows = soup.find_all('tr')
for a in rows:
    if a.find("td", {"data-stat" : "avg_age"}):
        print (a.find("td", {"data-stat" : "avg_age"}).text)
Output:
29.1
26.8
29.4
26.8
27.8
26.2
27.2
25.8
26.0
26.9
24.8
25.5
26.9
25.9
27.6
24.5
26.3
28.8
25.6
26.7
26.1
28.2
26.9
26.6
26.0
27.7
28.0
26.8
29.9
25.5
27.1
27.1
27.1
27.2
27.0
27.0
25.1
25.8
25.9
25.8
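The list above has forty values, which suggests the avg_age cell appears in more than one table on the page. If only one table is wanted, one option is to find that table first and iterate over its rows only. The sketch below assumes a hypothetical sample document: the table ids, the "Team X" row, and the helper name ages_by_team are invented for illustration, so check the real page source for the actual ids.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two tables that both contain avg_age cells.
html = """
<table id="table_for">
  <tr data-row="0"><th data-stat="team">Ajaccio</th><td data-stat="avg_age">29.1</td></tr>
  <tr data-row="1"><th data-stat="team">Angers</th><td data-stat="avg_age">26.8</td></tr>
</table>
<table id="table_against">
  <tr data-row="0"><th data-stat="team">Team X</th><td data-stat="avg_age">26.1</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def ages_by_team(table):
    """Map team name to average age for every data row of one table."""
    result = {}
    for row in table.find_all("tr"):
        team = row.find("th", {"data-stat": "team"})
        age = row.find("td", {"data-stat": "avg_age"})
        # Header rows have neither cell, so skip them.
        if team is not None and age is not None:
            result[team.text] = float(age.text)
    return result


# Scope the search to one table instead of the whole document.
table = soup.find("table", {"id": "table_for"})
ages = ages_by_team(table)
print(ages)
```

Because each lookup is done on the row rather than on the soup, every iteration reads a different row, which is what the data-row loop in the question was aiming for.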
As a side note, pandas' .read_html() uses an HTML parser (lxml or bs4) under the hood to parse <table> tags. Use that; it's far easier.
import pandas as pd
df = pd.read_html('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard', header=1)[0]
print(df)
Output:
Équipe # JC Âge Poss MJ ... xG.1 xA.1 xG xA npxG.1 npxG xA.1
0 Ajaccio 18 29.1 34.5 2 ... 0.59 0.14 0.73 0.20 0.34
1 Angers 18 26.8 55.0 2 ... 1.00 0.49 1.49 1.00 1.49
2 Auxerre 15 29.4 39.5 2 ... 0.43 0.43 0.85 0.43 0.85
3 Brest 18 26.8 42.5 2 ... 0.63 0.23 0.86 0.23 0.47
4 Clermont Foot 18 27.8 48.5 2 ... 0.17 0.07 0.24 0.17 0.24
5 Lens 16 26.2 63.0 2 ... 1.48 0.94 2.41 1.08 2.02
6 Lille 18 27.2 65.0 2 ... 2.02 1.65 3.66 2.02 3.66
7 Lorient 14 25.8 36.0 1 ... 0.37 0.26 0.63 0.37 0.63
8 Lyon 15 26.0 68.0 1 ... 1.52 0.49 2.00 0.73 1.22
9 Marseille 17 26.9 55.0 2 ... 1.10 0.89 1.99 1.10 1.99
10 Monaco 19 24.8 40.5 2 ... 2.75 1.21 3.96 2.36 3.57
11 Montpellier 19 25.5 47.5 2 ... 0.93 0.66 1.59 0.93 1.59
12 Nantes 16 26.9 40.5 2 ... 1.37 0.60 1.97 1.37 1.97
13 Nice 18 25.9 54.0 2 ... 0.49 0.40 0.88 0.49 0.88
14 Paris S-G 18 27.6 60.0 2 ... 3.05 1.76 4.81 2.27 4.03
15 Reims 18 24.5 43.0 2 ... 0.54 0.42 0.96 0.54 0.96
16 Rennes 17 26.3 65.0 2 ... 1.86 1.15 3.01 1.86 3.01
17 Strasbourg 18 28.8 49.5 2 ... 0.60 0.57 1.17 0.60 1.17
18 Toulouse 18 25.6 57.0 2 ... 0.58 0.58 1.15 0.58 1.15
19 Troyes 16 26.7 39.0 2 ... 0.91 0.23 1.14 0.52 0.75
[20 rows x 29 columns]
To print just the Âge column: print(df['Âge'])
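As a small illustration of working with that column, the sketch below builds a three-row frame by hand from values in the output above (it does not call read_html) and then pulls the ages out as a plain list or looks one up by team. The variable names are illustrative only.

```python
import pandas as pd

# Three rows copied from the output above; the real frame has 20 rows and 29 columns.
df = pd.DataFrame({
    "Équipe": ["Ajaccio", "Angers", "Auxerre"],
    "Âge": [29.1, 26.8, 29.4],
})

# Selecting one column gives a Series; .tolist() turns it into a Python list.
ages = df["Âge"].tolist()

# Indexing by team name allows a lookup per team.
by_team = df.set_index("Équipe")["Âge"]

print(ages)
print(by_team["Angers"])
```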