BeautifulSoup Only Retrieves 1st element

I'm currently trying to scrape a website.

Here is part of my code:

import pandas as pd
from bs4 import BeautifulSoup
import requests


#Get the source code
r = requests.get('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard')
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())

for a in soup:
    print (soup.find("td", {"data-stat" : "avg_age"}).text)

Basically, I have the whole source code inside "soup". However, when I look up elements such as "td", {"data-stat" : "avg_age"}, I only get the result of the first row ({"data-row":"0"}) repeated as output:

29.1
29.1
29.1
29.1
29.1

So here are my questions:

-> Why is my code stuck on the first row when there is no preselection in my "soup" variable?

-> Is there a way to write a loop that checks the wanted elements in a different row each time, from "data-row":"0" to "data-row":"19" for instance?

Thanks for your support and have a great day !

CodePudding user response：

It's stuck in the first row for a couple of reasons:

You are using .find(), which only returns the first element it finds in the soup object.
Your loop iterates over soup but never uses the loop variable a. soup.find("td", {"data-stat" : "avg_age"}).text searches the whole document on every pass, so it always returns the same thing.

Essentially this would be the same logic as you have there:

for x in [1, 2, 3, 4]:
    print(1)

As it iterates through that list, it just prints 1 each time, so you get 1 four times in your console.
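The difference between .find() and .find_all() can be reproduced offline with a small sketch. The HTML string below is a hypothetical stand-in for the real page (two rows only, with a made-up "team" cell), so nothing here depends on the site being reachable:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two rows, one avg_age cell each.
html = """
<table>
  <tr data-row="0"><th data-stat="team">Ajaccio</th><td data-stat="avg_age">29.1</td></tr>
  <tr data-row="1"><th data-stat="team">Angers</th><td data-stat="avg_age">26.8</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match, no matter how many times it is called.
first = soup.find("td", {"data-stat": "avg_age"}).text

# find_all() returns every match in document order.
ages = [td.text for td in soup.find_all("td", {"data-stat": "avg_age"})]

print(first)  # 29.1
print(ages)   # ['29.1', '26.8']
```

Calling soup.find(...) inside a loop repeats the first lookup each time, whereas find_all() hands back one element per matching cell.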

You need to get all the rows in soup with soup.find_all('tr'). Then, as you iterate over the rows, call .find() on each row, and only read the text if that row actually contains a <td> tag with the attribute data-stat="avg_age".

import pandas as pd
from bs4 import BeautifulSoup
import requests


#Get the source code
r = requests.get('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard')
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())


rows = soup.find_all('tr')
for row in rows:
    # Search inside this row only; find() returns None if the row has no such cell
    cell = row.find("td", {"data-stat" : "avg_age"})
    if cell:
        print(cell.text)

Output:

29.1
26.8
29.4
26.8
27.8
26.2
27.2
25.8
26.0
26.9
24.8
25.5
26.9
25.9
27.6
24.5
26.3
28.8
25.6
26.7
26.1
28.2
26.9
26.6
26.0
27.7
28.0
26.8
29.9
25.5
27.1
27.1
27.1
27.2
27.0
27.0
25.1
25.8
25.9
25.8
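If several values per row are wanted rather than just the age, one possible approach is to collect every cell that carries a data-stat attribute into a dictionary, one dictionary per row. This is only a sketch on a hypothetical HTML fragment: the "team" and "possession" attribute values are assumptions, and only "avg_age" is taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the "avg_age" data-stat name comes from the question.
html = """
<table>
  <tr><th>Équipe</th><th>Âge</th><th>Poss</th></tr>
  <tr data-row="0"><th data-stat="team">Ajaccio</th><td data-stat="avg_age">29.1</td><td data-stat="possession">34.5</td></tr>
  <tr data-row="1"><th data-stat="team">Angers</th><td data-stat="avg_age">26.8</td><td data-stat="possession">55.0</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for row in soup.find_all("tr"):
    # Keep only cells that have a data-stat attribute (attrs value True means "attribute present").
    cells = row.find_all(["th", "td"], attrs={"data-stat": True})
    if not cells:
        continue  # e.g. a header row without data-stat cells
    records.append({cell["data-stat"]: cell.text for cell in cells})

for record in records:
    print(record)
```

Each pass of the loop works on a different row, which is what the original loop was missing.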

As a side note, pandas' .read_html() parses <table> tags for you (by default it tries lxml first and falls back to bs4 with html5lib). Use that. It's far easier.

import pandas as pd

df = pd.read_html('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard', header=1)[0]

Output:

print(df)
           Équipe  # JC   Âge  Poss  MJ  ...  xG.1  xA.1  xG xA  npxG.1  npxG xA.1
0         Ajaccio    18  29.1  34.5   2  ...  0.59  0.14   0.73    0.20       0.34
1          Angers    18  26.8  55.0   2  ...  1.00  0.49   1.49    1.00       1.49
2         Auxerre    15  29.4  39.5   2  ...  0.43  0.43   0.85    0.43       0.85
3           Brest    18  26.8  42.5   2  ...  0.63  0.23   0.86    0.23       0.47
4   Clermont Foot    18  27.8  48.5   2  ...  0.17  0.07   0.24    0.17       0.24
5            Lens    16  26.2  63.0   2  ...  1.48  0.94   2.41    1.08       2.02
6           Lille    18  27.2  65.0   2  ...  2.02  1.65   3.66    2.02       3.66
7         Lorient    14  25.8  36.0   1  ...  0.37  0.26   0.63    0.37       0.63
8            Lyon    15  26.0  68.0   1  ...  1.52  0.49   2.00    0.73       1.22
9       Marseille    17  26.9  55.0   2  ...  1.10  0.89   1.99    1.10       1.99
10         Monaco    19  24.8  40.5   2  ...  2.75  1.21   3.96    2.36       3.57
11    Montpellier    19  25.5  47.5   2  ...  0.93  0.66   1.59    0.93       1.59
12         Nantes    16  26.9  40.5   2  ...  1.37  0.60   1.97    1.37       1.97
13           Nice    18  25.9  54.0   2  ...  0.49  0.40   0.88    0.49       0.88
14      Paris S-G    18  27.6  60.0   2  ...  3.05  1.76   4.81    2.27       4.03
15          Reims    18  24.5  43.0   2  ...  0.54  0.42   0.96    0.54       0.96
16         Rennes    17  26.3  65.0   2  ...  1.86  1.15   3.01    1.86       3.01
17     Strasbourg    18  28.8  49.5   2  ...  0.60  0.57   1.17    0.60       1.17
18       Toulouse    18  25.6  57.0   2  ...  0.58  0.58   1.15    0.58       1.15
19         Troyes    16  26.7  39.0   2  ...  0.91  0.23   1.14    0.52       0.75

[20 rows x 29 columns]

To print just the Âge column: print(df['Âge'])
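Once the data is in a DataFrame, the column can be used like any other pandas Series. The sketch below builds a tiny DataFrame by hand from the first three rows of the output above instead of fetching the page, so it is an illustration rather than the real table:

```python
import pandas as pd

# Hand-built stand-in for the scraped table (first three rows of the output above).
df = pd.DataFrame({
    "Équipe": ["Ajaccio", "Angers", "Auxerre"],
    "Âge": [29.1, 26.8, 29.4],
})

# The column as a plain Python list.
ages = df["Âge"].tolist()

# The team with the highest average age in this small sample.
oldest = df.loc[df["Âge"].idxmax(), "Équipe"]

print(ages)    # [29.1, 26.8, 29.4]
print(oldest)  # Auxerre
```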