Every football player's Wikipedia page has an "infobox" in which the player's career is displayed.
My goal is to scrape only the highlighted data from the Wikipedia pages of football players.
So far I am able to output the player's "infobox" segment as text, like this. But the only information I want is the highlighted part.
How do I narrow the result so that I get only the highlighted text as my output?
If you think you might know the answer, please ask questions if necessary, because I find it hard to phrase my question well.
CodePudding user response:
The infobox table is a succession of <tr></tr> tags.
Overall, we are looking for the <tr></tr> tag located immediately after the one whose text is "Seniorlag*".
You could do it like this:
    import requests
    from bs4 import BeautifulSoup

    url = "https://sv.wikipedia.org/wiki/Lionel_Messi"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Locate the infobox table and collect its rows
    infobox = soup.find('table', {'class': 'infobox'})
    tr_tags = infobox.find_all('tr')

    for tr in tr_tags:
        # strip=True ignores whitespace around the header text
        if tr.get_text(strip=True) == "Seniorlag*":
            # Search for the following tr tag
            next_tr = tr.find_next_sibling('tr')
            print(next_tr.text)
Output
År2003–20042004–20052004–20212021–
Klubb Barcelona C Barcelona B Barcelona Paris Saint-Germain
SM (GM) 10 00(5)22 00(6)520 (474)39 0(13)
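The "row after the header row" idea can be tried offline before pointing it at Wikipedia. The following is a minimal sketch, assuming a small hand-written snippet in place of the real infobox; INFOBOX_SNIPPET and row_after_header are hypothetical names used only for illustration, and the real page markup is larger and may be nested differently.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the infobox (an assumption, not the real Wikipedia markup).
INFOBOX_SNIPPET = """
<table class="infobox">
  <tr><th>Ungdomslag</th></tr>
  <tr><td>Newell's Old Boys</td></tr>
  <tr><th> Seniorlag* </th></tr>
  <tr><td>Barcelona</td></tr>
</table>
"""


def row_after_header(html, header_text):
    """Return the text of the <tr> that follows the row whose text equals header_text.

    Returns None if the infobox, the header row, or the following row is missing.
    """
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return None
    for tr in infobox.find_all("tr"):
        # strip=True removes surrounding whitespace before comparing
        if tr.get_text(strip=True) == header_text:
            next_tr = tr.find_next_sibling("tr")
            return next_tr.get_text(" ", strip=True) if next_tr else None
    return None


print(row_after_header(INFOBOX_SNIPPET, "Seniorlag*"))
```

Returning None instead of raising keeps the helper usable on pages where the infobox or the "Seniorlag*" row does not exist.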
CodePudding user response:
In addition to @Vincent Lagache's approach, which answers the question well, you could also use CSS selectors to find your elements:
soup.select_one('tr:has(th:-soup-contains("Seniorlag")) tr').text
Use a dict comprehension and stripped_strings to extract the strings:
    {
        list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
        for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) tr table td')
    }
This results in a dict that is already structured and can therefore be easily reused, for example to create a DataFrame:
{'År': ['2003–2004', '2004–2005', '2004–2021', '2021–'], 'Klubb': ['Barcelona C', 'Barcelona B', 'Barcelona', 'Paris Saint-Germain'], 'SM (GM)': ['10', '(5)', '22', '(6)', '520 (474)', '39', '(13)']}
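To see how stripped_strings turns each cell into a header plus its values, here is a minimal sketch on a hand-written snippet. CELLS_SNIPPET and columns_from_cells are hypothetical names, the markup is an assumption rather than the real infobox, and the plain "td" selector stands in for the longer selector above. It uses an explicit loop instead of the comprehension so that stripped_strings is evaluated only once per cell.

```python
from bs4 import BeautifulSoup

# Hand-written cells: the first string of each <td> is treated as the column name.
CELLS_SNIPPET = """
<table>
  <tr>
    <td><b>År</b><br>2003–2004<br>2004–2021</td>
    <td><b>Klubb</b><br>Barcelona C<br>Barcelona</td>
  </tr>
</table>
"""


def columns_from_cells(html):
    """Map the first string of every <td> to the list of its remaining strings."""
    soup = BeautifulSoup(html, "html.parser")
    columns = {}
    for cell in soup.select("td"):
        strings = list(cell.stripped_strings)  # whitespace-only strings are skipped
        if strings:  # ignore empty cells
            columns[strings[0]] = strings[1:]
    return columns


print(columns_from_cells(CELLS_SNIPPET))
```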
Example
This example also includes some pre- and postprocessing steps, such as decompose() to eliminate unwanted tags, and splitting a column of tuples with pandas:
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    url = 'https://sv.wikipedia.org/wiki/Lionel_Messi'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    # Preprocessing: remove elements styled as hidden so their text is not extracted
    for hidden in soup.find_all(style='visibility:hidden;color:transparent;'):
        hidden.decompose()

    d = {
        list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
        for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) tr table td')
    }

    # Postprocessing: split every value on whitespace, then pair the tokens two by two
    d['SM (GM)'] = ' '.join(d['SM (GM)']).split()
    d['SM (GM)'] = list(zip(d['SM (GM)'][0::2], d['SM (GM)'][1::2]))

    df = pd.DataFrame(d)
    # Expand the tuples into separate SM and GM columns
    df[['SM', 'GM']] = pd.DataFrame(df['SM (GM)'].tolist(), index=df.index)
    df
Output
| | År | Klubb | SM (GM) | SM | GM |
|---|---|---|---|---|---|
| 0 | 2003–2004 | Barcelona C | ('10', '(5)') | 10 | (5) |
| 1 | 2004–2005 | Barcelona B | ('22', '(6)') | 22 | (6) |
| 2 | 2004–2021 | Barcelona | ('520', '(474)') | 520 | (474) |
| 3 | 2021– | Paris Saint-Germain | ('39', '(13)') | 39 | (13) |
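The pairing step for the "SM (GM)" column can be checked in isolation with plain Python, without requests or pandas. This is a minimal sketch that starts from the raw list shown in the dict output earlier; pair_up is a hypothetical helper name, and the final conversion to integers is an extra step not present in the answer, added on the assumption that numeric values are wanted.

```python
def pair_up(raw_values):
    """Split all values on whitespace and pair the resulting tokens two by two."""
    tokens = " ".join(raw_values).split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens, got %d" % len(tokens))
    return list(zip(tokens[0::2], tokens[1::2]))


# Raw values as extracted: note that '520 (474)' arrives as a single string.
raw = ['10', '(5)', '22', '(6)', '520 (474)', '39', '(13)']

pairs = pair_up(raw)
# Assumption: drop the parentheses and convert both parts to int.
stats = [(int(sm), int(gm.strip("()"))) for sm, gm in pairs]

print(pairs)
print(stats)
```

Joining and re-splitting first is what makes the pairing work: the raw list has seven items for four rows, and only after splitting on whitespace does every row contribute exactly two tokens.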