How to narrow down the soup.find result and output only relevant text?

Every football player's Wikipedia page has an "infobox" that shows the player's career.

My goal is to scrape only the highlighted data from the Wikipedia pages of football players.

So far I am able to output the "infobox" segment for a player as text, like this. But the only information I want is the highlighted part.

How do I narrow the result so that I get only the highlighted text as my output?

If you think you might know the answer, please ask questions if necessary, because I find it hard to formulate my question well.

CodePudding user response：

The infobox table is a sequence of <tr></tr> tags.
What we are looking for is the <tr></tr> tag located immediately after the one whose text is "Seniorlag*".

You could do it like this:

import requests
from bs4 import BeautifulSoup

url = "https://sv.wikipedia.org/wiki/Lionel_Messi"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

infobox = soup.find('table', {'class': 'infobox'})

tr_tags = infobox.find_all('tr')

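for tr in tr_tags:
    # Compare the stripped text so surrounding whitespace cannot break the match
    if tr.get_text(strip=True) == "Seniorlag*":
        # The career data sits in the <tr> directly after the heading row
        next_tr = tr.find_next_sibling('tr')
        print(next_tr.text)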

Output

År2003–20042004–20052004–20212021–

Klubb Barcelona C Barcelona B Barcelona Paris Saint-Germain

SM (GM) 10 00(5)22 00(6)520 (474)39 0(13)
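
The same lookup can be wrapped in a small function so it can be reused for other players. This is only a sketch: the name row_after_heading and the inline sample_html are made up for illustration, and the real Wikipedia markup is more deeply nested than this sample. It returns None when the infobox or the heading row is missing instead of raising an error.

```python
from bs4 import BeautifulSoup


def row_after_heading(html, heading, table_class="infobox"):
    """Return the text of the <tr> that follows the row whose text equals `heading`."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_=table_class)
    if infobox is None:
        # Page has no table with the requested class
        return None
    for tr in infobox.find_all("tr"):
        if tr.get_text(strip=True) == heading:
            next_tr = tr.find_next_sibling("tr")
            # Join the text fragments with a space so values do not run together
            return next_tr.get_text(" ", strip=True) if next_tr else None
    return None


# Hypothetical, simplified stand-in for a downloaded page
sample_html = """
<table class="infobox">
  <tr><th>Seniorlag*</th></tr>
  <tr><td>2004–2021 Barcelona</td></tr>
  <tr><th>Landslag</th></tr>
</table>
"""

result = row_after_heading(sample_html, "Seniorlag*")
print(result)
```

For a live page you would pass requests.get(url).text as the html argument.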

CodePudding user response：

In addition to the approach of @Vincent Lagache, which answers the question well, you could also use CSS selectors to find your elements:

soup.select_one('tr:has(th:-soup-contains("Seniorlag")) + tr').text
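
To see how such a selector behaves without hitting the network, here is a minimal sketch on a made-up HTML fragment (sample_html is an assumption, not the real page). It uses the CSS adjacent-sibling combinator (+) to pick the row directly after the row whose <th> contains "Seniorlag", and it assumes a Beautiful Soup install whose bundled Soup Sieve supports :-soup-contains.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox fragment
sample_html = """
<table class="infobox">
  <tr><th>Seniorlag*</th></tr>
  <tr><td>Barcelona</td></tr>
  <tr><th>Landslag</th></tr>
  <tr><td>Argentina</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# tr:has(th:-soup-contains(...)) matches the heading row;
# "+ tr" then selects the <tr> that immediately follows it
row = soup.select_one('tr:has(th:-soup-contains("Seniorlag")) + tr')
club = row.get_text(strip=True)
print(club)

# select_one returns None when nothing matches, so check before using .text
missing = soup.select_one('tr:has(th:-soup-contains("Nothing")) + tr')
```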

Use a dict comprehension together with stripped_strings to extract the strings:

{
    list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}

This results in a dict that is already structured and can therefore be easily reused, for example to create a DataFrame:

{'År': ['2003–2004', '2004–2005', '2004–2021', '2021–'], 'Klubb': ['Barcelona C', 'Barcelona B', 'Barcelona', 'Paris Saint-Germain'], 'SM (GM)': ['10', '(5)', '22', '(6)', '520 (474)', '39', '(13)']}

Example

This example also includes some pre- and postprocessing steps, such as decompose() to eliminate unwanted tags and splitting the column of tuples with pandas:

import requests
import pandas as pd
from bs4 import BeautifulSoup

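url = 'https://sv.wikipedia.org/wiki/Lionel_Messi'
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(requests.get(url).text, 'html.parser')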

for hidden in soup.find_all(style='visibility:hidden;color:transparent;'):
    hidden.decompose()

d = {
    list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}

d['SM (GM)'] = ' '.join(d['SM (GM)']).split()
d['SM (GM)'] = list(zip(d['SM (GM)'][0::2], d['SM (GM)'][1::2]))

df = pd.DataFrame(d)
df[['SM', 'GM']] = pd.DataFrame(df['SM (GM)'].tolist(), index=df.index)

df

Output

	År	Klubb	SM (GM)	SM	GM
0	2003–2004	Barcelona C	('10', '(5)')	10	(5)
1	2004–2005	Barcelona B	('22', '(6)')	22	(6)
2	2004–2021	Barcelona	('520', '(474)')	520	(474)
3	2021–	Paris Saint-Germain	('39', '(13)')	39	(13)
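
The SM and GM values above are still strings, and the GM values keep their parentheses. If you want numbers instead, one possible follow-up step is sketched below in plain Python. The helper to_int is hypothetical, and the pairs list is copied from the output above rather than scraped.

```python
import re


def to_int(text):
    """Remove parentheses and whitespace, then convert the rest to int."""
    return int(re.sub(r"[()\s]", "", text))


# Pairs as they appear in the 'SM (GM)' column above
pairs = [("10", "(5)"), ("22", "(6)"), ("520", "(474)"), ("39", "(13)")]

numeric = [(to_int(sm), to_int(gm)) for sm, gm in pairs]
print(numeric)  # [(10, 5), (22, 6), (520, 474), (39, 13)]
```

With the DataFrame from the example, the same helper could be applied per column, for instance df['GM'].map(to_int).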