How to narrow down the soup.find result and output only relevant text?

Time:01-08

Screenshot of the output I get from my code.

Every football player's Wikipedia page has a section called the "infobox", where the player's career is displayed.

My goal is to scrape only the highlighted data from the Wikipedia pages of football players.

I have gotten this far: I am able to output the infobox segment of the player's page as text, as in the screenshot. But the only information I want is the highlighted part.

How do I narrow the result so that I only get the highlighted text as my output?

If you think you might know the answer, please ask questions if necessary, because I find it hard to formulate my question well.

CodePudding user response:

The infobox table is a succession of <tr></tr> tags.
Overall, we are looking for the <tr></tr> tag located immediately after the one whose text is "Seniorlag*".

You could do it like this:

import requests
from bs4 import BeautifulSoup

url = "https://sv.wikipedia.org/wiki/Lionel_Messi"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# The career data lives in the table with class "infobox"
infobox = soup.find('table', {'class': 'infobox'})

tr_tags = infobox.find_all('tr')

for tr in tr_tags:
    # Compare the stripped text so surrounding whitespace does not break the match
    if tr.get_text(strip=True) == "Seniorlag*":
        # The career table is in the <tr> right after the header row
        next_tr = tr.find_next_sibling('tr')
        print(next_tr.text)
        break

Output

År2003–20042004–20052004–20212021–

Klubb Barcelona C Barcelona B Barcelona Paris Saint-Germain

SM (GM) 10 00(5)22 00(6)520 (474)39 0(13)
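
The same lookup can be tried offline before running it against the live page. The following is only a minimal sketch: SAMPLE_HTML is a made-up, heavily simplified stand-in for the real infobox markup, and row_after_header is a hypothetical helper name, not part of BeautifulSoup. It also guards against the table or the following row being missing.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the infobox markup (not the real page).
SAMPLE_HTML = """
<table class="infobox">
  <tr><th colspan="2">Lionel Messi</th></tr>
  <tr><th colspan="2"> Seniorlag* </th></tr>
  <tr><td>År 2003–2004</td><td>Klubb Barcelona C</td></tr>
  <tr><th colspan="2">Landslag</th></tr>
</table>
"""


def row_after_header(html, header_text):
    """Return the text of the <tr> directly after the row whose text is header_text."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return None  # no infobox on this page
    for tr in infobox.find_all("tr"):
        # strip=True ignores whitespace around the header text
        if tr.get_text(strip=True) == header_text:
            next_tr = tr.find_next_sibling("tr")
            if next_tr is not None:
                # join the cell texts with a single space
                return next_tr.get_text(" ", strip=True)
    return None


result = row_after_header(SAMPLE_HTML, "Seniorlag*")
print(result)
```

Returning None instead of raising makes it easier to loop over many player pages, where some may lack an infobox or the "Seniorlag*" row.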

CodePudding user response:

Just in addition to the approach of @Vincent Lagache, which answers the question well, you could also use CSS selectors to find your elements. The + is the adjacent sibling combinator, so this selects the <tr> directly after the header row:

soup.select_one('tr:has(th:-soup-contains("Seniorlag")) + tr').text

Use a dict comprehension with stripped_strings to extract the strings, taking the first string of each cell as the key and the rest as its values:

{
    list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}

This results in a dict that is already structured and can therefore be easily reused, for example to create a DataFrame:

{'År': ['2003–2004', '2004–2005', '2004–2021', '2021–'], 'Klubb': ['Barcelona C', 'Barcelona B', 'Barcelona', 'Paris Saint-Germain'], 'SM (GM)': ['10', '(5)', '22', '(6)', '520 (474)', '39', '(13)']}
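
To see how the selector and stripped_strings work together, here is a small offline sketch. The markup in SAMPLE_HTML is an invented, simplified approximation of the infobox (a header row followed by a row holding a nested table), so the real page may be structured differently. It also assumes a soupsieve version recent enough to support :-soup-contains().

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the infobox markup (not the real page).
SAMPLE_HTML = """
<table class="infobox">
  <tr><th>Seniorlag*</th></tr>
  <tr><td>
    <table><tr>
      <td>År<br>2003–2004<br>2004–2005</td>
      <td>Klubb<br>Barcelona C<br>Barcelona B</td>
    </tr></table>
  </td></tr>
  <tr><th>Landslag</th></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "+" is the adjacent-sibling combinator: take the <tr> right after the
# header row, then every <td> of the table nested inside that row.
cells = soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')

career = {}
for cell in cells:
    strings = list(cell.stripped_strings)
    # the first string is the column label, the rest are its values
    career[strings[0]] = strings[1:]

print(career)
```

Without the + (plain whitespace is the descendant combinator), the selector would instead look for a <tr> nested inside the header row, which does not exist in this sample.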

Example

This example also includes some pre- and postprocessing steps, such as decompose() to eliminate unwanted tags, and splitting the column of tuples into separate columns with pandas:

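import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://sv.wikipedia.org/wiki/Lionel_Messi'
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Drop elements styled as invisible so their text is not extracted
for hidden in soup.find_all(style='visibility:hidden;color:transparent;'):
    hidden.decompose()

# Map each column label (first string) to its values (remaining strings)
d = {
    list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}

# Flatten 'SM (GM)' into single tokens, then pair them up as (SM, GM)
d['SM (GM)'] = ' '.join(d['SM (GM)']).split()
d['SM (GM)'] = list(zip(d['SM (GM)'][0::2], d['SM (GM)'][1::2]))

df = pd.DataFrame(d)
# Split the tuple column into separate 'SM' and 'GM' columns
df[['SM', 'GM']] = pd.DataFrame(df['SM (GM)'].tolist(), index=df.index)

df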

Output

          År                Klubb           SM (GM)   SM     GM
0  2003–2004          Barcelona C     ('10', '(5)')   10    (5)
1  2004–2005          Barcelona B     ('22', '(6)')   22    (6)
2  2004–2021            Barcelona  ('520', '(474)')  520  (474)
3      2021–  Paris Saint-Germain    ('39', '(13)')   39   (13)
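
The pairing step from the example can be looked at in isolation, without pandas or a network request. This is just a sketch using the 'SM (GM)' list shown earlier as input. pair_matches_and_goals and to_int are hypothetical helper names, and the conversion to integers is an optional extra that assumes every token is a number, possibly wrapped in parentheses.

```python
def pair_matches_and_goals(tokens):
    """Flatten the strings into single tokens and pair them as (SM, GM)."""
    flat = " ".join(tokens).split()
    if len(flat) % 2:
        raise ValueError("expected an even number of SM/GM tokens")
    # even positions are SM values, odd positions are GM values
    return list(zip(flat[0::2], flat[1::2]))


def to_int(token):
    """Turn '520' or '(474)' into an int (assumes purely numeric content)."""
    return int(token.strip("()"))


raw = ['10', '(5)', '22', '(6)', '520 (474)', '39', '(13)']
pairs = pair_matches_and_goals(raw)
stats = [(to_int(sm), to_int(gm)) for sm, gm in pairs]

print(pairs)  # [('10', '(5)'), ('22', '(6)'), ('520', '(474)'), ('39', '(13)')]
print(stats)  # [(10, 5), (22, 6), (520, 474), (39, 13)]
```

Joining and re-splitting first is what makes the mixed input work: '520 (474)' arrives as one string while the other rows arrive as two, and after the split every row contributes exactly two tokens.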