Find text with find_next_sibling, if it is sometimes hyperlinked and sometimes not

Time:01-26

This question is based on a similar question of mine (Search for a Word on website and get the next words in return)

I want to get the text "Herr Max Mustermann" from the website. This text changes from page to page. My plan was to search for the word "Position", which is the same on every page, and then get the words that follow it (the solution from the question mentioned above). Sometimes the text "Herr Max Mustermann" is wrapped in a hyperlink, and in that case I only get an empty output.

<br>
<strong>Geschäftsführer</strong>
<br>
<a data-toggle="modal" href="https://www.firmenabc.at/person/mustermann-max_jhgxzd" data-target="#shareholder">
    Herr Max Mustermann
    <i></i>
</a>
<br>
Privatperson

My idea would be to use an if statement:

if the next sibling of soup.find('strong', string='Vorstand') contains an a tag:
    ceo = the text of that a tag
else:
    ceo = soup.find('strong', string='Vorstand').find_next_sibling(string=True).strip()

Any ideas how to code it?

CodePudding user response:

There are several ways to deal with this issue; two of them are described here:

  • Use decompose() to remove all the br tags and then use the approach of @BEK (without decomposing, it won't find the a tag, because the next element is a br).

  • Select your elements more specifically, so that you start directly from the br sibling that follows the strong tag.

Example

CSS selectors are used here:

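from bs4 import BeautifulSoup
import requests

url = "https://www.firmenabc.at/austrian-airlines-ag_EES"
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# "+ br" selects the <br> that directly follows the matching <strong>
for e in soup.select('strong:-soup-contains("Aufsichtsrat") + br'):
    next_tag = e.find_next_sibling()
    if next_tag is not None and next_tag.name == 'a':
        # The name is hyperlinked: read the text of the <a>
        print(next_tag.text.strip())
    else:
        # The name is plain text: read the next text node
        print(e.find_next_sibling(string=True).strip())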
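
The first option (removing the br tags with decompose()) is not shown above, so here is a minimal sketch of how it might look. The helper name name_after_label and the two sample HTML strings are hypothetical; they are modelled on the snippet in the question, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the snippet in the question
LINKED_HTML = """
<div>
<strong>Geschäftsführer</strong><br>
<a href="https://www.firmenabc.at/person/mustermann-max_jhgxzd">Herr Max Mustermann</a><br>
Privatperson
</div>
"""

PLAIN_HTML = """
<div>
<strong>Geschäftsführer</strong><br>
Herr Max Mustermann<br>
Privatperson
</div>
"""


def name_after_label(html, label_text):
    """Return the name that follows <strong>label_text</strong>, or None."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove every <br> so that an <a>, if present, is the next tag sibling
    for br in soup.find_all("br"):
        br.decompose()

    label = soup.find("strong", string=label_text)
    if label is None:
        return None

    next_tag = label.find_next_sibling()
    if next_tag is not None and next_tag.name == "a":
        # Hyperlinked name: take the text of the <a>
        return next_tag.get_text(strip=True)

    # Plain name: take the first non-blank text node after the label
    text = label.find_next_sibling(string=lambda s: bool(s and s.strip()))
    return text.strip() if text else None


print(name_after_label(LINKED_HTML, "Geschäftsführer"))  # Herr Max Mustermann
print(name_after_label(PLAIN_HTML, "Geschäftsführer"))   # Herr Max Mustermann
```

Note that decompose() changes the parsed tree, so this is only suitable if the br tags are not needed for anything else afterwards.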

CodePudding user response:

You can use find_next_sibling() to check whether the strong tag with the text Vorstand is followed by an a tag.

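ceo_tag = soup.find('strong', string='Vorstand')
# The tag directly after the <strong> is a <br>, so continue from that <br>
br = ceo_tag.find_next_sibling('br')
next_sibling = br.find_next_sibling()

if next_sibling is not None and next_sibling.name == 'a':
    # Hyperlinked name
    ceo = next_sibling.text.strip()
else:
    # Plain-text name
    ceo = br.find_next_sibling(string=True).strip()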

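You can also use the next_sibling attribute instead of find_next_sibling() to get the node directly after the strong tag; unlike find_next_sibling(), it also returns text nodes. Note that next_element is not a drop-in replacement here: for a tag with content, it returns the tag's first child, which is the string inside the strong tag.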
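
To see how these navigation attributes behave, here is a small sketch that also walks over next_siblings without modifying the tree. The sample markup and the helper first_name_after are assumptions made for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, written without whitespace between the nodes
LINKED = '<p><strong>Vorstand</strong><br><a href="#">Herr Max Mustermann</a><br>Privatperson</p>'
PLAIN = '<p><strong>Vorstand</strong><br>Herr Max Mustermann<br>Privatperson</p>'


def first_name_after(label):
    """Return the first linked or plain-text name after the label tag."""
    for node in label.next_siblings:
        if isinstance(node, Tag):
            if node.name == "a":
                return node.get_text(strip=True)  # hyperlinked name
            if node.name == "strong":
                break  # reached the next label, stop searching
            continue  # skip <br> and any other tag
        text = node.strip()  # node is a text node here
        if text:
            return text
    return None


soup = BeautifulSoup(LINKED, "html.parser")
label = soup.find("strong", string="Vorstand")

# next_element descends into the tag: it is the string inside <strong>
print(str(label.next_element))  # Vorstand
# next_sibling stays on the same level: here it is the <br> tag
print(label.next_sibling.name)  # br

print(first_name_after(label))  # Herr Max Mustermann

plain_label = BeautifulSoup(PLAIN, "html.parser").find("strong", string="Vorstand")
print(first_name_after(plain_label))  # Herr Max Mustermann
```

Iterating over next_siblings handles both the hyperlinked and the plain-text case in one loop, at the cost of a little more code than the if/else variant.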
