Search for a Word on website and get the next words in return

Time:01-27

I'm facing a problem that I can't solve on my own. I'll try to explain it as well as I can; I'm a complete beginner in Python and web scraping, so apologies in advance if anything is unclear:

I want to get the text "Herr Max Mustermann" from the website. This text changes from page to page. My plan was to search for the word "Position", which is the same on every page, and then get the words that follow it. Is that possible? Since I'm a beginner, I don't know the syntax needed or how to search for potential solutions.

Here is an HTML example:

Privatperson
<br>
<strong>Position</strong>
<br>
Herr Max Mustermann
<br>
Privatperson
<br>

And here is the code I have written so far:

from bs4 import BeautifulSoup
import requests
from lxml import etree

url = "https://www.firmenabc.at/austrian-airlines-ag_EES"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
dom = etree.HTML(str(soup))
Vorstand = (dom.xpath('//*[text()="Vorstand"]')[0].text)
print(Vorstand)

Thanks so much for your support! Cheers Peter Silie

CodePudding user response:

You could use the string argument to match the exact text of the tag, then navigate to your goal with .find_next_sibling(string=True):

soup.find('strong', string='Vorstand').find_next_sibling(string=True).strip()
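
A slightly more defensive variant, as a sketch only: if the markup has a line break between </strong> and the next <br> (as in the sample HTML from the question), the first text sibling may be a whitespace-only node, so the one-liner could return an empty string. The version below skips blank text nodes. The helper name text_after_label and the inline SAMPLE_HTML are made up for illustration and are not part of BeautifulSoup or the original post.

```python
from bs4 import BeautifulSoup, NavigableString

# Inline copy of the sample markup from the question, so no network request is needed.
SAMPLE_HTML = """
Privatperson
<br>
<strong>Position</strong>
<br>
Herr Max Mustermann
<br>
Privatperson
<br>
"""


def text_after_label(html, label):
    """Return the first non-blank text node following <strong>label</strong>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("strong", string=label)
    if tag is None:
        # Label not present on this page.
        return None
    for sibling in tag.next_siblings:
        # Skip tags such as <br> and text nodes that contain only whitespace.
        if isinstance(sibling, NavigableString) and sibling.strip():
            return sibling.strip()
    return None


result = text_after_label(SAMPLE_HTML, "Position")
print(result)
```

Returning None when the label is missing avoids the AttributeError that the chained one-liner would raise on a page without the label.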

Or, going with XPath, use /following-sibling::text()[1]:

dom = etree.HTML(requests.get(url).text)
[s.strip() for s in dom.xpath('//strong[text()="Vorstand"][1]/following-sibling::text()[1]')]
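
The same idea can be sketched in XPath with a normalize-space() predicate, which filters out whitespace-only text nodes before picking the first one. This is an illustration under the assumption that the page looks like the sample HTML from the question; the helper name text_after_label_xpath and the inline SAMPLE_HTML are hypothetical. The label is passed as an XPath variable ($label), which lxml accepts as a keyword argument to xpath().

```python
from lxml import etree

# Inline copy of the sample markup from the question, so no network request is needed.
SAMPLE_HTML = """
Privatperson
<br>
<strong>Position</strong>
<br>
Herr Max Mustermann
<br>
Privatperson
<br>
"""


def text_after_label_xpath(html, label):
    """Return the first non-blank text node following <strong>label</strong>, or None."""
    dom = etree.HTML(html)
    hits = dom.xpath(
        # [normalize-space()] keeps only text nodes with visible characters;
        # the trailing [1] takes the nearest such node after the <strong> tag.
        "//strong[text()=$label]/following-sibling::text()[normalize-space()][1]",
        label=label,
    )
    return hits[0].strip() if hits else None


xpath_result = text_after_label_xpath(SAMPLE_HTML, "Position")
print(xpath_result)
```

Passing the label as a variable rather than formatting it into the expression sidesteps quoting problems if the label itself contains quotes.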

Example

from bs4 import BeautifulSoup
import requests

url = "https://www.firmenabc.at/austrian-airlines-ag_EES"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Print the text node that follows <strong>Vorstand</strong>
print(soup.find('strong', string='Vorstand').find_next_sibling(string=True).strip())