I'm facing a problem that I can't solve by myself. I'll try to explain it as well as I can (I'm a complete beginner in Python and web scraping, so apologies in advance if this is unclear):
I want to get the text "Herr Max Mustermann" from the website. This text changes from page to page. My plan was to search for the word "Position", which stays the same from page to page, and then get the text that follows it. Is that possible? Since I'm a beginner, I have no idea what syntax is needed or how to search for potential solutions.
Here is an HTML example:
Privatperson
<br>
<strong>Position</strong>
<br>
Herr Max Mustermann
<br>
Privatperson
<br>
And here is the code I have written so far:
from bs4 import BeautifulSoup
import requests
from lxml import etree
url = "https://www.firmenabc.at/austrian-airlines-ag_EES"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
dom = etree.HTML(str(soup))
Vorstand = (dom.xpath('//*[text()="Vorstand"]')[0].text)
print(Vorstand)
Thanks so much for your support! Cheers Peter Silie
CodePudding user response:
You could use the string argument to search for an exact match, then .find_next_sibling(string=True) to navigate to the text you want:
soup.find('strong', string='Vorstand').find_next_sibling(string=True).strip()
Or, going with XPath, use /following-sibling::text()[1]:
dom = etree.HTML(requests.get(url).text)
[s.strip() for s in dom.xpath('//strong[text()="Vorstand"][1]/following-sibling::text()[1]')]
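If the markup has line breaks between the tags, as in the HTML example from the question, the first following text node may be whitespace only. Here is a minimal sketch of the same XPath idea that skips blank text nodes. It runs on an inline sample instead of the live page; SAMPLE_HTML (wrapped in a div for a clean parse) and the helper name xpath_text_after_label are assumptions for illustration, and the real page may be structured differently.

```python
from lxml import etree

# Hypothetical sample based on the question's snippet; the real page may differ.
SAMPLE_HTML = """<div>
Privatperson
<br>
<strong>Position</strong>
<br>
Herr Max Mustermann
<br>
Privatperson
<br>
</div>"""


def xpath_text_after_label(html, label):
    """Return the first non-blank text after <strong>label</strong>, or None."""
    dom = etree.HTML(html)
    # [normalize-space()] filters out whitespace-only text nodes,
    # then [1] picks the first of the remaining ones.
    nodes = dom.xpath(
        '//strong[text()=$label]/following-sibling::text()[normalize-space()][1]',
        label=label,
    )
    return nodes[0].strip() if nodes else None


print(xpath_text_after_label(SAMPLE_HTML, "Position"))
```

Passing the label as an XPath variable ($label) avoids building the expression by string concatenation.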
Example
from bs4 import BeautifulSoup
import requests
url = "https://www.firmenabc.at/austrian-airlines-ag_EES"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find('strong', string='Vorstand').find_next_sibling(string=True).strip())
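For trying this out without hitting the website, here is a small offline sketch using the HTML snippet from the question. It walks the siblings after the label and returns the first one that is non-empty text, so whitespace between the tags does not get in the way, and it returns None if the label is missing instead of raising an error. SAMPLE_HTML and the helper name text_after_label are made up for illustration; adjust the label ("Position" or "Vorstand") to whatever the real page uses.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample copied from the question; the real page may differ.
SAMPLE_HTML = """Privatperson
<br>
<strong>Position</strong>
<br>
Herr Max Mustermann
<br>
Privatperson
<br>"""


def text_after_label(html, label):
    """Return the first non-blank text after <strong>label</strong>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("strong", string=label)
    if tag is None:
        # Label not found on this page.
        return None
    for sibling in tag.next_siblings:
        # Skip tags such as <br> and text nodes that are only whitespace.
        if isinstance(sibling, NavigableString) and sibling.strip():
            return sibling.strip()
    return None


print(text_after_label(SAMPLE_HTML, "Position"))
```

To use it on the live page, pass page.content from requests.get(url) as the html argument.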