I am taking my first steps with Selenium in Python and want to extract a certain value from a webpage. The value I need is the ID (Melde-ID), which is 355460. In the HTML I found the two lines containing this information:
<h3 _ngcontent-wwf-c32="" > Melde-ID: 355460 </h3><span _ngcontent-wwf-c32="">
<div _ngcontent-wwf-c27="" > Melde-ID </div><div _ngcontent-wwf-c27="" >
I have been searching for about two hours for which command to use, but I don't know what to actually look for in the HTML. The website is an HTML page with .js modules. Opening the URL with Selenium works.
(At first I tried BeautifulSoup but could not open the page because of some restriction. I verified that robots.txt does not disallow anything, but the error with BeautifulSoup was "Unfortunately, a problem occurred while forwarding your request to the backend server".)
I would be thankful for any advice and hope I have explained my issue clearly. The code I wrote in Jupyter Notebook with Selenium installed is as follows:
from selenium import webdriver
import codecs
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
url = "https://...."
driver = webdriver.Chrome('./chromedriver')
driver.implicitly_wait(0.5)
#maximize browser
driver.maximize_window()
#launch URL
driver.get(url)
#print(driver.page_source)
#Try 2
#print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[normalize-space()='Melde-ID']")))])
#close browser
driver.quit()
CodePudding user response:
From the information you shared, the element containing the desired information does not have a class attribute with the value Melde-ID. It has a class attribute with the value title and contains the text Melde-ID.
Also, you should use a WebDriverWait expected condition instead of driver.implicitly_wait(0.5).
With these changes your code can be something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = "https://...."
#Selenium 3 style; in newer Selenium 4 releases, pass the driver path through a Service object instead
driver = webdriver.Chrome('./chromedriver')
#explicit wait: polls for up to 20 seconds
wait = WebDriverWait(driver, 20)
#maximize browser
driver.maximize_window()
#launch URL
driver.get(url)
#wait until the element is visible, then read its text
content = wait.until(EC.visibility_of_element_located((By.XPATH, "//*[contains(@class,'title') and contains(.,'Melde-ID:')]"))).text
I added .text to extract the text from that web element.
Now content should contain the value Melde-ID: 355460.
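To illustrate why an explicit wait helps on a page that renders its content through JavaScript, here is a rough, Selenium-free sketch of the polling idea: call a condition repeatedly until it returns something truthy or a timeout expires. This is not Selenium's actual implementation; wait_until and heading_text are hypothetical names made up for this illustration, and the delayed heading only simulates a page that fills in its text a moment after loading.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if no truthy value arrives within `timeout` seconds.
    Hypothetical helper: it only mimics the idea of an explicit wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Stand-in for a page whose heading is rendered a short moment after loading.
start = time.monotonic()


def heading_text():
    # Returns None until 0.2 s have passed, then the rendered heading text.
    if time.monotonic() - start < 0.2:
        return None
    return "Melde-ID: 355460"


content = wait_until(heading_text, timeout=2.0, poll_interval=0.05)
print(content)
```

The point of the sketch is that the wait returns as soon as the condition is met and gives up after the timeout, so a generous limit such as 20 seconds does not slow down a page that loads quickly.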
CodePudding user response:
Given the HTML:
<h3 _ngcontent-wwf-c32="" > Melde-ID: 355460 </h3>
<span _ngcontent-wwf-c32="">
<div _ngcontent-wwf-c27="" > Melde-ID </div>
<div _ngcontent-wwf-c27="" >
To extract the text 355460, you need to induce WebDriverWait for visibility_of_element_located(), then split the resulting text on the : character and take the second part. You can use either of the following locator strategies:
Using CSS_SELECTOR and text attribute:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h3.title"))).text.split(':')[1])
Using XPATH and get_attribute("innerHTML"):
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[@class='title' and text()]"))).get_attribute("innerHTML").split(':')[1])
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
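One detail about the split approach: the part after the : character still carries the surrounding whitespace, so it is worth stripping it or matching the digits directly. A minimal sketch, assuming the element text has the form shown in the question; extract_melde_id is a hypothetical helper name, not part of Selenium:

```python
import re


def extract_melde_id(text):
    """Return the number following 'Melde-ID:' as an int, or None if absent."""
    match = re.search(r"Melde-ID:\s*(\d+)", text)
    return int(match.group(1)) if match else None


# innerHTML-style input with the padding spaces seen in the question's HTML
raw = " Melde-ID: 355460 "

# Variant 1: split on ':' and strip the leftover whitespace
print(raw.split(":")[1].strip())

# Variant 2: match the digits directly and convert them to an int
print(extract_melde_id(raw))
```

The regex variant also returns None for the second element from the question, whose text is just Melde-ID without a number, whereas split(':')[1] would raise an IndexError there.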