Get value from a website using selenium in python

Time:08-18

I am taking my first steps with Selenium in Python and want to extract a certain value from a webpage. The value I need to find on the webpage is the ID (Melde-ID), which is 355460. In the HTML I found the 2 lines containing my info:

<h3 _ngcontent-wwf-c32="" > Melde-ID: 355460 </h3><span _ngcontent-wwf-c32="">
<div _ngcontent-wwf-c27="" > Melde-ID </div><div _ngcontent-wwf-c27="" >

I have been searching websites for about 2 hours for which command to use, but I don't know what to actually search for in the HTML. The website is an HTML page with .js modules. Opening the URL through Selenium works.

(At first I tried using BeautifulSoup but was not able to open the page because of some restriction. I did verify that the robots.txt does not disallow anything, but the error with BeautifulSoup was "Unfortunately, a problem occurred while forwarding your request to the backend server".)

I would be thankful for any advice and hope I explained my issue clearly. The code I tried to create in Jupyter Notebook with Selenium installed is as follows:

from selenium import webdriver
import codecs
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

url = "https://...."
driver = webdriver.Chrome('./chromedriver')
driver.implicitly_wait(0.5)
#maximize browser
driver.maximize_window()
#launch URL
driver.get(url)
#print(driver.page_source)
#Try 2
#print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[normalize-space()='Melde-ID']")))])
#close browser
driver.quit()

CodePudding user response:

From the information you shared, the element containing the desired value does not have a class attribute with the value Melde-ID.
Its class attribute contains title, and its text contains Melde-ID.
Also, you should use a WebDriverWait with an expected condition instead of driver.implicitly_wait(0.5).
With these changes, your code can look something like this:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://...."
# assumes Selenium 4 or later, where the chromedriver path is passed through a Service object
driver = webdriver.Chrome(service=Service(executable_path='./chromedriver'))

# explicit wait: poll for up to 20 seconds until the condition holds
wait = WebDriverWait(driver, 20)

#maximize browser
driver.maximize_window()
#launch URL
driver.get(url)

# wait until an element with class 'title' and the text 'Melde-ID:' is visible, then read its text
content = wait.until(EC.visibility_of_element_located((By.XPATH, "//*[contains(@class,'title') and contains(.,'Melde-ID:')]"))).text

I added .text to extract the text from that web element.
Now content should contain the value Melde-ID: 355460.
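
If only the number is needed, the label can be stripped from content afterwards. The following is a minimal sketch, not part of the original answer: the helper name parse_melde_id is hypothetical, and the sample string stands in for whatever .text returns on the real page.

```python
import re

def parse_melde_id(text):
    """Return the digits following 'Melde-ID:' as an int, or None if absent."""
    match = re.search(r"Melde-ID:\s*(\d+)", text)
    return int(match.group(1)) if match else None

# Sample value standing in for the element's .text (assumption, taken from the question)
content = "Melde-ID: 355460"
print(parse_melde_id(content))
```

Returning None instead of raising keeps the call safe if the locator ever matches an element that lacks the number.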

CodePudding user response:

Given the HTML:

<h3 _ngcontent-wwf-c32="" > Melde-ID: 355460 </h3>
<span _ngcontent-wwf-c32="">
    <div _ngcontent-wwf-c27="" > Melde-ID </div>
    <div _ngcontent-wwf-c27="" >

To extract the text 355460, you need to induce WebDriverWait for visibility_of_element_located(), then split the element's text on the : character and print the second part. You can use either of the following locator strategies:

  • Using CSS_SELECTOR and text attribute:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h3.title"))).text.split(':')[1].strip())
    
  • Using XPATH and get_attribute("innerHTML"):

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[@class='title' and text()]"))).get_attribute("innerHTML").split(':')[1].strip())
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
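
One caveat with indexing into split(':')[1]: it raises IndexError when the matched text has no colon, which would happen if a locator picked up the label-only div (" Melde-ID ") from the HTML above. A hedged sketch of a more forgiving split follows; value_after_colon is a hypothetical helper, and the sample strings only imitate what .text and innerHTML might return.

```python
def value_after_colon(raw):
    """Return the trimmed text after the first ':' or None if there is no colon."""
    _, sep, after = raw.partition(":")
    if not sep:
        return None  # no colon found, e.g. the label-only div
    return after.strip()

# Sample strings (assumptions modelled on the HTML in the question)
print(value_after_colon("Melde-ID: 355460"))    # shape of the h3's .text
print(value_after_colon(" Melde-ID: 355460 "))  # shape of the h3's innerHTML
print(value_after_colon(" Melde-ID "))          # label-only div, no colon
```

str.partition always returns three parts, so the missing-colon case can be checked explicitly instead of failing with an exception.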
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
