Using Python & Selenium, how to extract the text from HTML containing the <p> tag

Time:02-17

I know this is a very simple question, but I'm quite sick, trying to finish up a presentation, and my brain just doesn't seem to be working right. How do I extract the text from the element below?

The HTML code is as follows:

<p>id="script_id">1</p>

CodePudding user response:

If the text you want is literally id="script_id">1, you can use the following:

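from selenium.webdriver.common.by import By
# find_element returns the first <p> on the page; .text is its visible text
x = driver.find_element(By.TAG_NAME, 'p').text
print(x)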

Output:

id="script_id">1

Process finished with exit code 0
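
To see why the output is id="script_id">1 rather than 1, the two variants of the markup can be compared outside the browser. The sketch below uses Python's standard html.parser instead of Selenium, which is an assumption made purely for illustration; the class and function names (ParagraphText, describe) are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the attributes and text of a <p> element (illustration only)."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.attrs = []   # attributes seen on the <p> start tag
        self.chunks = []  # text pieces found inside <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.attrs = attrs

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def describe(html):
    """Return (attributes, text) for the <p> element in the given markup."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.attrs, "".join(parser.chunks)


# Markup as posted: the tag closes before id=..., so id=... is plain text.
print(describe('<p>id="script_id">1</p>'))
# Markup as probably intended: id is an attribute and the text is just 1.
print(describe('<p id="script_id">1</p>'))
```

In the first case the element has no attributes at all, so a lookup by id cannot match it; only a tag-based locator will find it.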

Note, however, that p may occur in many places on a page, not just in this line, and find_element returns only the first match, so relying on the tag name alone is not advisable in a larger application. If this one line is all you need, it is fine; otherwise, look for other identifying features (a parent container, a class, an attribute) and build a locator from them.
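
One way to catch an ambiguous locator early is to check how many elements it matches before reading the text. This is a minimal sketch, not part of Selenium: text_of_unique is a hypothetical helper, and it relies only on the driver's find_elements(by, value) method and the element's text attribute.

```python
def text_of_unique(driver, by, value):
    """Return the text of the single element matching (by, value).

    Raises ValueError when the locator matches zero or several elements,
    so an over-broad locator such as a bare tag name fails loudly.
    """
    matches = driver.find_elements(by, value)  # a list, possibly empty
    if len(matches) != 1:
        raise ValueError(
            f"locator ({by!r}, {value!r}) matched {len(matches)} elements, expected 1"
        )
    return matches[0].text


# Possible usage with a real driver (By comes from selenium.webdriver.common.by):
#   text_of_unique(driver, By.TAG_NAME, "p")             # fails if the page has several <p>
#   text_of_unique(driver, By.CSS_SELECTOR, "#result p")  # "#result" is a hypothetical container
```

Scoping the locator to a container, as in the second usage line, is the kind of "connecting thing" that makes the match unique.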

If the line you provided contains a typo, i.e., script_id is actually the id attribute of p, as in <p id="script_id">1</p>, then this would do:

x = driver.find_element(By.ID, 'script_id').text
print(x)

CodePudding user response:

The HTML in its current form is invalid; ideally it should have been:

<p id="script_id">1</p>

To locate the element with text as 1 you can use either of the following Locator Strategies:

  • Using id:

    element = driver.find_element(By.ID, "script_id")
    
  • Using css_selector:

    element = driver.find_element(By.CSS_SELECTOR, "p#script_id")
    
  • Using xpath:

    element = driver.find_element(By.XPATH, "//p[@id='script_id']")
    

To extract the text 1 from the element, ideally wait for it to become visible using WebDriverWait with visibility_of_element_located(). You can then use any of the following:

  • Using ID and text attribute:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.ID, "script_id"))).text)
    
  • Using CSS_SELECTOR and get_attribute("innerText"):

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "p#script_id"))).get_attribute("innerText"))
    
  • Using XPATH and get_attribute("innerHTML") (this returns the element's inner markup, which here is just the text 1):

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//p[@id='script_id']"))).get_attribute("innerHTML"))
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python

